Optimizing OR (union) operations in MySQL

MySQL Open Source Web development October 15th, 2008 by Eran Galperin

In my last post on database optimization, I focused on improving query performance by optimizing schema - exploring indexing strategies by reading the execution plan. In this post I'll show how different query structures can also have a major impact on performance.

Dealing with OR operators in a WHERE clause of an SQL statement in MySQL can be tricky. Up until recently, MySQL could only use one index per table referenced in a query. A multi-column index can be used for  filtering conditions with an AND operator (which is more restrictive by nature), but a condition added by OR must use a separate index because of the logical nature of the opertaor (a union, as opposed to an intersection that the AND represents).

MySQL 5.0 added the index_merge select type, which allows the query optimizer to possibly select several indexes from a single table and merge them to improve query performance. I say possibly, since leaving such decisions to the optimizer is risky at best. In fact, as I will show next, you are sometimes left with no indexes selected out of several possible options, resulting in a full table scan.

Continuing from my last post, I'll use a real-world example to show the different paths the queries optimizer can take when preparing an execution plan for our queries.

I'll actually be working with the same query I profiled last time, with a minor change. The relevant table structure is as follows:

--
-- Table structure for table `tasks`
--
 
CREATE TABLE IF NOT EXISTS `tasks` (
  `id` int(13) NOT NULL AUTO_INCREMENT,
  `list_id` int(13) NOT NULL,
  `user_id` int(13) NOT NULL,
  `task` varchar(255) collate utf8_unicode_ci NOT NULL,
  `due` timestamp NULL DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `checked` timestamp NULL DEFAULT NULL,
  `assigned` int(13) DEFAULT NULL,
  `done` enum('1') collate utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY  (`id`),
  KEY `user_id` (`user_id`,`due`),
  KEY `assigned` (`assigned`,`due`),
  KEY `due` (`due`)
) ENGINE=InnoDB

And the query itself:

SELECT `tasks`.`id`,
           `tasks`.`task`,
           UNIX_TIMESTAMP(due) AS `time`,
           `lists`.`name` AS `list_name`,
           `lists`.`id` AS `list_id`
FROM `tasks`
INNER JOIN `lists` ON lists.id=tasks.list_id
WHERE (tasks.user_id='1' OR assigned='1')
     AND (tasks.done IS NULL)
     AND (tasks.due IS NOT NULL)
ORDER BY `tasks`.`due` ASC

This query is almost the same as the one I worked with last time. The only difference being the first statement in the WHERE clause, involving an OR operator:

(tasks.user_id='1' OR assigned='1')

Just to understand what I'm doing here - The query selects from the tasks table with several filtering criteria. The last OR statement conveys the condition that the tasks selected either belong to specific user (i.e created by him) or assigned to him (those two possibly coincide).

For testing purposes I will be running the queries against a testing database I set up with plenty of mock data. The database is around 1Gb in total size, with the tasks table at about 1.8 million rows. It's not very large, but enough for significant data to be obtained while allowing relatively online tampering with schema (modifying keys takes only around 4 minutes to complete).

Running in original form computes as (average of 10 queries):

(304 total, Query took 6.84 sec)

6.84 sec is way too long for running a query in a typical web application. If you'd recall, I last got this query running at 0.0008 sec without the additional OR condition, meaning it is running at around 7,000 times slower (yikes). In such a case you would suspect the indexes are not selective enough, and sure enough an EXPLAIN reveals:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tasks index user_id,assigned,due due 5 NULL 1875432 Using where
1 SIMPLE lists eq_ref PRIMARY PRIMARY 4 tasks.list_id 1

MySQL is performing a regular index select and not an index_merge as we'd like, and doing that it selects the worst possible index - the only one that doesn't filter the result set. Sure enough, all 1.8M rows are scanned and the query is underperforming badly. Trying to force the issue, I first add an IGNORE INDEX(due) to remove it from the equation:

(304 total, Query took 1.41 sec)

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tasks ALL user_id,assigned NULL NULL NULL 1870838 Using where; Using filesort
1 SIMPLE lists eq_ref PRIMARY PRIMARY 4 tasks.list_id 1

The query optimizer has decided not to use an index at all, despite several good candidates. However, query performance is improved since a full table scan with no index is performed straight up. It's still way too slow, and we'd like it to use a filtering index so it can avoid a full table scan. Trying to force the other indexes doesn't work so we're left with no choice but to try alternative query structures.

Our first candidate is replacing our OR condition with two UNION'ed select statements (the inspiration is from a couple of posts over at the MySQL performance blog). Breaking the original query into a UNION form results in:

(SELECT `tasks`.`id`,
            `tasks`.`task`,
            UNIX_TIMESTAMP(due) AS `time`,
            `lists`.`name` AS `list_name`,
            `lists`.`id` AS `list_id`
FROM `tasks`
INNER JOIN `lists` ON lists.id=tasks.list_id
WHERE (tasks.done IS NULL)
    AND (tasks.due IS NOT NULL)
    AND (tasks.user_id='1')
) UNION ALL (
 SELECT `tasks`.`id`,
            `tasks`.`task`,
            UNIX_TIMESTAMP(due) AS `time`,
            `lists`.`name` AS `list_name`,
            `lists`.`id` AS `list_id`
FROM `tasks`
INNER JOIN `lists` ON lists.id=tasks.list_id
WHERE (tasks.done IS NULL)
    AND (tasks.due IS NOT NULL)
    AND (assigned='1' AND tasks.user_id!='1')
) ORDER BY `time` ASC
 

This unsightly looking query looks much more complex, but actually gets the job done:

(304 total, Query took 0.0021 sec)

A big improvement to say the least (~3500 faster than the original query).
The explain shows why:

1 PRIMARY tasks range user_id,due user_id 9 NULL 304 Using where
1 PRIMARY lists eq_ref PRIMARY PRIMARY 4 tasks.list_id 1
2 UNION tasks range user_id,assigned,due assigned 10 NULL 1 Using where
2 UNION lists eq_ref PRIMARY PRIMARY 4 tasks.list_id 1
NULL UNION RESULT <union1,2> ALL NULL NULL NULL NULL NULL Using filesort

Only 304 rows are scanned (as opposed to the entire 1.8M table before). Despite the filesort, we are in the area of usability for this query.

Another approach would be to determine why the index_merging isn't take place and how can we enforce it. By reducing the WHERE clause to include only the columns covered by an index, index_merging kicks in resulting in a similar performance as the union, albeit with less selective filtering:

(368 total, Query took 0.0018 sec)

We could filter our result set in the application, but I'd like to keep it in the SQL where it belongs. In order to integrate the rest of the filtering statements, I use an IN condition on another index I have available - the primary key. This results in:

SELECT `tasks`.`id`,
           `tasks`.`task`,
           UNIX_TIMESTAMP(due) AS `time`,
           `lists`.`name` AS `list_name`,
           `lists`.`id` AS `list_id`
FROM `tasks`
INNER JOIN `lists` ON lists.id=tasks.list_id
WHERE (tasks.user_id='1' OR assigned='1')
   AND tasks.id IN
     (SELECT id
      FROM tasks
      WHERE (tasks.done IS NULL)
        AND (tasks.due IS NOT NULL)
     )
ORDER BY `tasks`.`due` ASC
 

Which brings us back to the performance levels of the UNION form:

(304 total, Query took 0.0021 sec)

And we can see in the EXPLAIN that it is indeed an index_merge operation:

id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY tasks index_merge user_id,assigned user_id,assigned 4,5 NULL 369 Using sort_union(user_id,assigned); Using where; Using filesort
1 PRIMARY lists eq_ref PRIMARY PRIMARY 4 tasks.list_id 1
2 DEPENDENT SUBQUERY tasks unique_subquery PRIMARY,due PRIMARY 4 func 1 Using where

So there you have it. From 6.83 seconds to 0.0021 seconds with some tweaking to the query structure. I am still not completely satisfied with the fact that it's using filesort, but I couldn't get it to use another index for the operation. Without the sorting the query is twice as fast:

(304 total, Query took 0.0010 sec)

So if it ever becomes an issue I could move sorting to the application code. Hopefully by then merge_index is better implemented in MySQL.

Regarding the query strucutre - personally I use the subquery format since it is more compact and maintainable. It is always good however to be familiar with all the alternatives.

If you liked this article you should follow me on Twitter and/or share below:
  • syam kumar

    Thanks for this good explanation …

  • moreno

    if you’ve got 7000x faster execution times, i suppose its matter of the query cache. to compare execution times i prefer disable the query cache or make a stress test(10000 executions) with random arguments in user_id an assigned. In my stress test the OR variant works about 40% fasten than the UNION equivalent