In my last post on database optimization, I focused on improving query performance by optimizing schema - exploring indexing strategies by reading the execution plan. In this post I'll show how different query structures can also have a major impact on performance.
Dealing with OR operators in a WHERE clause of an SQL statement in MySQL can be tricky. Up until recently, MySQL could only use one index per table referenced in a query. A multi-column index can be used for filtering conditions with an AND operator (which is more restrictive by nature), but a condition added by OR must use a separate index because of the logical nature of the opertaor (a union, as opposed to an intersection that the AND represents).
MySQL 5.0 added the index_merge select type, which allows the query optimizer to possibly select several indexes from a single table and merge them to improve query performance. I say possibly, since leaving such decisions to the optimizer is risky at best. In fact, as I will show next, you are sometimes left with no indexes selected out of several possible options, resulting in a full table scan.
Continuing from my last post, I'll use a real-world example to show the different paths the queries optimizer can take when preparing an execution plan for our queries.
I'll actually be working with the same query I profiled last time, with a minor change. The relevant table structure is as follows:
-- -- Table structure for table `tasks` -- CREATE TABLE IF NOT EXISTS `tasks` ( `id` int(13) NOT NULL AUTO_INCREMENT, `list_id` int(13) NOT NULL, `user_id` int(13) NOT NULL, `task` varchar(255) collate utf8_unicode_ci NOT NULL, `due` timestamp NULL DEFAULT NULL, `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, `checked` timestamp NULL DEFAULT NULL, `assigned` int(13) DEFAULT NULL, `done` enum('1') collate utf8_unicode_ci DEFAULT NULL, PRIMARY KEY (`id`), KEY `user_id` (`user_id`,`due`), KEY `assigned` (`assigned`,`due`), KEY `due` (`due`) ) ENGINE=InnoDB
And the query itself:
SELECT `tasks`.`id`, `tasks`.`task`, UNIX_TIMESTAMP(due) AS `time`, `lists`.`name` AS `list_name`, `lists`.`id` AS `list_id` FROM `tasks` INNER JOIN `lists` ON lists.id=tasks.list_id WHERE (tasks.user_id='1' OR assigned='1') AND (tasks.done IS NULL) AND (tasks.due IS NOT NULL) ORDER BY `tasks`.`due` ASC
This query is almost the same as the one I worked with last time. The only difference being the first statement in the WHERE clause, involving an OR operator:
(tasks.user_id='1' OR assigned='1')
Just to understand what I'm doing here - The query selects from the tasks table with several filtering criteria. The last OR statement conveys the condition that the tasks selected either belong to specific user (i.e created by him) or assigned to him (those two possibly coincide).
For testing purposes I will be running the queries against a testing database I set up with plenty of mock data. The database is around 1Gb in total size, with the tasks table at about 1.8 million rows. It's not very large, but enough for significant data to be obtained while allowing relatively online tampering with schema (modifying keys takes only around 4 minutes to complete).
Running in original form computes as (average of 10 queries):
(304 total, Query took 6.84 sec)
6.84 sec is way too long for running a query in a typical web application. If you'd recall, I last got this query running at 0.0008 sec without the additional OR condition, meaning it is running at around 7,000 times slower (yikes). In such a case you would suspect the indexes are not selective enough, and sure enough an EXPLAIN reveals:
MySQL is performing a regular index select and not an index_merge as we'd like, and doing that it selects the worst possible index - the only one that doesn't filter the result set. Sure enough, all 1.8M rows are scanned and the query is underperforming badly. Trying to force the issue, I first add an IGNORE INDEX(due) to remove it from the equation:
(304 total, Query took 1.41 sec)
|1||SIMPLE||tasks||ALL||user_id,assigned||NULL||NULL||NULL||1870838||Using where; Using filesort|
The query optimizer has decided not to use an index at all, despite several good candidates. However, query performance is improved since a full table scan with no index is performed straight up. It's still way too slow, and we'd like it to use a filtering index so it can avoid a full table scan. Trying to force the other indexes doesn't work so we're left with no choice but to try alternative query structures.
Our first candidate is replacing our OR condition with two UNION'ed select statements (the inspiration is from a couple of posts over at the MySQL performance blog). Breaking the original query into a UNION form results in:
(SELECT `tasks`.`id`, `tasks`.`task`, UNIX_TIMESTAMP(due) AS `time`, `lists`.`name` AS `list_name`, `lists`.`id` AS `list_id` FROM `tasks` INNER JOIN `lists` ON lists.id=tasks.list_id WHERE (tasks.done IS NULL) AND (tasks.due IS NOT NULL) AND (tasks.user_id='1') ) UNION ALL ( SELECT `tasks`.`id`, `tasks`.`task`, UNIX_TIMESTAMP(due) AS `time`, `lists`.`name` AS `list_name`, `lists`.`id` AS `list_id` FROM `tasks` INNER JOIN `lists` ON lists.id=tasks.list_id WHERE (tasks.done IS NULL) AND (tasks.due IS NOT NULL) AND (assigned='1' AND tasks.user_id!='1') ) ORDER BY `time` ASC
This unsightly looking query looks much more complex, but actually gets the job done:
(304 total, Query took 0.0021 sec)
A big improvement to say the least (~3500 faster than the original query).
The explain shows why:
|NULL||UNION RESULT||<union1,2>||ALL||NULL||NULL||NULL||NULL||NULL||Using filesort|
Only 304 rows are scanned (as opposed to the entire 1.8M table before). Despite the filesort, we are in the area of usability for this query.
Another approach would be to determine why the index_merging isn't take place and how can we enforce it. By reducing the WHERE clause to include only the columns covered by an index, index_merging kicks in resulting in a similar performance as the union, albeit with less selective filtering:
(368 total, Query took 0.0018 sec)
We could filter our result set in the application, but I'd like to keep it in the SQL where it belongs. In order to integrate the rest of the filtering statements, I use an IN condition on another index I have available - the primary key. This results in:
SELECT `tasks`.`id`, `tasks`.`task`, UNIX_TIMESTAMP(due) AS `time`, `lists`.`name` AS `list_name`, `lists`.`id` AS `list_id` FROM `tasks` INNER JOIN `lists` ON lists.id=tasks.list_id WHERE (tasks.user_id='1' OR assigned='1') AND tasks.id IN (SELECT id FROM tasks WHERE (tasks.done IS NULL) AND (tasks.due IS NOT NULL) ) ORDER BY `tasks`.`due` ASC
Which brings us back to the performance levels of the UNION form:
(304 total, Query took 0.0021 sec)
And we can see in the EXPLAIN that it is indeed an index_merge operation:
|1||PRIMARY||tasks||index_merge||user_id,assigned||user_id,assigned||4,5||NULL||369||Using sort_union(user_id,assigned); Using where; Using filesort|
|2||DEPENDENT SUBQUERY||tasks||unique_subquery||PRIMARY,due||PRIMARY||4||func||1||Using where|
So there you have it. From 6.83 seconds to 0.0021 seconds with some tweaking to the query structure. I am still not completely satisfied with the fact that it's using filesort, but I couldn't get it to use another index for the operation. Without the sorting the query is twice as fast:
(304 total, Query took 0.0010 sec)
So if it ever becomes an issue I could move sorting to the application code. Hopefully by then merge_index is better implemented in MySQL.
Regarding the query strucutre - personally I use the subquery format since it is more compact and maintainable. It is always good however to be familiar with all the alternatives.