In a normalized database, we store separate entities in separate tables and define relationships between them, those being - one-to-one, one-to-many and many-to-many. We often want to fetch data which is split over two or more related tables, which is dead simple when fetching the extra data from the 'one' side of a relationship (ie, from a "parent" table):
SELECT product_images.src,products.name FROM product_images INNER JOIN products ON product_images.parent_id=products.id
Literally: Get the name of the parent product for each row in the products_images table.
Going the other direction is not as trivial. If we just want all the referenced rows in the child table, the query seen before would do the trick. But if we wanted a specific row to be found - for example, the latest inserted image, it becomes somewhat more complicated.
Lets explore a common scenario - fetching the latest comment for each row in a list of posts in a forum.
Latest posts from the devnet forums
Dealing with the following simplified schema for a forum:
CREATE TABLE `threads` ( `id` INT NOT NULL AUTO_INCREMENT, `title` VARCHAR( 250 ) NOT NULL , `created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP , PRIMARY KEY ( `id` ) ) ENGINE = InnoDB; CREATE TABLE `posts` ( `id` INT NOT NULL AUTO_INCREMENT, `thread_id` INT NOT NULL , `content` TEXT NOT NULL , `created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP , PRIMARY KEY ( `id` ) , INDEX ( `thread_id` ) ) ENGINE = InnoDB;
A naive approach to fetching 10 threads with the latest post for each thread would be something like:
SELECT threads.*, posts.* FROM threads LEFT JOIN posts ON posts.thread_id=threads.id GROUP BY threads.id ORDER BY posts.created DESC LIMIT 10
This query might appear to work as long as there is only one post per thread. Once there are more, you will start getting unexpected results (go ahead, try it). Lets try and understand why:
In the SQL standard, a GROUP BY modifier can return only columns that are either aggregated or the same for the entire group. MySQL extends this and allows returning non unique or aggregated columns from the group, instead of throwing an error.
However, values from those columns could come from any row in the group and we have no control over which row it will be.
... all rows in each group should have the same values for the columns that are ommitted from the GROUP BY part. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.
The MySQL manual on GROUP BY modifiers
In addition, grouping occurs before the order clause is even applied. After the grouping, unpredictable values for the order column (posts.created) would exist for each group, making the sorting afterward somewhat meaningless (since we wanted the latest post for each thread).
A more experienced SQL user might think - why not just use an aggregate function instead? leading to the following query:
SELECT threads.*, posts.*, MAX(posts.created) AS maxcreated FROM threads LEFT JOIN posts ON posts.thread_id=threads.id GROUP BY threads.id ORDER BY maxcreated DESC LIMIT 10
This is somewhat better, as it does retrieve the timestamp belonging to the latest post in each thread - however, surprisingly (or not, if you've been following closely) the rest of the non grouped columns from the threads table are still indeterminate and do not necessarily belong to the row with the latest timestamp.
The solution to this issue is not completely trivial, but it makes sense once you understand the limitations. The key is to use the GROUP BY statement to return the filtering criteria (latest timestamps from the posts table) but to avoid it affecting the result set directly.
First, the filtering criteria -
SELECT thread_id,MAX(created) AS created FROM posts GROUP BY posts.thread_id
Here we return only aggregate (latest timestamp) and unique (thread_id) data for each group. Now we have the timestamp of the latest post from each thread, along with the matching thread identifier.
We will move the filtering criteria into a subquery, and JOIN that against the threads and posts table respectively.
SELECT threads.*, posts.* FROM threads LEFT JOIN ( SELECT thread_id,MAX(created) AS created FROM posts GROUP BY thread_id ) AS latest ON latest.thread_id = threads.id LEFT JOIN posts ON latest.created=posts.created AND latest.thread_id=posts.thread_id ORDER BY posts.created DESC LIMIT 10
Literally: Fetch rows from the threads table, (left) join that with the timestamp from the latest post in each thread, and then (left) join that with the posts table with the condition that the timestamp and thread_id match. This returns all the rows in the threads table along with the latest post from each thread.
People have a disposition for avoiding subqueries due to performance concerns, but in this case it is a derived subquery (also referred to as a derived table) which will compute once for the entire query as opposed to a correlated subquery which computes once per row ( = bad performance with many rows).
* Note - in real-world situations you would like to specify the columns to be returned in the result set
For more advanced usage, I suggest you read Baron Schwartz's article on selecting the first/least/max row per group in SQL