Fetching specific rows from a group with MySQL

In a normalized database, we store separate entities in separate tables and define relationships between them, those being – one-to-one, one-to-many and many-to-many. We often want to fetch data which is split over two or more related tables, which is dead simple when fetching the extra data from the ‘one’ side of a relationship (ie, from a “parent” table):

SELECT product_images.src,products.name 
FROM product_images 
INNER JOIN products ON product_images.parent_id=products.id

Literally: Get the name of the parent product for each row in the products_images table.

Going the other direction is not as trivial. If we just want all the referenced rows in the child table, the query seen before would do the trick. But if we wanted a specific row to be found – for example, the latest inserted image, it becomes somewhat more complicated.

Lets explore a common scenario – fetching the latest comment for each row in a list of posts in a forum.


Latest posts from the devnet forums

Dealing with the following simplified schema for a forum:

CREATE TABLE `threads` (
   `id` INT NOT NULL  AUTO_INCREMENT,
   `title` VARCHAR( 250 ) NOT NULL ,
   `created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ,
   PRIMARY KEY ( `id` )
) ENGINE = InnoDB;

CREATE TABLE `posts` (
   `id` INT NOT NULL  AUTO_INCREMENT,
   `thread_id` INT NOT NULL ,
   `content` TEXT NOT NULL ,
   `created` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ,
   PRIMARY KEY ( `id` ) ,
   INDEX ( `thread_id` )
) ENGINE = InnoDB;

A naive approach to fetching 10 threads with the latest post for each thread would be something like:

SELECT threads.*, posts.* 
FROM threads 
LEFT JOIN posts ON posts.thread_id=threads.id
GROUP BY threads.id
ORDER BY posts.created DESC
LIMIT 10

This query might appear to work as long as there is only one post per thread. Once there are more, you will start getting unexpected results (go ahead, try it). Lets try and understand why:

In the SQL standard, a GROUP BY modifier can return only columns that are either aggregated or the same for the entire group. MySQL extends this and allows returning non unique or aggregated columns from the group, instead of throwing an error.

However, values from those columns could come from any row in the group and we have no control over which row it will be.

… all rows in each group should have the same values for the columns that are ommitted from the GROUP BY part. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.
The MySQL manual on GROUP BY modifiers

In addition, grouping occurs before the order clause is even applied. After the grouping, unpredictable values for the order column (posts.created) would exist for each group, making the sorting afterward somewhat meaningless (since we wanted the latest post for each thread).

A more experienced SQL user might think – why not just use an aggregate function instead? leading to the following query:

SELECT threads.*, posts.*, MAX(posts.created) AS maxcreated 
FROM threads 
LEFT JOIN posts ON posts.thread_id=threads.id
GROUP BY threads.id
ORDER BY maxcreated DESC
LIMIT 10

This is somewhat better, as it does retrieve the timestamp belonging to the latest post in each thread – however, surprisingly (or not, if you’ve been following closely) the rest of the non grouped columns from the threads table are still indeterminate and do not necessarily belong to the row with the latest timestamp.

The solution to this issue is not completely trivial, but it makes sense once you understand the limitations. The key is to use the GROUP BY statement to return the filtering criteria (latest timestamps from the posts table) but to avoid it affecting the result set directly.

First, the filtering criteria –

SELECT thread_id,MAX(created) AS created 
FROM posts 
GROUP BY posts.thread_id

Here we return only aggregate (latest timestamp) and unique (thread_id) data for each group. Now we have the timestamp of the latest post from each thread, along with the matching thread identifier.

We will move the filtering criteria into a subquery, and JOIN that against the threads and posts table respectively.

SELECT threads.*, posts.*
FROM threads
LEFT JOIN (
   SELECT thread_id,MAX(created) AS created FROM posts 
   GROUP BY thread_id
) AS latest ON latest.thread_id = threads.id
LEFT JOIN posts ON latest.created=posts.created AND 
latest.thread_id=posts.thread_id
ORDER BY posts.created DESC
LIMIT 10

Literally: Fetch rows from the threads table, (left) join that with the timestamp from the latest post in each thread, and then (left) join that with the posts table with the condition that the timestamp and thread_id match. This returns all the rows in the threads table along with the latest post from each thread.

People have a disposition for avoiding subqueries due to performance concerns, but in this case it is a derived subquery (also referred to as a derived table) which will compute once for the entire query as opposed to a correlated subquery which computes once per row ( = bad performance with many rows).

* Note – in real-world situations you would like to specify the columns to be returned in the result set

For more advanced usage, I suggest you read Baron Schwartz’s article on selecting the first/least/max row per group in SQL

To know when the next article is published, please subscribe to new articles using your Email below or follow me on Twitter.

Subscribe to Blog via Email

Enter your email address to receive notification about new posts.

  • Pingback: Generating graphs from MySQL table data « techfounder()

  • Nigel Peck

    Thanks for this. Nice article, and exactly what I’m looking for the solution to, but my posts table has 63,000 entries and the query takes almost 6 seconds, which I assume is the time taken for the sub-query figuring out the MAX post_time on all of the posts for each topic.

    I am trying to get the 15 most recent topics with posts. The only solution I can think of is to select the 100 most recent posts from any topic including duplicates, ordered by descending post time, and then remove duplicates in code from the set returned by that query.

    It means in some cases, where a lot of discussion has been occurring on a small number of topics, that there will not be enough left to give me 15 unique topics, but 6 seconds is obviously unacceptable.

    Other than that only solution I can think of is to maintain a table of latest posts for each topic.

    My query based on help from your article:

    SELECT
    `phpbb_topics`.`topic_id`,
    `phpbb_topics`.`forum_id`,
    `phpbb_topics`.`topic_title`,

    `phpbb_posts`.`post_id`,
    `phpbb_posts`.`poster_id`,
    `phpbb_posts`.`post_time`,

    `phpbb_forums`.`forum_name`

    FROM
    `phpbb_topics`

    LEFT JOIN

    (

    SELECT
    topic_id, MAX(post_time)
    AS
    post_time
    FROM
    phpbb_posts
    GROUP BY
    topic_id

    ) AS latest

    ON
    latest.topic_id = phpbb_topics.topic_id

    LEFT JOIN
    `phpbb_posts`

    ON
    latest.post_time = phpbb_posts.post_time

    AND
    latest.topic_id = phpbb_posts.topic_id

    INNER JOIN
    `phpbb_forums`

    ON
    `phpbb_forums`.`forum_id` = `phpbb_topics`.`forum_id`

    WHERE
    `phpbb_forums`.`auth_view` = ‘0’

    ORDER BY
    `phpbb_posts`.`post_time` DESC

    LIMIT
    15;

  • Nigel Peck

    Solved it by only using posts from the last month in the sub query:

    SELECT
    `phpbb_topics`.`topic_id`,
    `phpbb_topics`.`forum_id`,
    `phpbb_topics`.`topic_title`,

    `phpbb_posts`.`post_id`,
    `phpbb_posts`.`poster_id`,
    `phpbb_posts`.`post_time`,

    `phpbb_forums`.`forum_name`

    FROM
    `phpbb_topics`

    LEFT JOIN

    (

    SELECT
    topic_id, MAX(post_time)
    AS
    post_time
    FROM
    phpbb_posts
    WHERE
    FROM_UNIXTIME(post_time) >= DATE_SUB( CURDATE( ) , INTERVAL 1 MONTH )
    GROUP BY
    topic_id

    ) AS latest

    ON
    latest.topic_id = phpbb_topics.topic_id

    LEFT JOIN
    `phpbb_posts`

    ON
    latest.post_time = phpbb_posts.post_time

    AND
    latest.topic_id = phpbb_posts.topic_id

    INNER JOIN
    `phpbb_forums`

    ON
    `phpbb_forums`.`forum_id` = `phpbb_topics`.`forum_id`

    WHERE
    `phpbb_forums`.`auth_view` = ‘0’

    ORDER BY
    `phpbb_posts`.`post_time` DESC

    LIMIT
    15;

  • Nigel Peck

    Sorry but one more update, to be complete in case it is useful to others reading later. Didn’t like the previous solution. Turns out phpbb maintains the ID of the latest post against each, so it is a simple case of using that and no sub-query required.

    This is a very quick query compared to the others, and I am not sure how valid your comments about the sub-query you use being efficient are valid. It needs to do a hell of a lot of work, even if it does only do it once. My 30 days solution is a hack and equally inappropriate.

    Not sure how this could be done with a query in a proper and efficient manner but will leave that for now as my problem is solved!

    Here’s the query:

    SELECT
    `phpbb_topics`.`topic_id`,
    `phpbb_topics`.`forum_id`,
    `phpbb_topics`.`topic_title`,

    `phpbb_posts`.`post_id`,
    `phpbb_posts`.`poster_id`,
    `phpbb_posts`.`post_time`,

    `phpbb_forums`.`forum_name`

    FROM
    `phpbb_topics`

    INNER JOIN

    `phpbb_posts`

    ON
    `phpbb_posts`.`post_id` = `phpbb_topics`.`topic_last_post_id`

    INNER JOIN
    `phpbb_forums`

    ON
    `phpbb_forums`.`forum_id` = `phpbb_topics`.`forum_id`

    WHERE
    `phpbb_forums`.`auth_view` = ‘0’

    ORDER BY
    `phpbb_posts`.`post_time` DESC

    LIMIT
    15;

  • http://www.binpress.com Eran Galperin

    Hi Nigel,

    I’ll reply to all your comments together here, if you don’t mind.

    First, this is another tool for your toolset with SQL, not a hammer that can be used in any situation. It can be useful, but it’s not always appropriate. Having said that, 63k rows is a very small table. I’m using this techniques on tables with millions of rows, with queries that complete is several mili-seconds. All my comments come from real world experience, and are valid – I did not make this stuff up.

    Intuitively, I would say that there’s an indexing problem you need to take care of to make this query complete quickly. I would suggest running the subquery by itself and optimizing it before trying to optimize the larger query.

    Of course, not all such queries can be optimized, and in that case you’re better off trying something else. There’s no one-solution that fits all.

  • Nigel Peck

    Thanks Eran that’s great, will have a look at the indexing.