Christopher Doyle

SQL: finding latest records within groups

Feb 17, 2024

Setup

A common task in databases is to find the latest X within each group Y. Let’s take the StackOverflow2010 database, and try o find the latest posts for each user. The Posts and Users tables look like:

SELECT TOP 100
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
DisplayName Title CreationDate
Eggs McLaren Convert Decimal to Double? 2008-07-31 21:42:52.667
Kevin Dente Percentage width child element in absolutely positioned parent on Internet Explorer 7 2008-07-31 22:08:08.620
Kevin Dente NULL 2008-07-31 22:17:57.883
Jeff Atwood “How do I calculate someone’s age in C#?” 2008-07-31 23:40:59.743
Jeff Atwood Calculate relative time in C# 2008-07-31 23:55:37.967

We can find the latest post for a user like this:

SELECT TOP 1
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
WHERE Posts.OwnerUserId = ?
ORDER BY Posts.CreationDate DESC

But what if we want the latest post for every user? If we take away the where clause, we still only get one row; a different tactic is needed. Actually, there is a way to use TOP 1 like this, more on that later.

CTE and MAX

A common solution is to use a subquery or CTE to find the MAX, then join back onto Posts like this:

WITH CTE_LatestPost AS (
    SELECT
        OwnerUserId,
        MaxCreationDate = MAX(CreationDate)
    FROM Posts
    GROUP BY OwnerUserId
)
SELECT
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
INNER JOIN CTE_LatestPost
    ON Posts.OwnerUserId = CTE_LatestPost.OwnerUserId
    AND Posts.CreationDate = CTE_LatestPost.MaxCreationDate

This will work, but joining on CreationDate is not ideal as there could be two rows with the same creation date for a user — unlikely in this dataset, but true in general. We can ue Id instead, but then we are relying on Id increasing directly with CreationDate. Depending on the dataset this might not be possible.

It’s also quite slow. This query takes about 20 seconds on my laptop. A big part of the reason for the slowness is the self-join on Posts.

Temp table

Self-joins like this are generally black holes of performance for SQL Server. It doesn’t always know how to optimize the query well, it’s not obvious what you are doing. In particular it won’t realize that the CTE is quite small, and that it could evaluate it first on it’s own. We can force it to do this by putting the results into a temp table:

SELECT
    OwnerUserId,
    MaxCreationDate = MAX(CreationDate)
INTO #LatestPost
FROM Posts
GROUP BY OwnerUserId

SELECT
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
INNER JOIN #LatestPost
    ON Posts.OwnerUserId = #LatestPost.OwnerUserId
    AND Posts.CreationDate = #LatestPost.MaxCreationDate;

This takes even longer at 24 seconds! Unfortunately, in this case, #LatestPost is rather large and takes 10 seconds to populate on its own. In other examples, this solution can net a quick and easy performance benefit.

Windowing Functions

Now we use ROW_NUMBER. This keeps track of the row number within each group that we specify, in some order that we specify. Since we order by CreationDate DESC, if we look at just RN = 1, then we have the latest occurrences. This also avoids draws, though the draw winner is arbitrary and non-deterministic. We could order by another (unique) column as well (such as Id) to solve this.

WITH CTE_LatestPost AS (
    SELECT
        OwnerUserId,
        Title,
        CreationDate,
        RN = ROW_NUMBER() OVER (
            PARTITION BY OwnerUserId
            ORDER BY CreationDate DESC
        )
    FROM Posts
)
SELECT
    Users.DisplayName,
    CTE_LatestPost.Title,
    CTE_LatestPost.CreationDate
FROM CTE_LatestPost
INNER JOIN Users ON CTE_LatestPost.OwnerUserId = Users.Id
WHERE CTE_LatestPost.RN = 1

This takes only 12 seconds.

The advantage of this query is that we don’t need a self-join. In the first query, we are joining Posts onto Posts. Here we can select all the columns from Posts that we need inside the CTE, so there is no need to join back.

With Ties + Windowing Function

Perhaps you feel needing a CTE/subquery is still a bit cumbersome. Here is a solution in a single query that uses the ROW_NUMBER partitioning with TOP 1 WITH TIES to give us the latest within each group.

SELECT TOP 1 WITH TIES
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
ORDER BY ROW_NUMBER() OVER (
    PARTITION BY Posts.OwnerUserId
    ORDER BY Posts.CreationDate DESC
)

TOP 1 WITH TIES gives us the latest from the ORDER BY, and if there are ties for first place then it gives us all of them. Our ORDER BY is designed to give ties, it will rank the latest post for each OwnerUserId as joint first in the ORDER BY.

This is a slightly slower solution at about 14 seconds.

Adding an index

Until now, the only index on Posts was a unique index on Id, which we are not using.

Let’s create an index for our workload:

CREATE NONCLUSTERED INDEX IX_Posts_CreationDate
    ON Posts (CreationDate)
    INCLUDE (Title, OwnerUserId)

Rerunning the timing tests we get:

Looking closer: IO usage

The temp table solution is slightly faster than our windowing solution. Now let’s compare their IO usage using SET STATISTICS TIME:

Table 'Posts'. Scan count 6, logical reads 24809, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Users'. Scan count 0, logical reads 785856, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 24547, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table '#LatestPost_________________________________________________________________________________________________________000000000004'. Scan count 1, logical reads 681, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Posts'. Scan count 6, logical reads 24781, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Users'. Scan count 6, logical reads 7769, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

So while the temp table solution is quicker, it’s hiting the tables significantly harder. On the other hand, the CPU times (measure of CPU load) of the temp table is much lower than the windowing function. Overall this is a tradeoff between CPU and IO. Which method is faster will depend on the particular workflow and hardware. Which method is best will depend on what else you have running on these tables. If you are already IO constrained, it might be preferable to take a slightly slower overall query that has much lower CPU load.

Further refinement: temp table indexes

We can get another small benefit (about 2.5 second total time) by adding a primary key to our temp table:

CREATE TABLE #LatestPost (
    OwnerUserId int NOT NULL PRIMARY KEY,
    MaxCreationDate datetime NOT NULL
)

INSERT INTO #LatestPost (OwnerUserId, MaxCreationDate)
SELECT
    OwnerUserId,
    MaxCreationDate = MAX(CreationDate)
FROM Posts
GROUP BY OwnerUserId

SELECT
    Users.DisplayName,
    Posts.Title,
    Posts.CreationDate
FROM Posts
INNER JOIN Users ON Posts.OwnerUserId = Users.Id
INNER JOIN #LatestPost
    ON Posts.OwnerUserId = #LatestPost.OwnerUserId
    AND Posts.CreationDate = #LatestPost.MaxCreationDate;

We could also add an index on MaxCreationDate but this probably won’t have a huge impact because of the time required to build it. Sometimes the cost of adding a primary key is also not worth it, again it depends on workload.

This leads to my final thought on optimizing this solution: engineering. Compared to other solutions, we now have a solution which is 3 statements (CREATE, INSERT, SELECT) compared to 1 (SELECT). It is also a bit harder to maintain: we need to keep the temp table schema in sync with the Posts table, in other words our solution is more tightly coupled to the source tables. The WITH TIES solution is slick and concise, but how many people on your team will understand it? These things might not matter in a toy problem, but if this strategy is used extensively, it can become burdonsome to maintain. It can also lead to errors: if CreationDate is changed from a datetime to a date (it is called a Date, but this is encoding a time; in other datasets, particularly older ones, it is not uncommon to see pure dates in a datetime column), and the temp table schema is not updated, then you are know adding a small cast to the insert, and another cast to the join in the select. Sometimes this penalty is imperceptible, but it could have a bigger impact then the savings of this solution. The best solution is not always the one that is quickest right now, and sometimes a slightly worse solution is better in the long run.


← Back to all articles