Posts
288
Comments
25
Trackbacks
207
Monday, February 25, 2008
More Row Number Magic

Having posted last night about row numbers and ranking, I realised in the shower this morning that there's actually an easy way to get the first and last rows per group, just by using ROW_NUMBER. The trick is to PARTITION, meaning that the row number is reset per partition. So I could do:

SELECT  Employee,  TransactionTime,
  ROW_NUMBER() OVER (PARTITION BY Employee, dbo.DateOnly(TransactionTime) ORDER BY  TransactionTime) AS Row1,   ROW_NUMBER() OVER (PARTITION BY Employee, dbo.DateOnly(TransactionTime) ORDER BY  TransactionTime DESC) AS Row2
FROM ActualData

The result set is:

Employee TransactionTime Row1 Row2
Dave 01 Jan 08 09:00 1 4
Dave 01 Jan 08 13:00 2 3
Dave 01 Jan 08 14:00 3 2
Dave 01 Jan 08 18:00 4 1
Dave 02 Jan 08 09:00 1 2
Dave 02 Jan 08 17:30 2 1
Alex 01 Jan 08 09:00 1 2
Alex 01 Jan 08 17:30 2 1

So now the first row per group is where Row1 = 1 and the last row per group is where Row2 = 1. Easy.

posted @ Monday, February 25, 2008 9:10 AM | Feedback (0)
Finding Hidden Gems

Sometimes things pass you by. I've been using the ROW_NUMBER() function SQL Server 2005 for ages, along with Common Table Expressions, to do server side paging, but had completely missed some other new functions. This is an odd occurrence for me as I tend to always read the new features of any product, but somehow a few of the additions to SQL passed me by. In particular the RANK function, which today saved me a huge headache and will simplify a ton of my queries.

Over the past year I've been dealing with a lot of time based data and a new project is almost exclusively based around data that is time based. I've had the need to get the first and last entries per group. Consider employee data where there could be many entries per day and you need the first and last entry per employee per day; or even all entries, but the first and last highlighted in some way or processed differently. You can easily get the first and last by using MIN and MAX and grouping, but grouping has its own problems, when you require more columns that you are grouping on. The RANK function though, assigns a number based upon the grouping, rather than grouping the data, so each group is assigned a different rank. Consider the following set of data:

Employee TransactionTime
Dave 01 Jan 08 09:00
Dave 01 Jan 08 13:00
Dave 01 Jan 08 14:00
Dave 01 Jan 08 18:00
Dave 02 Jan 08 09:00
Dave 02 Jan 08 17:30
Alex 01 Jan 08 09:00
Alex 01 Jan 08 17:30

To get the first and last per employee per day I could easily use:

SELECT Employee, MIN(Time), MAX(Time)
FROM ActualData
GROUP BY Employee, dbo.DateOnly(Time)
ORDER BY Employee, dbo.DateOnly(Time)

This is a simple grouping (using one of the date and time functions from Jeff's Excellent blog). The problem occurs when I want addition columns in the result set; you can't just add them to the select list because they either have to be aggregates or added to the grouping. Let's now consider row numbers and ranking:

SELECT  Employee,  TransactionTime,
  ROW_NUMBER() OVER (ORDER BY Employee, dbo.DateOnly(TransactionTime)) AS RowNumber,
  RANK() OVER (ORDER BY Employee, dbo.DateOnly(TransactionTime)) AS Rank
FROM ActualData

Our result set is now

Employee TransactionTime RowNumber Rank
Dave 01 Jan 08 09:00 1 1
Dave 01 Jan 08 13:00 2 1
Dave 01 Jan 08 14:00 3 1
Dave 01 Jan 08 18:00 4 1
Dave 02 Jan 08 09:00 5 5
Dave 02 Jan 08 17:30 6 5
Alex 01 Jan 08 09:00 7 7
Alex 01 Jan 08 17:30 8 7

The row number is obvious, just a sequential number ordered by the employee name and transaction time. The rank is the same for each row where the order by is the same; so for each employee on the same day, the rank is the same. One thing that you instantly notice is that the first row of each grouping has the same number for row and rank, giving you a simple way to pick the first item. It doesn't, however, give you a way to identify the last item in a group; you've no way to know how many items there will be in each group, so rank can't help. You could probably do it with sub-queries and/or CTEs, more ranking and so on, but I wanted to avoid that if possible.

The big leap came when I discovered you can use aggregates with OVER and not just ROW_NUMBER and RANK (and DENSE_RANK and TILE, which are also new). The syntax is slightly different, but what you can do is:

SELECT  Employee,  TransactionTime,
  MIN(TransactionTime) OVER (PARTITION BY Employee, dbo.DateOnly(TransactionTime)) AS FirstEntry,
  MAX(TransactionTime) OVER (PARTITION BY Employee, dbo.DateOnly(TransactionTime)) AS LastEntry
FROM ActualData

The result set is now:

Employee TransactionTime FirstEntry LastEntry
Dave 01 Jan 08 09:00 01 Jan 08 09:00 01 Jan 08 18:00
Dave 01 Jan 08 13:00 01 Jan 08 09:00 01 Jan 08 18:00
Dave 01 Jan 08 14:00 01 Jan 08 09:00 01 Jan 08 18:00
Dave 01 Jan 08 18:00 01 Jan 08 09:00 01 Jan 08 18:00
Dave 02 Jan 08 09:00 02 Jan 08 09:00 02 Jan 08 17:30
Dave 02 Jan 08 17:30 02 Jan 08 09:00 02 Jan 08 17:30
Alex 01 Jan 08 09:00 01 Jan 08 09:00 01 Jan 08 17:30
Alex 01 Jan 08 17:30 01 Jan 08 09:00 01 Jan 08 17:30

The first and last rows per group are now easy to identify; the first row has TransactionTime = FirstEntry and the last row has TransactionTime = LastEntry. Numbering and ranking might be slightly more elegant, but for my data having the first and last on every row is actually useful; it means I can now replace the cursor that processes this data with a simple query.

These new functions are going to massively change the way I work with time based data in my current project. I now have to read up on the new features in SQL Server 2008, to make sure I don't miss anything there.

posted @ Monday, February 25, 2008 12:18 AM | Feedback (1)