Monday 12 December 2011

Querying for Top 1 of group by (row_number() subqueries)

Often, there are clusters of records and the requirement is to return only one record per cluster based on a max or min of one of the values.

In the past, I've done this a number of ways
  1. Group to get the max/min value (maxkey, shall we say) then create functions to retrieve the column values for primary keys plus maxkey within the source table
  2. Group to get the max/min value and create subqueries on the table within the select clause referencing the primary keys plus maxkey (same as #1 when it comes to performance)
  3. Collect all records handle the filtering of non maxkey records at the consumption level (outside of sql)
These have not been satisfying at all. But the use of row_number() under SQL2005 allows for a bit cleaner implementation. I would welcome feedback with respect to efficiency. I'd really like to believe that this is head and shoulders above the other options.

So here's the pseudo-query:

Get the entire set and capture the row_number() partitioned by your intended output, sorted by your maxkey column name (desc or asc)

Wrap this in an outer query and only capture rows with the row_number() = 1.

That's it.
So here's an example. You have a table with sellers, buyers, product and transaction date. You want to see all of the content from the table but only for the last transaction per buyer (to see what it was that was bought, who sold it).
my_table (seller_id (PK), buyer_id (PK), product_id(PK), transaction_date (PK))
-------------------------
select seller_id, buyer_id, product_id, transaction_date from
(
select row_number() over (partition by buyer_id order by transaction_date desc) rown
, seller_id, buyer_id, product_id, transaction_date
from my_table
) x where rown=1
-------------------------

So in reading a few places about expensive routines in SQL, I haven't seen issues mentioned with this type of technique. However, it's hard from this point to decide how nasty the magic of row_number() is for efficiency.

So time for feedback - is this an appropriate method for getting those top values? What is another method that could be better?

Thanks for reading...

No comments:

Post a Comment

Please provide your thoughts. I appreciate both flavours of feedback.