In SQL, aggregate functions will return the group by values or the aggregate function results, but it is difficult (or at least harder than it should be) to return the primary key or ROWID. In contrast, most programming languages will return the instance (or a pointer/reference to the instance) when searching for items.
Background
A project manager allocated me several weeks ago on emergency basis to help out with another project that was having difficulties with a SQL Historian system. I ended up developing a nice set of SQL tables, functions, and stored procedures to transfer data from a remote linked server into the Historian. It was actually rather fun to work on as it had several challenging aspects. Of course, after several late nights of development, I don't know how much fun like that I could take!
The key to the data transfer was obtaining the latest data from the remote database and inserting it into the Historian. The remote database was an Oracle database and the Historian was a SQL Server database. My first reaction was to do a simple query with a TOP specification but Oracle doesn't have a TOP function so I started to search for ways to achieve the same result. I came across the ROW_NUMBER() function and it did the trick although it required a sub-query such as:
SELECT *
FROM (
SELECT
SAMPLE_DAY
, ASSET
, TANK_LEVEL_MM
, TANK_PRES_PSIG
, TANK_VOL_M3
, FEED_FLOW_M3HR
, ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) RN
FROM REMOTE_DATA_VIEW_1
ORDER BY ASSET, SAMPLE_DAY DESC
)
WHERE RN = 1
It turns out that this query is more powerful than the TOP query as it essentially performs a GROUP BY aggregate (ASSET and order by SAMPLE_DAY descending) without some of the limitations of SQL aggregate functions.
Exploring Aggregate Queries
In a traditional SQL aggregate query, the columns returned by the query have to be in the GROUP BY clause or in an aggregate function. This means that it is difficult to return the primary key for a table because you usually are grouping by a name, location, or other non-unique field. Here is an example:
-- DDL: This table stores the tag names, source, and values. This
-- table is just for demonstrative purposes and does not represent
-- a normalized structure.
CREATE TABLE TAG_VALUE
(
ValueID int IDENTITY PRIMARY KEY
, DateTime datetime NOT NULL
, TagName varchar(50) NOT NULL
, Value numeric NULL
, Ignored bit NOT NULL DEFAULT 0
, Processed bit NOT NULL DEFAULT 0
)
GO
-- Get the most recent entry grouped by tag name
SELECT TagName
, MAX(DateTime) AS DateTime
FROM TAG_VALUE
GROUP BY TagName
If you try to put the primary key as a return column, you will get an error. Unfortunately this means you end up having an ugly subquery and potentially expensive (hopefully you have indices on the matching columns as well), such as:
SELECT ValueID
FROM TAG_VALUE AS V1
JOIN (
SELECT TagName
, MAX(DateTime) AS DateTime
FROM TAG_VALUE
GROUP BY TagName
) AS V2
ON ( (V1.TagName = V2.TagName)
AND (V1.DateTime = V2.DateTime))
Using ROW_NUMBER
The ROW_NUMBER function has the powerful feature of specifying the PARTITION BY which provides some of the same functionality that GROUP BY would perform but with less hassle. For example, the following query analyzes the TAG_VALUE table and looks for duplicate TagName rows. It then marks the "old" values as "Ignored" and leaves the newest entry alone.
UPDATE TAG_VALUE
SET Ignored = 1
FROM TAG_VALUE AS V1
JOIN (
SELECT ValueID
, ROW_NUMBER() OVER (PARTITION BY TagName ORDER BY DateTime DESC) AS RN
FROM TAG_VALUE
WHERE TagName IN (
SELECT DISTINCT TagName
FROM TAG_VALUE
GROUP BY TagName
HAVING COUNT(TagName) > 1
)
) AS V2
ON (V1.ValueID = V2.ValueID)
WHERE RN > 1
Conclusion
So, in conclusion, ROW_NUMBER provides an easier method of returning the actual row so you can perform updates, deletes, or return the primary key.
Sample SQL Code
/* ====
NOTE: The table structure resembles a database design that I have seen used.
I did not design the table structure and it is intentionally flat. Originally
this example used a remote query to allow SQL Server to query ORACLE, but for
the sample it is all in SQL Server. With very minor changes it works in Oracle
as well.
==== */
-- DDL: This table simulates a remote linked server
CREATE TABLE REMOTE_DATA_VIEW_1
(
SAMPLE_DAY datetime
, ASSET varchar(50)
, TANK_LEVEL_MM numeric
, TANK_PRES_PSIG numeric
, TANK_VOL_M3 numeric
, FEED_FLOW_M3HR numeric
)
GO
-- Sample data
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-04', 'ASSET_01', 1200.0, 18.5, 100.0, 150.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-01', 'ASSET_01', 1100.0, 17.5, 90.0, 145.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-01-25', 'ASSET_01', 1000.0, 16.5, 80.0, 155.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-03', 'ASSET_02', 800.0, 19.5, 110.0, 160.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-02', 'ASSET_02', 750.0, 17.0, 95.0, 165.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-04', 'ASSET_03', 1500.0, 18.0, 100.0, 170.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-01-31', 'ASSET_04', 1350.0, 19.0, 90.0, 135.0)
GO
-- Return the raw data
SELECT
SAMPLE_DAY
, ASSET
, TANK_LEVEL_MM
, TANK_PRES_PSIG
, TANK_VOL_M3
, FEED_FLOW_M3HR
, ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) RN
FROM REMOTE_DATA_VIEW_1
ORDER BY ASSET, SAMPLE_DAY DESC
/* ====
SAMPLE_DAY ASSET TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR RN
----------------------- -------- ------------- -------------- ----------- -------------- --
2008-02-04 00:00:00.000 ASSET_01 1200 19 100 150 1
2008-02-01 00:00:00.000 ASSET_01 1100 18 90 145 2
2008-01-25 00:00:00.000 ASSET_01 1000 17 80 155 3
2008-02-03 00:00:00.000 ASSET_02 800 20 110 160 1
2008-02-02 00:00:00.000 ASSET_02 750 17 95 165 2
2008-02-04 00:00:00.000 ASSET_03 1500 18 100 170 1
2008-01-31 00:00:00.000 ASSET_04 1350 19 90 135 1
==== */
-- Use a simple embedded query
SELECT *
FROM (
SELECT
SAMPLE_DAY
, ASSET
, TANK_LEVEL_MM
, TANK_PRES_PSIG
, TANK_VOL_M3
, FEED_FLOW_M3HR
, ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) AS RN
FROM
REMOTE_DATA_VIEW_1
) DATA
WHERE RN = 1
/* ====
SAMPLE_DAY ASSET TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR RN
----------------------- -------- ------------- -------------- ----------- -------------- --
2008-02-04 00:00:00.000 ASSET_01 1200 19 100 150 1
2008-02-03 00:00:00.000 ASSET_02 800 20 110 160 1
2008-02-04 00:00:00.000 ASSET_03 1500 18 100 170 1
2008-01-31 00:00:00.000 ASSET_04 1350 19 90 135 1
==== */
-- This is an uglier version of the query above but provided for comparison.
-- Return only the most recent entry using SAMPLE_DAY and ASSET as keys. On a production
-- table, there should be a real primary key or ROWID that you would use instead.
SELECT
DATA.SAMPLE_DAY
, DATA.ASSET
, DATA.TANK_LEVEL_MM
, DATA.TANK_PRES_PSIG
, DATA.TANK_VOL_M3
, DATA.FEED_FLOW_M3HR
FROM (
SELECT
SAMPLE_DAY
, ASSET
, ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) AS RN
FROM
REMOTE_DATA_VIEW_1
) AS REMOTE_QUERY
JOIN REMOTE_DATA_VIEW_1 AS DATA
ON ( (REMOTE_QUERY.SAMPLE_DAY = DATA.SAMPLE_DAY)
AND (REMOTE_QUERY.ASSET = DATA.ASSET) )
WHERE REMOTE_QUERY.RN = 1
ORDER BY DATA.ASSET, DATA.SAMPLE_DAY DESC
/* ====
SAMPLE_DAY ASSET TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR
----------------------- -------- ------------- -------------- ----------- --------------
2008-02-04 00:00:00.000 ASSET_01 1200 19 100 150
2008-02-03 00:00:00.000 ASSET_02 800 20 110 160
2008-02-04 00:00:00.000 ASSET_03 1500 18 100 170
2008-01-31 00:00:00.000 ASSET_04 1350 19 90 135
==== */
GO
Technorati tags:
SQL,
SQL Server,
Oracle