Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows? The following solution seems to work AND it's fairly snappy performer in SQL Anywhere 12.0.1.3298; the question is, is there a better solution?
IF VAREXISTS ( '@partition_id' ) = 0 THEN CREATE VARIABLE @partition_id UNSIGNED INTEGER; END IF; IF VAREXISTS ( '@current_primary_key' ) = 0 THEN CREATE VARIABLE @current_primary_key UNSIGNED INTEGER; END IF; SET @partition_id = 2; SET @current_primary_key = 290505; SELECT newer_t.primary_key AS destination_primary_key FROM ( SELECT TOP 100 ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number, t.primary_key FROM t WHERE t.partition_id = @partition_id AND t.primary_key > @current_primary_key ORDER BY t.primary_key ASC ) AS newer_t WHERE newer_t.scrolling_row_number = 100 ORDER BY newer_t.primary_key DESC;I'm sure all you front-row students out there already have a suggestion or two, but the rest of us need some more information. Here's the table:
CREATE TABLE DBA.t ( -- 487,445 rows, 54.5M total = 37.4M table + 0 ext + 17M index, 118 bytes per row partition_id /* X */ UNSIGNED INT NOT NULL, primary_key /* PK */ UNSIGNED BIGINT NOT NULL DEFAULT autoincrement, other_data_1 VARCHAR ( 1 ) NOT NULL DEFAULT 'N', other_data_2 VARCHAR ( 1 ) NOT NULL DEFAULT 'Y', other_data_3 VARCHAR ( 32767 ) NOT NULL DEFAULT '', other_data_4 BIGINT NOT NULL, other_data_5 BIGINT NOT NULL, other_data_6 /* X */ TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_7 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_8 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_9 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_10 BIGINT NOT NULL DEFAULT 0, CONSTRAINT ASA415 PRIMARY KEY ( -- 904k primary_key ) ); CREATE CLUSTERED INDEX ix_other_data_6 ON DBA.t ( -- 12.7M other_data_6 ); CREATE INDEX ix_partition_id ON DBA.t ( -- 3.5M partition_id );Here's the question again: Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows? Here are some definitions:
- Current row: where the autoincrementing primary_key = x
- Desired result: y = the destination primary key, or NULL if there are fewer than 100 newer rows
- Newer: where the primary_key is > x
- Subset: where the partition_id = z
- The test database has 487,445 rows in this table; in the real world millions of rows are common.
- The test database has only 4 distinct values in partition_id, with two values alternating throughout almost all of the rows: 1, 2, 1, 2, ... for primary keys n, n + 1, n + 2, n + 3, ...
- The partition_id column doesn't work in the sense of horizontal partitioning or sharding; in some real-world databases all the rows have the same partition_id, in other cases there are a lot of partition_id values but only a few account for most of the rows, and in this case (as noted above) the rows are evenly divided between two partition_id values.
- The primary_key may be a perfect candidate for a CLUSTERED index, but the other_data_6 column is used in expensive range queries (not discussed here) so it got the CLUSTERED index. Sadly, there is apparently a world-wide shortage of the "CLUSTERED" keyword so it has been decreed that only one index per table be CLUSTERED; perhaps that doesn't matter if I read Anil Goel's comment on this post correctly: Multiple Clustered Indexes
SELECT newer_t.primary_key AS destination_primary_key FROM ( SELECT TOP 100 ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number, t.primary_key FROM t WHERE t.partition_id = @partition_id AND t.primary_key > @current_primary_key ORDER BY t.primary_key ASC ) AS newer_t WHERE newer_t.scrolling_row_number = 100 ORDER BY newer_t.primary_key DESC;
- The ROW_NUMBER() OVER clause on line 3 assigns a row number 1, 2, 3, ... for the rows in the inner SELECT.
- It is important (I think, at least in this case) for the OVER ORDER BY on line 3 to be the same as the SELECT ORDER BY on line 8. If they're different (in this case) funky things happen with regards to returning the right answer (that part, I know for sure).
- The predicate t.partition_id = @partition_id on line 6 limits the candidate rows for the inner SELECT to exactly one partition.
- The predicate t.primary_key > @current_primary_key on line 7 limits the candidate rows to "newer" rows.
- The SELECT TOP 100 clause on line 2 limits the inner result set to the first 100 "newer" rows.
- The WHERE clause on line 10 throws away the first 99 "newer" rows and takes row number 100.
- The outer ORDER BY on line 11 is an "oops"... clearly the outer SELECT don't need no steenking ORDER BY. The query engine doesn't eliminate the clause, but at the same time (judging by the plans) it doesn't spend any time executing it either.
- a run with a cold cache,
- followed by an immediate repeat of exactly the same query, and
- then a test of the "next step forward": the destination_primary_key returned by the previous query was fed into the @current_primary_key for the step forward.
Estimated Actual RunTime Runtime ======= ======= Cold cache 10.418 1.7735 Repeat 1.7516 0.55487 Step forward 1.4718 0.39121You can look at the plans in dbisql; here's where to get them: cold_cache_plan.saplan repeat_plan.saplan step_forward_plan.saplan Here's a quick look at table t for the first two plans (cold cache and immediate repeat); you can see that when the cache warmed up the table scan was replaced by an index: So, besides getting rid of the oops-es, what else can I do or try?