Top-n query tuning in huge OLTP applications

Marcin Miller's picture
articles: 

Top-n Query is one of the more advanced SQL problems. How does one retrieve N first (or least) rows from a record set? For example, how does one find the top five highest-paid employees in a given department?

People with no experience in such situation usually try to solve the problem with a query like this:

SELECT e.*
 FROM emp e
WHERE ROWNUM < 6
 ORDER BY sal DESC

Of course, this is the incorrect solution. The 'where rownum<6' condition is executed BEFORE all the rows are sorted. This means that instead of the top five highest-paid employees, you get a random set of five employees sorted by their salaries. Below is the correct solution:

SELECT *
FROM   (SELECT   ROWNUM,
                 e.*
        FROM     emp e
        ORDER BY sal DESC)
WHERE  ROWNUM < 6

Similar queries are often used in various reports and sometimes have much more complicated structure - for instance:

SELECT ename,
       sal,
       dname,
       manager
FROM   (SELECT   ROWNUM,
                 e.ename,
                 e.sal,
                 d.dname,
                 e2.ename AS manager
        FROM     emp e,
                 dept d,
                 emp e2
        WHERE    e.deptno = d.deptno
                 AND e.mgrno = e2.empno
                 AND e.job = 'SALESMAN'
        ORDER BY e.sal DESC)
WHERE  ROWNUM < 6;

In this case, the Oracle optimizer has to create a whole Cartesian product described by the inline view, then sort it out and finally retrieve the first five rows. This query is executed very quickly on SCOTT's tables, but the problem becomes complicated if the table EMP consists of one milion rows. The problem also becomes even more complicated if (for example) besides just the name of employee and his/her manager, you also need to get the employee's address, phone number, shop's address, etc. All these data must be read from another tables and joined with the inline view before sorting. It may cause the query to execute very slowly.
Is there any other way to do it?

Yes, there is. Just modify the starting query:

SELECT   e.ename,
         e.sal,
         d.dname,
         e2.ename
FROM     (SELECT *
          FROM   (SELECT   empno,
                           ename,
                           sal,
                           mgrno,
                           deptno
                  FROM     emp
                  WHERE    job = 'SALESMAN'
                  ORDER BY emp.sal DESC)
          WHERE  ROWNUM < 6) e,
         dept d,
         emp e2
WHERE    e.deptno = d.deptno
         AND e.mgrno = e2.empno
ORDER BY sal DESC

At first glance it looks like an unnecessary complication. But in fact the structure of the query forces the optimizer to find the top five highest-paid employees, retrieve their data, and just after that join them with other tables. The benefit is obvious: the join operation is executed for only five rows, instead of hundreds of thousands. The proper use of top-n queries may also increase efficiency when the application retrieves many rows for the given conditions, but the user is actually interested only in a few of them. For example, the application user might need to look through the invoices (with all the details, products, prices, etc.) paid by a selected customer. The user is usually more interested in the most recent invoices, so the data must be sorted in descending order by invoice date - for example:

SELECT   c.custno,
         custname,
         i.invno,
         invdate,
         detno,
         p.prodno,
         proddesc,
         detamount,
         prodprice
FROM     customers c,
         invoices i,
         details d,
         products p
WHERE    c.custno = i.custno
         AND i.invno = d.invno
         AND d.prodno = p.prodno
         AND c.custno = 123
ORDER BY i.invdate DESC;

The real application using a query like this worked very fast at first, but after several months it slowed down, and users began to complain. Searching for the data of 'heavy users' with thousands of invoices took a lot of time and became very irritating. After a few meetings with the application users it was found that, in 95% of the cases, information about the last 20 invoices was sufficient. This data is easy to retrieve by the query:

SELECT c.custno,
       custname,
       i.invno,
       invdate,
       detno,
       p.prodno,
       proddesc,
       detamount,
       prodprice
FROM   (SELECT *
        FROM   (SELECT   custno,
                         invno,
                         invdate
                FROM     invoices
                WHERE    custno = 123
                ORDER BY invdate DESC)
        WHERE  ROWNUM < 21) i,
       details d,
       products p,
       customers c
WHERE  c.custno = i.custno
       AND i.invno = d.invno
       AND p.prodno = d.prodno;

This query sorts data only in the INVOICES table, retrieves just 20 rows, and joins them with other tables. Thanks to this, the query is always executed in less than 0,1 seconds. Of course, the application user has also the opportunity to search more than the last 20
invoices. In such a case, the user has to check the proper check box on the form. This causes the query to execute in the original, full (and slow) version - but at least the user is aware of this.

The examples described above may seem very trivial to many users, but I could give many similar cases: sorting customers by the amount of bought goods, by the number of executed money transfers, by the value
of paid bills; printing 20 operations per bank statement, and so on. In all these situations, moving the conditions and "rownum" clause to the inner query may increase efficiency significantly. For these who are not yet convinced, here are two simple scripts that illustrate the described problems. Both scripts were tested on Oracle9i Enterprise Edition Release 9.2.0.6.0 on HP-UX.

Example 1: Setup

CREATE TABLE emp (
  empno  NUMBER(10),
  ename  VARCHAR2(100),
  deptno NUMBER(10),
  job    VARCHAR2(20),
  sal    NUMBER(10,2),
  mgrno  NUMBER(10));

CREATE TABLE dept (
  deptno NUMBER(10),
  dname  VARCHAR2(100));

CREATE SEQUENCE emp_seq  START WITH 1;
CREATE SEQUENCE dept_seq START WITH 1;

CREATE INDEX emp_idx1 ON emp (
      empno);

CREATE INDEX emp_idx2 ON emp (
      job,
      sal);

CREATE INDEX dept_idx1 ON dept (
      deptno);

CREATE OR REPLACE PROCEDURE test1
AS
  salary   NUMBER(10,2);
  manager  NUMBER(10,2);
BEGIN
  dbms_random.Initialize(1000);
  -- departments

  FOR i IN 1.. 100 LOOP
    INSERT INTO dept (deptno,dname )
    VALUES     (dept_seq.nextval,
                'DEPT' || To_char(dept_seq.currval));
  END LOOP;
  -- employees in every department

  FOR j IN 1.. 100 LOOP
    FOR i IN 1.. 1000 LOOP
      salary := Abs(Mod(dbms_random.random,

                        100000));

      manager := Abs(Mod(dbms_random.random,
                         100000));

      INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
      VALUES     (emp_seq.nextval,
                  'SCOTT' || To_char(emp_seq.nextval),
                  j,
                  'CLERK',
                  salary,
                  manager);

      INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
      VALUES     (emp_seq.nextval,
                  'SCOTT' || To_char(emp_seq.nextval),
                  j,
                  'ANALYST',
                  salary,
                  manager);

      INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
      VALUES     (emp_seq.nextval,
                  'SCOTT' || To_char(emp_seq.nextval),
                  j,
                  'CLERK',
                  salary,
                  manager);

      INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
      VALUES     (emp_seq.nextval,
                  'SCOTT' || To_char(emp_seq.nextval),
                  j,
                  'SALESMAN',
                  salary,
                  manager);

      INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )

      VALUES     (emp_seq.nextval,
                  'SCOTT' || To_char(emp_seq.nextval),
                  j,
                  'MANAGER',
                  salary,
                  manager);
    END LOOP;

    COMMIT;
  END LOOP;
END;
/

EXEC test1;
ANALYZE TABLE dept COMPUTE STATISTICS;
ANALYZE TABLE emp  COMPUTE STATISTICS;

Query 1: Slow

SELECT ename,
       sal,
       dname,
       manager
FROM   (SELECT   ROWNUM,
                 e.ename,
                 e.sal,
                 d.dname,
                 e2.ename AS manager
        FROM     emp e,
                 dept d,
                 emp e2
        WHERE    e.deptno = d.deptno
                 AND e.mgrno = e2.empno
                 AND e.job = 'SALESMAN'
        ORDER BY e.sal DESC)
WHERE  ROWNUM < 6;

Time: 6,125 s

Query 2: Fast

SELECT   e.ename,
         e.sal,
         d.dname,
         e2.ename
FROM     (SELECT *
          FROM   (SELECT   empno,
                           ename,
                           sal,
                           mgrno,
                           deptno
                  FROM     emp
                  WHERE    job = 'SALESMAN'
                  ORDER BY emp.sal DESC)
          WHERE  ROWNUM < 6) e,
         dept d,
         emp e2
WHERE    e.deptno = d.deptno
         AND e.mgrno = e2.empno
ORDER BY sal DESC

Time: 0,65 s

Example 2: Setup

CREATE TABLE customers (
  custno   NUMBER(10),
  custname VARCHAR2(20));

CREATE TABLE invoices (
  invno   NUMBER(10),
  custno  NUMBER(10),
  invdate DATE);

CREATE TABLE details (
  detno     NUMBER(10),
  invno     NUMBER(10),
  prodno    NUMBER(10),
  detamount NUMBER(10));

CREATE TABLE products (
  prodno    NUMBER(10),
  proddesc  VARCHAR2(20),
  prodprice NUMBER(10,2));

CREATE SEQUENCE cust_seq START WITH 1;
CREATE SEQUENCE inv_seq  START WITH 1;
CREATE SEQUENCE det_seq  START WITH 1;
CREATE SEQUENCE prod_seq START WITH 1;

CREATE INDEX cust_idx1 ON customers (
      custno);

CREATE INDEX inv_idx1 ON invoices (
      invno);

CREATE INDEX inv_idx2 ON invoices (
      custno,
      invdate);

CREATE INDEX det_idx1 ON details (
      detno);

CREATE INDEX det_idx2 ON details (
      invno);

CREATE INDEX prod_idx1 ON products (
      prodno);

CREATE OR REPLACE PROCEDURE test2
AS
  customer  NUMBER(10);
  product   NUMBER(10);
BEGIN
  dbms_random.Initialize(1000);
  -- customers

  FOR i IN 1.. 1000 LOOP
    INSERT INTO customers
    VALUES     (cust_seq.nextval,
                'CUSTOMER ' || To_char(cust_seq.currval));
  END LOOP;
  -- products

  FOR i IN 1.. 1000 LOOP
    INSERT INTO products
    VALUES     (prod_seq.nextval,
                'PRODUCT' || To_char(prod_seq.currval),
                prod_seq.currval);
  END LOOP;
  -- invoices

  FOR i IN 1.. 100000 LOOP
    customer := Abs(Mod(dbms_random.random,
                        1000));


    INSERT INTO invoices
    VALUES     (inv_seq.nextval,
                customer,
                SYSDATE + Abs(Mod(inv_seq.currval,
                                  100)));
    -- details

    FOR j IN 1.. 10 LOOP
      product := Abs(Mod(dbms_random.random,
                         1000));

      INSERT INTO details
      VALUES     (det_seq.nextval,
                  inv_seq.currval,
                  product,
                  Mod(j,
                      5));
    END LOOP;

    COMMIT;
  END LOOP;
END;
/

EXEC test2;

ANALYZE TABLE customers COMPUTE STATISTICS;
ANALYZE TABLE products  COMPUTE STATISTICS;
ANALYZE TABLE invoices  COMPUTE STATISTICS;
ANALYZE TABLE details   COMPUTE STATISTICS;

Query 1: Slow

SELECT   c.custno,
         custname,
         i.invno,
         invdate,
         detno,
         p.prodno,
         proddesc,
         detamount,
         prodprice
FROM     customers c,
         invoices i,
         details d,
         products p
WHERE    c.custno = i.custno
         AND i.invno = d.invno
         AND d.prodno = p.prodno
         AND c.custno = 123
ORDER BY i.invdate DESC;

Time: 0,578 s

Query 2: Fast

SELECT c.custno,
       custname,
       i.invno,
       invdate,
       detno,
       p.prodno,
       proddesc,
       detamount,
       prodprice
FROM   (SELECT *
        FROM   (SELECT   custno,
                         invno,
                         invdate
                FROM     invoices
                WHERE    custno = 123
                ORDER BY invdate DESC)
        WHERE  ROWNUM < 21) i,
       details d,
       products p,
       customers c
WHERE  c.custno = i.custno
       AND i.invno = d.invno
       AND p.prodno = d.prodno;

 Time: 0,093 s

Comments

Why not use standard SQL's "ROW_NUMBER() OVER" window-function when Oracle supports it?

Finally: Beware of tie conditions. In some situations - probably more than one may think at first - the "RANK() OVER" function is what's really needed. I've written about such things at http://troels.arvin.dk/db/rdbms/ (section "Limiting result sets").

Hi,

Good up to very good article.

Since I was involved in performance issues regarding top n queries recently I would like to remark
1. The importance of an index on the sort column created with desc attribute.
(create index emp_idx on emp (sal desc) )
2. The importance of optimizer_mode first_rows_n
3. As suggested here above the rank and the dense rank functions can be used as well, execution plan is often better with a WINDOW SORT PUSHED RANK in the explain plan.

Recently I faced this query, table t1 contained 1.000.000 rows

select * from (select * from t1 order by col1 desc, col2 desc) where rownum < 21

A full table scan on t1 was shown with set autotrace on

I created an index
create index t1_idx1 on t1 (col1 desc, col2 desc)
I gathered statistics with dbms_stats
I ran the same query with optimizer_mode = first_rows_10

I reduces IO with factor 15.000 !

Regards
Guy Lambregts