db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Craig Russell <Craig.Russ...@Sun.COM>
Subject Re: derby performance and 'order by'
Date Mon, 19 Sep 2005 17:55:02 GMT
Hi Ali,

On Sep 19, 2005, at 10:26 AM, Suavi Ali Demir wrote:

> Actually, it sounds like the problem of finding top 1000 rows out  
> of  166333 rows is different than sorting 166333 rows and maybe it  
> could be optimized.

Indeed.

> There is no need to sort all 166333 but the information that we are  
> only looking 1000 rows would have to be passed all the way down to  
> the point where Derby decides to sort. I have not thought through  
> the details of an algorithm but when nRows we want is substantially  
> smaller than TotalRows then i just feel there should be a better  
> way to pick those nRows. For example, if nRows were 1, then all we  
> had to do would be 1 single pass on 166333 rows to find the max.  
> That is quite different than sorting all and this idea should be  
> possible to generalize on 1<=nRows<TotalRows.

I agree that this would be a useful improvement. Now, how do we tell  
the back end that we want only the first  1000 rows?

Or more generally (as found in competitive products):

How do we tell the back end that we want to skip the first N and  
return the next M rows?

Craig

> Ali
>
>
> Craig Russell <Craig.Russell@Sun.COM> wrote:
> Hi Scott,
>
> From the query plan it appears that your filter selects 166,333  
> rows, of which you want the first 1000 according to the ordering of  
> the order_id column. You can see that this is an effective strategy  
> because             Number of rows qualified=166333 Number of rows  
> visited=166333. There's no time lost visiting rows that don't qualify.
>
> The database has to sort the 166,333 rows because the results are  
> ordered according to the index scan column "time" not according to  
> the order_id column. All of the rows need to be sorted even though  
> you only want the first 1000 rows. I'd guess that the sorting of  
> the 166,333 rows is what accounts for the 15 second delay you are  
> experiencing.
>
> The index on order_id doesn't do you any good because you have a  
> result that isn't indexed on order_id. If this isn't obvious, try  
> to think of an algorithm that would use the order_id index on the  
> result set.
>
> Craig
>
> On Sep 19, 2005, at 9:29 AM, scotto wrote:
>
>> So for the second query:
>>
>> select * from orders
>> where time > '10/01/2002' and time < '11/30/2002'
>> order by order_id;
>>
>> the query plan shows that the index IX_ORDERS_TIME is used to  
>> filter the
>> result set by time.  The order by step does not use the primary  
>> key index to
>> sort the results after the filter step.  My questions:
>>
>> --Is it correct that the sort step not use the primary key index  
>> in this
>> case?
>>
>> --Why is it not possible to use the index on order_id to sort  
>> after the
>> filter has happened?
>>
>> Here is the query plan:
>>
>> Statement Name:
>>     null
>> Statement Text:
>>     select * from orders
>> where time > '10/01/2002' and time < '11/30/2002'
>> order by order_id
>> Parse Time: 0
>> Bind Time: 0
>> Optimize Time: 0
>> Generate Time: 0
>> Compile Time: 0
>> Execute Time: 14329
>> Begin Compilation Timestamp : null
>> End Compilation Timestamp : null
>> Begin Execution Timestamp : 2005-09-19 09:20:06.171
>> End Execution Timestamp : 2005-09-19 09:20:20.5
>> Statement Execution Plan Text:
>> Sort ResultSet:
>> Number of opens = 1
>> Rows input = 166333
>> Rows returned = 1000
>> Eliminate duplicates = false
>> In sorted order = false
>> Sort information:
>>     Number of merge runs=1
>>     Number of rows input=166333
>>     Number of rows output=166333
>>     Size of merge runs=[93695]
>>     Sort type=external
>>     constructor time (milliseconds) = 0
>>     open time (milliseconds) = 14297
>>     next time (milliseconds) = 32
>>     close time (milliseconds) = 0
>>     optimizer estimated row count:        78377.51
>>     optimizer estimated cost:       166745.12
>>
>> Source result set:
>>     Index Row to Base Row ResultSet for ORDERS:
>>     Number of opens = 1
>>     Rows seen = 166333
>>     Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 0
>>         next time (milliseconds) = 10488
>>         close time (milliseconds) = 0
>>         optimizer estimated row count:        78377.51
>>         optimizer estimated cost:       166745.12
>>
>>         Index Scan ResultSet for ORDERS using index IX_ORDERS_TIME
>> at read committed isolation level using instantaneous share row  
>> locking
>> chosen by the optimizer
>>         Number of opens = 1
>>         Rows seen = 166333
>>         Rows filtered = 0
>>         Fetch Size = 16
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 0
>>             next time (milliseconds) = 3438
>>             close time (milliseconds) = 0
>>             next time in milliseconds/row = 0
>>
>>         scan information:
>>             Bit set of columns fetched=All
>>             Number of columns fetched=2
>>             Number of deleted rows visited=0
>>             Number of pages visited=887
>>             Number of rows qualified=166333
>>             Number of rows visited=166333
>>             Scan type=btree
>>             Tree height=3
>>             start position:
>>     > on first 1 column(s).
>>     Ordered null semantics on the following columns:
>>
>>             stop position:
>>     >= on first 1 column(s).
>>     Ordered null semantics on the following columns:
>>
>>             qualifiers:
>> None
>>             optimizer estimated row count:        78377.51
>>             optimizer estimated cost:       166745.12
>>
>>
>>
>>
>> --scott
>>
>>
>>
>> -----Original Message-----
>> From: Sunitha Kambhampati [mailto:ksunithaghm@gmail.com]
>> Sent: Friday, September 16, 2005 5:55 PM
>> To: Derby Discussion
>> Subject: Re: derby performance and 'order by'
>>
>> Scott Ogden wrote:
>>
>>
>>> I have observed some interesting query performance behavior and am
>>> hoping someone here can explain.
>>>
>>> In my scenario, it appears that an existing index is not being used
>>> for the 'order by' part of the operation and as a result the
>>> performance of certain queries is suffering.
>>>
>>> Can someone explain if this is supposed to be what is happening and
>>> why? Please see below for the specific queries and their performance
>>> characteristics.
>>>
>>> Here are the particulars:
>>>
>>> ---------------------------------
>>>
>>> create table orders(
>>>
>>> order_id varchar(50) NOT NULL
>>>
>>> CONSTRAINT ORDERS_PK PRIMARY KEY,
>>>
>>> amount numeric(31,2),
>>>
>>> time date,
>>>
>>> inv_num varchar(50),
>>>
>>> line_num varchar(50),
>>>
>>> phone varchar(50),
>>>
>>> prod_num varchar(50));
>>>
>>> --Load a large amount of data (720,000 records) into the 'orders'  
>>> table
>>>
>>> --Create an index on the time column as that will be used in the
>>> 'where' clause.
>>>
>>> create index IX_ORDERS_TIME on orders(time);
>>>
>>> --When I run a query against this table returning top 1,000 records,
>>> this query returns very quickly, consistently less than .010  
>>> seconds.
>>>
>>> select * from orders
>>>
>>> where time > '10/01/2002' and time < '11/30/2002'
>>>
>>> order by time;
>>>
>>> --Now run a similarly query against same table, returning the top
>>> 1,000 records.
>>>
>>> --The difference is that the results are now sorted by the  
>>> primary key
>>> ('order_id') rather than 'time'.
>>>
>>> --This query returns slowly, approximately 15 seconds. Why??
>>>
>>> select * from orders
>>>
>>> where time > '10/01/2002' and time < '11/30/2002'
>>>
>>> order by order_id;
>>>
>>> --Now run a third query against the same 'orders' table, removing  
>>> the
>>> where clause
>>>
>>> --This query returns quickly, around .010 seconds.
>>>
>>> select * from orders
>>>
>>> order by order_id;
>>>
>>> ---------------------------------------------
>>>
>>>
>> If you run with derby.language.logQueryPlan=true, the actual query  
>> plans
>> used for the following queries will be written to derby.log. This  
>> will
>> show what indexes was used by the optimizer. Also see
>> http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .
>>
>> Query with 'order by' will require sorting. Usually, sorting  
>> requires an
>> extra step to put the data into the right order. This extra step  
>> can be
>> avoided for data that are already in the right order. For example,  
>> if a
>> single-table query has an ORDER BY on a single column, and there  
>> is an
>> index on that column, sorting can be avoided if Derby uses the  
>> index as
>> the access path.
>>
>> I think in case of your first and third query the optimizer will pick
>> the available index thus probably avoiding requiring the sort step.
>>
>> Your second query involves more work than the first query, since  
>> it has
>> a search condition on time, and an order by order_id. Thus if the
>> optimizer picks the index on time, that will involve a sort step on
>> order_id.
>> ____________
>>
>> Thanks,
>> Sunitha.
>>
>>
>>
>
> Craig Russell
> Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
> 408 276-5638 mailto:Craig.Russell@sun.com
> P.S. A good JDO? O, Gasp!
>

Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:Craig.Russell@sun.com
P.S. A good JDO? O, Gasp!


Mime
View raw message