db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brett Wooldridge <brett.wooldri...@gmail.com>
Subject Re: Filtering using the sort order on two columns
Date Tue, 04 Aug 2009 09:02:42 GMT
Educated guess, but I think an index scan will not be used because the
predicate provides no "stop condition".  See "What's Optimizable?" here:

http://db.apache.org/derby/docs/10.0/manuals/tuning/perf43.html

*Possibly* can you force the use of an index by overriding the optimizer:

http://db.apache.org/derby/docs/10.5/tuning/ctunoptimzoverride.html

>From the first document:

"Sometimes a table scan is the most efficient way to access data, even if a
potentially useful index is available. For example, if the statement returns
virtually all the data in the table, it is more efficient to go straight to
the table instead of looking values up in an index, because then Derby is
able to avoid the intermediate step of retrieving the rows from the index
lookup values."

The trouble is, because your query is open-ended (using > with no <), Derby
doesn't know how many rows will be returned ... maybe it's the entire table
... in which case a table-scan is more efficient.  You *might* also try
using ROW_NUMBER() to limit the query and maybe trick the optimizer (search
the FAQ for LIMIT), but somehow I don't think the optimizer is that smart.

Any reason not to cache the result of the first query, and manually step
through the results rather than constantly re-running a query to get the
"next" row?

-Brett

On Tue, Aug 4, 2009 at 3:30 PM, Daniel Noll <daniel@nuix.com> wrote:

> Hi all.
>
> I have a table and index like this:
>
>    CREATE TABLE Terms (
>        id        INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY NOT NULL,
>        field     VARCHAR(255) NOT NULL,
>        text      VARCHAR(255) NOT NULL,
>        docfreq   INTEGER NOT NULL
>    )
>
>    CREATE INDEX TermsFieldAndText ON Terms (field, text)
>
> Here's some example data.  Table is shown in sorted order as it would
> be listed in the index above:
>
> | field     | text      | docfreq      |
> ----------------------------------------
> | body      | hello     | 10           |
> | tag       | home      | 100          |
> | tag       | work      | 80           |
>
> For a given input term (let's say tag:home) I want to find the _next_
> entry in the table (so it should return tag:work).  My naive
> implementation does this:
>
>    SELECT field, text, docfreq
>    FROM Terms
>    WHERE (field = ? AND text > ?) OR field > ?
>    ORDER BY field ASC, text ASC
>
> This does the right thing as far as the output is concerned, but Derby
> does a table scan.  Looking at the query, my impression would be that
> in the worse case scenario, Derby should do two index scans.  Or, if
> it's smart, one index scan.
>
> I'm aware that I might be able to work around this by having instead a
> field with the concatenation of both fields with
> CONCAT(field,'\0',text).  But I wanted to check first whether there is
> a way I can maintain my relatively understandable table structure
> before resorting to hacks.  Basically I just need the data in index
> order, starting from a given point... so I figure there is a simple
> solution.
>
> Daniel
>
>
> --
> Daniel Noll                            Forensic and eDiscovery Software
> Senior Developer                              The world's most advanced
> Nuix                                                email data analysis
> http://nuix.com/                                and eDiscovery software
>

Mime
View raw message