jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-2959) SQL2 QueryEngine should use Lucene for sort
Date Fri, 12 Aug 2011 17:46:35 GMT

    [ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084283#comment-13084283

Jukka Zitting commented on JCR-2959:

If I understand correctly, the implementation in the patch still doesn't use the actual fields
in Lucene for searching, it just loads the node using Session.getNodeByIdentifier() and uses
the OperandEvaluator to get the values by which documents are to be compared. It seems to
me that this is pretty much equivalent in terms of disk accesses to what the existing sorting
mechanism does and the speedup you're seeing is probably simply caused by the sorting mechanism
memorizing the values returned by the OperandEvaluator. I would expect that difference to
fade away if the test set becomes larger than what fits into memory.

Some math to give us more background: The execution of a single-selector query currently consists
of the following main step:

    1. Constructing the Lucene query
    2. Executing the Lucene query
    3. Fetching all matching nodes to memory
    4. Sorting the nodes

Step 1 is just an in-memory mapping, so not too time-consuming. Steps 2 and 3 are probably
the most time-consuming bits here, but both outside the scope of this issue. Finally, step
4 consists of O(n log n) in-memory operations, where n is the number of returned nodes. That's
some time, but as these are in-memory operations the overhead should be pretty small compared
to the previous steps, whose worst case behavior is O(n) disk accesses.

The proposed change here is to switch steps 3 and 4 around and use Lucene for the sorting.
The time required to sort all the results is still O(n log n) and may well require some extra
disk accesses to fetch the sort fields from the Lucene index. However, the time taken in this
step should still be significantly lower than in fetching all the matching nodes to memory.
Thus the big benefit that we can gain from this optimization comes in cases where we *don't*
want to fetch all the matching nodes to memory.

A good test case for such a scenario could be a data set of 1m nodes, of which a query targets
a subset of 100k nodes, and only the first 100 nodes of the sorted results are actually being
read. With a proper implementation of this mechanism, the Lucene search would find all the
matching 100k documents and read the sort fields (in the Lucene index) of those documents
for quickly ordering the results. That should still be pretty fast compared to fetching all
the 100k nodes into memory, which is what the current implementation (and the current patch)
does. And then fetching only 100 first results should take no time at all. I'd expect us to
reach up to an order of magnitude of performance improvement for such a scenario.

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
> Currently all SQL2 queries with "order by" expression are sorted with Collections.sort
- which is slow and very bad if there are lots of hits. Lucene should be used for sorting
hits instead.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message