incubator-jena-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andy Seaborne (JIRA)" <>
Subject [jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries
Date Sat, 13 Aug 2011 16:56:27 GMT


Andy Seaborne commented on JENA-89:

In creating the tests, a bug was discovered.

The heap needs to keep the least N items. So when a new item to keep comes in, the greatest
item so far needs to be evicted, not the least.  In other words, the heap is kept in maximum
order and the test to to look at the max of the N items (heap top) to see if the new binding
is less than the max least N so far.

Patch attached.  heap uses a reversed comparator.

Also (minor) avoid a copy by Arrays.asList.

> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>                 Key: JENA-89
>                 URL:
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, JENA-89-Top-FixHeapCollation.patch,
JENA-89-TopTests.patch, JENA-89-TopTests.patch
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result set and
then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] (which is
based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor [4] will
need to use a new QueryIterTopN instead of QueryIterSort + QueryIterSlice. A new TransformOrderByLimit
to be used by Optimize is also necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when results
are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. (Often users stop at
the first page). Ideally, we could cache the ORDER BY and implement the OFFSET|LIMIT using
results from the cache. However, the improvement described by this issue is limited to the
ORDER BY + LIMIT case for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in case of
small values specified on the LIMIT. 
>   [1]
>   [2]
>   [3]
>   [4]

This message is automatically generated by JIRA.
For more information on JIRA, see:


View raw message