lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: SpanQuery and database join
Date Mon, 13 Aug 2007 20:16:41 GMT
There is also a Use Cases item on the Wiki...

On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:

> I suppose it could go under performance or HowTo/Interesting uses of
> SpanQuery.
>
> Peter
>
> On 8/13/07, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>> Thanks for writing this up. Do you think this is an appropriate  
>> subject
>> for the Wiki performance page?
>>
>> Erick
>>
>> On 8/13/07, Peter Keegan <peterlkeegan@gmail.com> wrote:
>>>
>>> I've been experimenting with using SpanQuery to perform what is
>>> essentially
>>> a limited type of database 'join'. Each document in the index  
>>> contains 1
>>> or
>>> more 'rows' of meta data from another 'table'. The meta data are  
>>> simple
>>> tokens representing a column name/value pair ( e.g. color$red or
>>> location$123).  Each row is represented by a span with a maximum  
>>> token
>>> length equal to the maximum number of meta data columns. If a  
>>> column has
>>> multiple values, they are all indexed at the same position ( e.g.
>>> color$red,
>>> color$blue). All rows are added to a single field. The spans are
>>> 'separated'
>>> from each other by introducing a position gap between them via '
>>> Analyzer.getPositionIncrementGap'. This gap should be greater  
>>> than the
>>> number of columns in each span.
>>>
>>> At query time, a SpanNearQuery is constructed to represent the  
>>> meta data
>>> to
>>> join. The 'slop' value is set to the maximum number of meta data  
>>> columns
>>> (minus 1). Using a simple Antlr parser, boolean span queries with  
>>> AND,
>> OR,
>>> NOT can be constructed fairly easily. The SpanQuery is And'd to  
>>> the main
>>> query to build the final query.
>>>
>>> This approach is flexible and pretty efficient because no stored  
>>> fields
>> or
>>> external data are accessed at query time. Span queries are more
>> expensive
>>> compared than other queries, though. We measure performance via
>> throughput
>>> (as opposed to the response time for a single query), and the  
>>> addition
>> of
>>> a
>>> SpanQuery reduced throughput by 5X for ordered spans and 10X for
>> unordered
>>> spans. Still, this may be acceptable for some applications,  
>>> especially
>> if
>>> spans are not used on every query.
>>>
>>> I thought this might interest some of you.
>>>
>>> Peter
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message