lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: Best Practices for getting Strings from a position range
Date Wed, 15 Aug 2007 01:22:32 GMT
Hi Peter,

Could you give more details on this test?  What are you comparing,  
etc.?  Sample queries would be good.  I would like to write up a  
contrib/Benchmark algorithm to begin investigating this and see if  
there is anything that can be done.

Thanks,
Grant

On Aug 10, 2007, at 6:27 PM, Peter Keegan wrote:

> ok, glad we're on the same page.
>
> I did some performance testing with span queries and,  
> unfortunately,  the
> results are discouraging for my intended use. When I added a simple
> SpanNearQuery to existing queries, the throughput decreased by a  
> factor of
> 10+. I figured spans would be expensive, but not that much. I  
> haven't done
> profiling yet, but here's a typical thread stack during execution:
>
> at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java: 
> 137)
> at org.apache.lucene.util.PriorityQueue.adjustTop 
> (PriorityQueue.java:101)
> at org.apache.lucene.search.spans.NearSpansUnordered.next(
> NearSpansUnordered.java:128)
> at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(
> SpanScorer.java:83)
> at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
> at org.apache.lucene.search.ConjunctionScorer.next 
> (ConjunctionScorer.java
> :56)
> at org.apache.lucene.search.BooleanScorer2.score 
> (BooleanScorer2.java:327)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java: 
> 146)
> at org.apache.lucene.search.Searcher.search(Searcher.java:118)
>
> Peter
>
>
> On 8/10/07, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>> Sorry for the confusion.  I thought you just wanted access to the
>> term info per position.    I think we will have to add something to
>> the Spans like we talked about before.
>>
>> -Grant
>>
>> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
>>
>>> Grant,
>>>
>>> I'm afraid I don't understand how to use this mapper in the context
>>> of a
>>> SpanQuery. It seems like I would have to modify SpanScorer to fetch
>>> payload
>>> data and provide a new method to access the payloads while
>>> iterating through
>>> the documents. If this can be accomplished without modifying Spans,
>>> could
>>> you provide a bit more detail?
>>>
>>> Thanks,
>>> Peter
>>>
>>> On 8/9/07, Peter Keegan <peterlkeegan@gmail.com> wrote:
>>>>
>>>> Hi Grant,
>>>>
>>>> I'm hoping to check this out soon.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
>>>>> provides a TermVectorMapper that loads by position.
>>>>>
>>>>> Still not what ideally what you want, but I haven't had time to
>>>>> scope
>>>>> that one out yet.,
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>>>>>
>>>>>> Hi Grant,
>>>>>>
>>>>>> No problem - I know you are very busy.  I just wanted to get a
>>>>>> sense for the
>>>>>> timing because I'd like to use this for a release this Fall. If I
>>>>>> can get a
>>>>>> prototype working in the coming weeks AND the performance is
>>>>>> great :) , this
>>>>>> would be terrific. If not, I'll have to fall back on a more  
>>>>>> complex
>>>>>> design
>>>>>> that handles the query outside of Lucene :(
>>>>>>
>>>>>> In the meantime, I'll try playing with LUCENE-868.
>>>>>>
>>>>>> Thanks for the update.
>>>>>> Peter
>>>>>>
>>>>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
>>>>>>>
>>>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't
>>>>>>> see it
>>>>>>> happening this week, but maybe next.
>>>>>>>
>>>>>>> I do think the Mapper approach via TermVectors will work.  It
 
>>>>>>> will
>>>>>>> require implementing a new mapper that orders by position, but
I
>>>>>>> don't think that is too hard.   I started on one on the  
>>>>>>> LUCENE-868
>>>>>>> patch (version 4) but it is not complete.  Maybe you want to
 
>>>>>>> pick
>>>>>>> it up?
>>>>>>>
>>>>>>> With this approach, you would iterate your spans, when you come
>>>>>>> to a
>>>>>>> new doc, you would load the term vector using the
>>>>>>> PositionMapper, and
>>>>>>> then you could index into the positions for the matches in the
>>>>>>> document.
>>>>>>>
>>>>>>> I realize this does not cover the just wanting to get the
>>>>>>> Payload at
>>>>>>> the match issue.  Maybe next week...
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Grant
>>>>>>>
>>>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>>>>>>
>>>>>>>> Any idea on when this might be available (days, weeks...)?
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org>
wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> : Do we have a best practice for going from, say
a SpanQuery
>>>>>>> doc/
>>>>>>>>>> : position information and retrieving the actual
range of
>>>>>>>>> positions of
>>>>>>>>>> : content from the Document?  Is it just to reanalyze
the
>>>>>>> Document
>>>>>>>>>> : using the appropriate Analyzer and start recording
once you
>>>>>>>>> hit the
>>>>>>>>>> : positions you are interested in?    Seems like
Term Vectors
>>>>>>>>> _could_
>>>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
>>>>>>> doesn't
>>>>>>>>>> : really help, because they are stored in a term-centric
>>>>>>> manner.  I
>>>>>>>>>> : guess what I am after is a position centric approach.
 That
>>>>>>>>> is, give
>>>>>>>>>>
>>>>>>>>>> this is kind of what i was suggesting in the last
message i
>>>>>>>>>> sent
>>>>>>>>>> to the java-user thread about paylods and SpanQueries
(which
>>>>>>>>>> i'm
>>>>>>>>>> guessing is what prompted this thread as well)...
>>>>>>>>>>
>>>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
>>>>>>>>>> tf3988826.html#a11551628
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is one use case, the other is related to the new
patch I
>>>>>>>>> submitted for LUCENE-960.  In this case, I have a
>>>>>>>>> SpanQueryFilter
>>>>>>>>> that identifies a bunch of docs and positions ahead of
time.
>>>>>>>>> Then
>>>>>
>>>>>>>>> the user enters new Span Query and I want to relate the
 
>>>>>>>>> matches
>>>>>>> from
>>>>>>>>> the user query with the positions of matches in the filter
and
>>>>>>> then
>>>>>>>>> show that window.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> my point was that currently, to retrieve a payload
you need a
>>>>>>>>>> TermPositions instance, which is designed for iterating
in  
>>>>>>>>>> the
>>>>>>>>>> order of...
>>>>>>>>>>     seek(term)
>>>>>>>>>>       skipTo(doc)
>>>>>>>>>>          nextPosition()
>>>>>>>>>>             getPayload()
>>>>>>>>>> ...which is great for getting the payload of every
instance
>>>>>>>>>> (ie:position) of a specific term in a given document
(or in
>>>>>>> every
>>>>>>>>>> document) but without serious changes to the Spans
API, the
>>>>>>> ideal
>>>>>>>>>> payload
>>>>>>>>>> API would let you say...
>>>>>>>>>>     skipTo(doc)
>>>>>>>>>>        advance(startPosition)
>>>>>>>>>>          getPayload()
>>>>>>>>>>        while (nextPosition() < endPosition)
>>>>>>>>>>          getPosition()
>>>>>>>>>>
>>>>>>>>>> but this seems like a nearly impossible API to implement
>>>>>>> given the
>>>>>>>>>> natore
>>>>>>>>>> of hte inverted index and the fact that terms aren't
ever
>>>>>>> stored in
>>>>>>>>>> position order.
>>>>>>>>>>
>>>>>>>>>> there's a lot i really don't know/understand about
the lucene
>>>>>>> term
>>>>>>>>>> position internals ... but as i recall, the datastructure
>>>>>>> written
>>>>>>>>>> to disk
>>>>>>>>>> isn't actually a tree structure inverted index, it's
a long
>>>>>>>>>> sequence of
>>>>>>>>>> tuples correct?  so in theory you could scan along
the tuples
>>>>>>>>>> untill you
>>>>>>>>>> find the doc you are interested in, ignoring all
of the term
>>>>>>> info
>>>>>>>>>> along
>>>>>>>>>> the way, then whatever term you happen be on at the
 
>>>>>>>>>> moment, you
>>>>>>>>>> could scan
>>>>>>>>>> along all of the positions until you find one in
the range
>>>>>>> you are
>>>>>>>>>> interested in -- assuming you do, then you record
the current
>>>>>>> Term
>>>>>>>>>> (and
>>>>>>>>>> read your payload data if interested)
>>>>>>>>>
>>>>>>>>> I think the main issue I see is in both the payloads
and the
>>>>>>> matching
>>>>>>>>> case above is that they require a document centric approach.
>>>>>>>>> And
>>>>>>>>> then, for each Document,
>>>>>>>>> you ideally want to be able to just index into an array
so  
>>>>>>>>> that
>>>>>>> you
>>>>>>>>> can go directly to the position that is needed based
on
>>>>>>>>> Span.getStart()
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> if i remember correctly, the first part of this is
easy, and
>>>>>>>>>> relative fast
>>>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions
will
>>>>>>> happily
>>>>>>>>>> scan for
>>>>>>>>>> the first <term,doc> pair with the correct
docId,
>>>>>>> irregardless of
>>>>>>>>>> the term
>>>>>>>>>> ... the only thing i'm not sure about is how efficient
it  
>>>>>>>>>> is to
>>>>>>>>>> loop over
>>>>>>>>>> nextPosition() for every term you find to see if
any of them
>>>>>>> are in
>>>>>>>>>> your
>>>>>>>>>> range ... the best case scenerio is that the first
position
>>>>>>>>>> returned is
>>>>>>>>>> above the high end of your range, in which case you
can stop
>>>>>>>>>> immediately
>>>>>>>>>> and seek to the next term -- butthe worst case is
that you  
>>>>>>>>>> call
>>>>>>>>>> nextPosition() over an over a lot of times before
you get a
>>>>>>>>>> position in
>>>>>>>>>> (or above) your rnage .... an advancePosition(pos)
that
>>>>>>> wokred like
>>>>>>>>>> seek
>>>>>>>>>> or skipTo might be helpful here.
>>>>>>>>>>
>>>>>>>>>> : I feel like I am missing something obvious.  I
would
>>>>>>> suspect the
>>>>>>>>>> : highlighter needs to do this, but it seems to take
the
>>>>>>> reanalyze
>>>>>>>>>> : approach as well (I admit, though, that I have
little
>>>>>>> experience
>>>>>>>>>> with
>>>>>>>>>> : the highlighter.)
>>>>>>>>>>
>>>>>>>>>> as i understand it the default case is to reanalyze,
but  
>>>>>>>>>> if you
>>>>>>>>> have
>>>>>>>>>> TermFreqVector info stored with positions (ie: a
>>>>>>>>>> TermPositionVector) then
>>>>>>>>>> it can use that to construct a TokenStream by iterating
over
>>>>>>>>>> all
>>>>>>>>>> terms and
>>>>>>>>>> writing them into a big array in position order (see
the
>>>>>>>>>> TermSources class
>>>>>>>>>> in the highlighter)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ah, I see that now.  Thanks.
>>>>>>>>>>
>>>>>>>>>> this makes sense when highlighting because it doesn't
know  
>>>>>>>>>> what
>>>>>>>>>> kind of
>>>>>>>>>> fragmenter is going to be used so it needs the whole
>>>>>>> TokenStream,
>>>>>>>>>> but it
>>>>>>>>>> seems less then ideal when you are only interested
in a small
>>>>>>>>>> number of
>>>>>>>>>> position ranges that you know in advance.
>>>>>>>>>>
>>>>>>>>>> : I am wondering if it would be useful to have an
alternative
>>>>>>> Term
>>>>>>>>>> : Vector storage mechanism that was position centric.
>>>>>>> Because we
>>>>>>>>>> : couldn't take advantage of the lexicographic  
>>>>>>>>>> compression, it
>>>>>>>>> would
>>>>>>>>>> : take up more disk space, but it would be a lot
faster for
>>>>>>> these
>>>>>>>>>> kinds
>>>>>>>>>>
>>>>>>>>>> i'm not sure if it's really neccessary to store the
data in a
>>>>>>>>> position
>>>>>>>>>> centric manner, assuming we have a way to "seek"
by position
>>>>>>> like i
>>>>>>>>>> described above -- but then again i don't really
know that
>>>>>>> what i
>>>>>>>>>> described above is all that possible/practical/performant.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I suppose I could use my Mapper approach to organize
things  
>>>>>>>>> in a
>>>>>>>>> position centric way now that I think about it more.
 Just
>>>>>>> means some
>>>>>>>>> unpacking and repacking.  Still, probably would perform
well
>>>>>>> enough
>>>>>>>>> since I can setup the correct structure on the fly. 
I will
>>>>>>> give this
>>>>>>>>> a try.  Maybe even add a Mapper to do this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Grant
>>>>>>>>>
>>>>>>>>>
>>>>>>> ----------------------------------------------------------------

>>>>>>> --
>>>>>>> ---
>>>>>
>>>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-dev- 
>>>>>>>>> help@lucene.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.grantingersoll.com/
>>>>>>> http://lucene.grantingersoll.com
>>>>>>> http://www.paperoftheweek.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----------------------------------------------------------------

>>>>>>> --
>>>>>>> ---
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://lucene.grantingersoll.com
>>>>>
>>>>> Lucene Helpful Hints:
>>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> --
>>>>> -
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message