lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eran Sevi" <>
Subject Re: Lucene implementation/performance question
Date Thu, 13 Nov 2008 09:43:01 GMT
I have the same need - to obtain "attributes" for terms stored in some
field. I also need all the results and can't take just the first few docs.

I'm using an older version of lucene and the method i'm using right now is
1. Store the words as usual in some field.
2. Store the attributesof each word in another field aligned with the words.
If you have several types of information, create a field for each type.
3. In order to retrieve the information, run a span query of the words
4. use the positions of the spans to "jump" directly to the stored metadata

This seems cumbersome but it works quite well and quite fast. You don't have
to do several passes - just iterate on the spans. you can easily add as many
attributes to the word as you want as long as they are aligned to the words
fields. The problems are that you are restricted to span queries and that
you have get all the attributes fields for each document and split the
fields before being able to jump to the right position.

However, Mark's suggestion to use PayloadSpanUtil might greatly improve my
solution if for each span we could have the payload information without
running another search. I saw some discussions about extending the Spans
class to include the combined payloads of the terms in each span, but I
don't think it was implemented.


On Wed, Nov 12, 2008 at 9:53 PM, Greg Shackles <> wrote:

> >
> > Right, sounds like you have it spot on. That second * from 3 looks like a
> > possible tricky part.
> I agree that it will be the tricky part but I think as long as I'm careful
> with counting as I iterate through it should be ok (I probably just doomed
> myself by saying that...)
>'d do it essentially how Highlighting do the search
> > to get the docs of interest, and then redo the search somewhat to get the
> > highlights/payloads for an individual doc at a time. You are redoing some
> > work, but if you think about, getting that info for every match (there
> could
> > be tons) doesn't make much since when someone might just look at the top
> > couple results, or say 10 at a time. Depends on your usecase if its
> feasible
> > or not though. Most find it efficient enough to do highlighting with, so
> I'm
> > assuming it should be good enough here.
> In my case I actually need all of the results, no matter how many there
> are.  I imagine my use case is somewhat different than what Lucene is
> typically used for, but if performance is good for payloads then I would
> suspect this method would prove to be pretty quick overall (especially
> compared to the old way I was doing it).  I think this is the route I'm
> going to go, so I will report back with progress as it goes.
> - Greg

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message