lucene-java-user mailing list archives

From Alex <alexandervomb...@googlemail.com>
Subject Re: Storing payloads without term-position and frequency
Date Thu, 03 Feb 2011 20:49:10 GMT
Hello Grant,

I currently store a payload only for the first instance of each term,
because I index each token of an article just once. What I want to build
is an index for versioned document collections such as Wikipedia (see this
paper: http://www.cis.poly.edu/suel/papers/archive.pdf).

In detail: on the first level (Lucene) I create one document per
Wikipedia article, containing all distinct terms of all its versions. On
the second level (payloads) I store the frequency information for each
article version and its terms. When I search, I find an article by a
term, and the payload of that term tells me about the other versions and
how often the token occurred in each of them. (In my case, with one
instance per term, the payload position is always 1!) So I look up the
first level and pick only the information I need from the second level.
This way I avoid storing the same information several times, because
most versions of a Wikipedia article are very similar (in their term
content).
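
A minimal sketch of how the second level could be packed into a single
payload, assuming one frequency per article version and a variable-length
byte encoding. The class name VersionFreqPayload and the layout (version
count followed by one frequency per version) are hypothetical and chosen
only for illustration; this is not necessarily the encoding used in the
index described above.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: packs per-version term frequencies into one payload byte array. */
final class VersionFreqPayload {

    /** Encodes freqs[i] (frequency of the term in version i) as: count, then each frequency. */
    static byte[] encode(int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, freqs.length);
        for (int f : freqs) {
            writeVInt(out, f);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads the count, then that many frequencies. */
    static int[] decode(byte[] payload) {
        int[] pos = {0}; // one-element array used as a mutable read cursor
        int n = readVInt(payload, pos);
        int[] freqs = new int[n];
        for (int i = 0; i < n; i++) {
            freqs[i] = readVInt(payload, pos);
        }
        return freqs;
    }

    /** Writes a non-negative int using 7 data bits per byte; the high bit marks "more bytes follow". */
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one variable-length int starting at pos[0] and advances the cursor. */
    private static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }
}
```

Small frequencies take one byte each, so an article with many near-identical
versions stays compact. In Lucene 3.0.x the resulting byte array could
presumably be wrapped in an org.apache.lucene.index.Payload and attached to
the single indexed instance of the term.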

This works so far. I just want to reduce my index size, but I don't know
how much I can save by disabling term frequencies and positions.
I hope I could explain the problem a little. If not, just tell me and I
will try to explain it again. :)

Best regards
Alex

PS: I am currently looking for a bedroom in New York, Brooklyn (Park
Slope or near NYU Poly). Maybe somebody rents a room from 15 Feb until
15 April. :)

On Thursday, 03.02.2011, at 12:38 -0500, Grant Ingersoll wrote:
> Payloads only make sense in terms of specific positions in the index, so I don't think
> there is a way to hack Lucene for it.  You could, I suppose, just store the payload for the
> first instance of the term.
> 
> Also, what's the use case you are trying to solve here?  Why store term frequency as
> a payload when Lucene already does it (and it probably does it more efficiently)?
> 
> -Grant
> 
> On Feb 2, 2011, at 2:35 PM, Alex vB wrote:
> 
> > 
> > Hello everybody,
> > 
> > I am currently using Lucene 3.0.2 with payloads. I store extra information
> > about each term, such as frequencies, in the payloads, and therefore I don't
> > need the frequencies and term positions that Lucene normally stores. I would
> > like to set f.setOmitTermFreqAndPositions(true), but then I am not able to
> > retrieve payloads. Would it be hard to "hack" Lucene for my needs? Also, I now
> > store only one payload per term, if that information makes it easier.
> > 
> > Best regards
> > Alex
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 