lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: Indexing documents with pre-calculated term frequencies
Date Wed, 11 Feb 2015 16:18:23 GMT
An example of why you might do this is when your input is a term vector
(i.e., a list of unique terms with weights) rather than text in the usual
sense.  In that case the best way forward does seem to be generating a
text with repeated terms.  I looked at the alternative, and it gets quite
involved in low-level Lucene code.
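
A minimal sketch of that expansion, assuming the input uses the
"term|freq" format from Stephen's example below. The class and method
names are hypothetical and nothing here touches the Lucene API; the
resulting string would simply be indexed as an ordinary text field.

```java
import java.util.StringJoiner;

// Hypothetical helper: turns "the|2 big|2 wolf|1" into "the the big big wolf".
final class TermFreqExpander {

    // Expands whitespace-separated "term|freq" tokens into repeated terms.
    static String expand(String encoded) {
        String trimmed = encoded.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringJoiner out = new StringJoiner(" ");
        for (String token : trimmed.split("\\s+")) {
            // Split on the last '|' so that terms containing '|' still parse.
            int sep = token.lastIndexOf('|');
            if (sep <= 0) {
                throw new IllegalArgumentException("Expected term|freq, got: " + token);
            }
            String term = token.substring(0, sep);
            int freq = Integer.parseInt(token.substring(sep + 1));
            for (int i = 0; i < freq; i++) {
                out.add(term);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "the the big big wolf ate sheep"
        System.out.println(expand("the|2 big|2 wolf|1 ate|1 sheep|1"));
    }
}
```

Note that the original word order is lost, so the expanded text gives
correct term frequencies but position-dependent queries (phrases, spans)
on that field would not be meaningful.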


On 02/11/2015 08:01 AM, Erick Erickson wrote:
> You could consider payloads but why do you want to do this?
> What's the use case here? Sounds a little like an XY problem, you're
> asking us how to do something without explaining the why; there
> may be other ways to accomplish your task.
> For instance, there's the "termfreq" function, which can be returned
> as a field in the doc, see:
> Best,
> Erick
> On Wed, Feb 11, 2015 at 4:54 AM, Stephen Fenech wrote:
>> Hi,
>> I would like to index documents which contain term frequencies instead of
>> the actual text. For example, instead of getting "The big wolf ate the big
>> sheep" I would get "the|2 big|2 wolf|1 ate|1 sheep|1". An easy way to index
>> this would be to convert the frequencies back into text, so into something
>> like "the the big big wolf ate sheep", but it does not look that elegant
>> since I would be expanding the text, just to have Lucene "compress" it
>> again.
>> Any ideas? Or directions I should look into?
>> I am considering:
>> - Custom Analyzer (so I expand it while generating the TokenStream from the
>> compressed text)
>> Thanks in Advance,
>> Stephen