lucene-java-user mailing list archives

From Stephen Fenech <>
Subject Indexing documents with pre-calculated term frequencies
Date Wed, 11 Feb 2015 09:54:30 GMT

I would like to index documents which contain term frequencies instead of
the actual text. For example, instead of getting "The big wolf ate the big
sheep" I would get "the|2 big|2 wolf|1 ate|1 sheep|1". An easy way to index
this would be to convert the frequencies back into text, so into something
like "the the big big wolf ate sheep", but it does not look that elegant
since I would be expanding the text just to have Lucene "compress" it again.

Any ideas? Or directions I should look into?

I am considering:
- Custom Analyzer (so that I expand the terms while generating the TokenStream
from the compressed text)
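
To make the expansion step concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name TermFrequencyExpander and the method expand are hypothetical, chosen only for illustration, and the "term|count" format is assumed to be whitespace-separated as in the example above. In a custom Analyzer, the same loop would presumably live in the TokenStream's incrementToken(), emitting each term "count" times instead of building a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): expands "term|count" entries
// into the repeated tokens a TokenStream would have to emit.
public class TermFrequencyExpander {

    // Turns "the|2 big|2 wolf|1" into [the, the, big, big, wolf].
    public static List<String> expand(String encoded) {
        List<String> tokens = new ArrayList<>();
        String trimmed = encoded.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String entry : trimmed.split("\\s+")) {
            // Split on the last '|' so that terms containing '|' still parse.
            int sep = entry.lastIndexOf('|');
            if (sep <= 0 || sep == entry.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected term|count but got: " + entry);
            }
            String term = entry.substring(0, sep);
            int count = Integer.parseInt(entry.substring(sep + 1));
            if (count < 1) {
                throw new IllegalArgumentException(
                        "Count must be at least 1: " + entry);
            }
            // Emit the term once per occurrence.
            for (int i = 0; i < count; i++) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String encoded = "the|2 big|2 wolf|1 ate|1 sheep|1";
        // Prints: the the big big wolf ate sheep
        System.out.println(String.join(" ", expand(encoded)));
    }
}
```

Note that the expanded token order does not reflect the original word order, so positions (and therefore phrase queries) would not be meaningful with this approach; only the term frequencies are preserved.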

Thanks in advance,

