lucene-java-user mailing list archives

From Mike Sokolov
Subject Re: Small Vocabulary
Date Mon, 06 Aug 2012 18:29:34 GMT
Some interesting work has been done on optimizing queries that include
very common words (stop words), and I think it overlaps with your problem.
See this blog post from the Hathi Trust.

In a nutshell, queries that include terms with very large postings
lists (i.e. terms that occur very often) were slow, and their approach
to dealing with this was to index n-grams (i.e. pairs and triplets of
adjacent tokens).  However, I'm not sure this would help much if your
queries typically include only a single token.
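
To make the n-gram idea concrete, here is a rough sketch in plain Python.
It is not Lucene code and not the Hathi Trust implementation; the function
names and the three sample tag sequences are made up for illustration. In
Lucene itself, the analysis-chain component that emits this kind of token
n-gram is ShingleFilter.

```python
from collections import defaultdict


def shingles(tokens, min_n=1, max_n=3):
    """Yield (position, term) for every n-gram of adjacent tokens."""
    for i in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            if i + n <= len(tokens):
                yield i, " ".join(tokens[i:i + n])


def build_index(docs, max_n=3):
    """Map each term to its postings list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for pos, term in shingles(tokens, 1, max_n):
            index[term].append((doc_id, pos))
    return index


# Hypothetical sample data: three short POS-tag sequences.
docs = [
    "ART ADV ADJA NN".split(),
    "ART ADJA NN VVFIN".split(),
    "ART NN VVFIN ADV".split(),
]
index = build_index(docs)

# Compare postings list lengths for single tags and for tag sequences.
for term in ("ART", "ADJA", "ADJA NN", "ART ADV ADJA"):
    print(term, len(index.get(term, [])))
```

A sequence query such as "ART ADV ADJA" becomes a single lookup in a short
postings list instead of an intersection of three long ones. A single-tag
query such as "ADJA" still has to walk the full postings list for that tag,
which is why I doubt this helps the single-token case.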


On 07/30/2012 09:07 AM, Carsten Schnober wrote:
> Dear list,
> I'm considering using Lucene for indexing sequences of part-of-speech
> (POS) tags instead of words; for those who don't know, POS tags are
> linguistically motivated labels that are assigned to tokens (words) to
> describe their morpho-syntactic function. Instead of sequences of words, I
> would like to index sequences of tags, for instance "ART ADV ADJA NN".
> The aim is to be able to search (efficiently) for occurrences of "ADJA".
> The question is whether Lucene can be applied to deal with that data
> cleverly because the statistical properties of such pseudo-texts are very
> distinct from natural language texts and make me wonder whether Lucene's
> inverted indexes are suitable. Especially the small vocabulary size (<50
> distinct tokens, depending on the tagging system) is problematic, I suppose.
> First trials for which I have implemented an analyzer that just outputs
> Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
> not exactly perfect regarding search performance, in a test corpus with
> a few million tokens. The number of tokens in production mode is
> expected to be much larger, so I wonder whether this approach is
> promising at all.
> Does Lucene (4.0?) provide optimization techniques for extremely small
> vocabulary sizes?
> Thank you very much,
> Carsten Schnober
