lucene-java-user mailing list archives

From Carsten Schnober
Subject Re: Small Vocabulary
Date Tue, 07 Aug 2012 07:29:07 GMT
On 06.08.2012 20:29, Mike Sokolov wrote:

Hi Mike,

> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> from the Hathi Trust.
> The upshot in a nutshell was that queries including terms with very
> large postings lists (ie high occurrences) were slow, and the approach
> they took to dealing with this was to index n-grams (ie pairs and
> triplets of adjacent tokens).  However I'm not sure this would help much
> if your queries will typically include only a single token.

This is indeed very interesting for our use case. However, you are right
that indexing n-grams is not, per se, a solution to my problem, because
I'm working on an application that uses multiple indexes. A query for a
single, isolated frequent term will presumably be rare, or at least rare
enough to tolerate slow response times, but the results will typically
be intersected with results from other indexes.
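
For reference, the n-gram idea from the quoted post can be sketched
roughly as follows. This is plain Python rather than Lucene code, and the
function name and sample documents are made up for illustration; it only
shows why a pair of adjacent tokens yields a shorter postings list than a
very frequent single token.

```python
from collections import defaultdict


def index_bigrams(docs):
    """Map each pair of adjacent tokens to the set of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens.
        for left, right in zip(tokens, tokens[1:]):
            postings[(left, right)].add(doc_id)
    return postings


# Hypothetical toy corpus: "the" occurs in every document,
# whereas the pair ("the", "mat") is more selective.
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat",
    3: "a cat on the mat",
}
postings = index_bigrams(docs)
```

A query for a single frequent token gains nothing from this, which is the
limitation mentioned in the quoted message.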

To illustrate this more practically: the index I described, which has
relatively few distinct tokens, some of them extremely frequent, indexes
part-of-speech (POS) tags with positional information stored in the
payload. A parallel index indexes the actual text; a typical query may
look for a certain POS tag in one index and a word X at the same position
with a matching payload in the other index. So both indexes need to be
queried completely before the intersection can be performed.
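
A minimal sketch of that intersection step, in plain Python rather than
Lucene code. It assumes each index has already returned its hits as
(doc_id, position) pairs, with the position taken from the payload; the
function name and the sample hits are hypothetical.

```python
def intersect_by_position(pos_hits, word_hits):
    """Return the (doc_id, position) pairs that occur in both hit lists."""
    # Both hit lists have to be fully materialised before intersecting,
    # so a very frequent POS tag makes the first list expensive to build.
    return sorted(set(pos_hits) & set(word_hits))


# Hypothetical hits, each given as (doc_id, token_position).
pos_hits = [(1, 0), (1, 3), (2, 1), (3, 2)]  # a POS tag in the POS index
word_hits = [(1, 3), (2, 0), (3, 2)]         # word X in the text index

matches = intersect_by_position(pos_hits, word_hits)
```

Only the pairs found in both lists survive, i.e. the places where word X
carries the requested tag.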


Institut für Deutsche Sprache |
Projekt KorAP                 |
Tel. +49-(0)621-43740789      |
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
