lucene-java-user mailing list archives

From Carsten Schnober <>
Subject Re: Small Vocabulary
Date Thu, 02 Aug 2012 08:19:03 GMT
On 31.07.2012 12:10, Ian Lea wrote:

Hi Ian,

> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs?  The latter is always likely to be slow.
> General performance advice as in
> may be
> relevant.  SSDs and loads of RAM never hurt.

You are quite right: the slower searches on that index return many
results from many docs. However, I am still wondering about the
theoretical implications: a small vocabulary with many tokens in an
inverted index yields rather long lists of occurrences for some, many,
or all of the search terms (depending on the actual distribution).
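
To make the concern concrete, here is a minimal sketch in plain Java. It
does not use Lucene and says nothing about how Lucene stores postings
internally; the class name, the method names, and the four-term
vocabulary are all hypothetical, and the even distribution of terms is
an assumption made only to keep the example simple. It builds a toy
positional inverted index and reports the average number of postings
per term, which by definition is the token count divided by the
vocabulary size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: NOT Lucene's index format or API.
class PostingListSketch {

    // Build a toy positional inverted index: term -> list of token positions.
    static Map<String, List<Integer>> buildIndex(String[] tokens) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    // Average postings per term = total number of tokens / vocabulary size.
    static double averagePostingLength(Map<String, List<Integer>> index) {
        long total = 0;
        for (List<Integer> postings : index.values()) {
            total += postings.size();
        }
        return index.isEmpty() ? 0.0 : (double) total / index.size();
    }

    public static void main(String[] args) {
        // Hypothetical small vocabulary; terms are assigned round-robin,
        // so every term ends up with an equally long posting list.
        String[] vocabulary = {"t1", "t2", "t3", "t4"};
        String[] tokens = new String[100_000];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = vocabulary[i % vocabulary.length];
        }

        Map<String, List<Integer>> index = buildIndex(tokens);
        System.out.println("terms: " + index.size());
        System.out.println("avg postings per term: " + averagePostingLength(index));
    }
}
```

With a fixed number of tokens, shrinking the vocabulary lengthens the
average list proportionally, so any query term has to walk a large share
of the index. A skewed distribution would make some lists even longer
than this average.
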
Thanks for the pointer to the codecs in Lucene 4; I suppose that will
be the actual point of attack for that scenario. It may be a silly
question, but one that might be of interest to the whole community ;-):
can someone point me to in-depth documentation of Lucene 4 codecs,
ideally covering both the theoretical background and the
implementation? There are numerous helpful blog entries, presentations,
etc. available on the net, but if there is a central resource, I have
not been able to find it.
Best regards,

Institut für Deutsche Sprache
Projekt KorAP
Tel. +49-(0)621-43740789
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
