lucene-java-user mailing list archives

From Ian Lea
Subject Re: Small Vocabulary
Date Tue, 31 Jul 2012 10:10:15 GMT
Lucene 4.0 allows you to use custom codecs and there may be one that
would be better for this sort of data, or you could write one.

In your tests is it the searching that is slow or are you reading lots
of data for lots of docs?  The latter is always likely to be slow.
General performance advice may be
relevant.  SSDs and loads of RAM never hurt.
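
To illustrate the distinction, here is a toy sketch in plain Java. It is
not Lucene code, and the class and method names are made up for this
example. With fewer than 50 distinct tags, each tag's posting list is
likely to cover a large share of the documents, so looking up the term is
cheap while the number of matching docs you then have to read is large.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy inverted index over POS tag sequences (hypothetical, not Lucene). */
public class PosTagIndexSketch {
    // tag -> ascending list of ids of the documents that contain the tag
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int docCount = 0;

    /** Adds one whitespace-separated tag sequence and returns its document id. */
    public int addDocument(String tagSequence) {
        int docId = docCount++;
        for (String tag : tagSequence.trim().split("\\s+")) {
            if (tag.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(tag, k -> new ArrayList<>());
            // Record each document once per tag, even if the tag repeats.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    /** The term lookup itself: a single map access, however many docs match. */
    public List<Integer> postingsFor(String tag) {
        return postings.getOrDefault(tag, Collections.emptyList());
    }

    /** Fraction of all documents that a search for this tag would return. */
    public double matchFraction(String tag) {
        return docCount == 0 ? 0.0 : (double) postingsFor(tag).size() / docCount;
    }

    public static void main(String[] args) {
        PosTagIndexSketch index = new PosTagIndexSketch();
        index.addDocument("ART ADV ADJA NN");
        index.addDocument("ART ADJA NN VVFIN");
        index.addDocument("PPER VVFIN ART NN");
        for (String tag : new String[] {"ADJA", "NN", "PPER"}) {
            System.out.printf("%s matches %.0f%% of docs%n",
                    tag, 100 * index.matchFraction(tag));
        }
    }
}
```

If matchFraction comes out high for the tags you search on, the time is
probably going into collecting hits and loading documents rather than
into the term lookup.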


On Mon, Jul 30, 2012 at 2:07 PM, Carsten Schnober wrote:
> Dear list,
> I'm considering using Lucene for indexing sequences of part-of-speech
> (POS) tags instead of words; for those who don't know, POS tags are
> linguistically motivated labels that are assigned to tokens (words) to
> describe their morpho-syntactic function. Instead of sequences of words, I
> would like to index sequences of tags, for instance "ART ADV ADJA NN".
> The aim is to be able to search (efficiently) for occurrences of "ADJA".
> The question is whether Lucene can be applied to deal with that data
> cleverly because the statistical properties of such pseudo-texts are very
> distinct from natural language texts and make me wonder whether Lucene's
> inverted indexes are suitable. Especially the small vocabulary size (<50
> distinct tokens, depending on the tagging system) is problematic, I suppose.
> First trials for which I have implemented an analyzer that just outputs
> Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
> not exactly perfect regarding search performance, in a test corpus with
> a few million tokens. The number of tokens in production mode is
> expected to be much larger, so I wonder whether this approach is
> promising at all.
> Does Lucene (4.0?) provide optimization techniques for extremely small
> vocabulary sizes?
> Thank you very much,
> Carsten Schnober
> --
> Institut für Deutsche Sprache |
> Projekt KorAP                 |
> Tel. +49-(0)621-43740789      |
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform