lucene-java-user mailing list archives

From Carsten Schnober
Subject Small Vocabulary
Date Mon, 30 Jul 2012 13:07:27 GMT
Dear list,
I'm considering using Lucene to index sequences of part-of-speech
(POS) tags instead of words. For those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search efficiently for occurrences of a tag
such as "ADJA".

The question is whether Lucene can handle such data well. The statistical
properties of these pseudo-texts are very different from those of natural
language texts, which makes me wonder whether Lucene's inverted indexes
are suitable. I suppose the small vocabulary size in particular (<50
distinct tokens, depending on the tagging system) is problematic.
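
To make the concern concrete, here is a minimal sketch of a toy positional
inverted index over tags. It is plain Java, not Lucene's API, and the class
and method names (PosInvertedIndex, build, averagePostingLength) are
hypothetical, chosen only for illustration. With N tokens and fewer than 50
distinct tags, the average posting list holds more than N/50 positions, so
a query for a single tag has to walk a large share of the corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy positional inverted index; NOT Lucene's API, all names are hypothetical.
public class PosInvertedIndex {

    // Maps each tag to the list of token positions at which it occurs.
    static Map<String, List<Integer>> build(String tagSequence) {
        Map<String, List<Integer>> postings = new HashMap<>();
        String[] tags = tagSequence.trim().split("\\s+");
        for (int pos = 0; pos < tags.length; pos++) {
            postings.computeIfAbsent(tags[pos], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // Average posting-list length = total tokens / number of distinct tags.
    static double averagePostingLength(Map<String, List<Integer>> postings) {
        int total = 0;
        for (List<Integer> list : postings.values()) {
            total += list.size();
        }
        return postings.isEmpty() ? 0.0 : (double) total / postings.size();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = build("ART ADV ADJA NN ART ADJA NN");
        System.out.println("ADJA positions: " + postings.get("ADJA"));
        System.out.println("average posting length: " + averagePostingLength(postings));
    }
}
```

In a natural-language index most terms are rare and their posting lists are
short; here every list grows linearly with the corpus.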

For a first trial, I implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc. On a test corpus with
a few million tokens, search performance was less than ideal. The
number of tokens in production is expected to be much larger, so I
wonder whether this approach is promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober

Institut für Deutsche Sprache |
Projekt KorAP                 |
Tel. +49-(0)621-43740789      |
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
