lucene-java-user mailing list archives

From Carsten Schnober <>
Subject Re: Small Vocabulary
Date Tue, 07 Aug 2012 09:13:32 GMT
On 07.08.2012 10:20, Danil ŢORIN wrote:

Hi Danil,

> If you do intersection (not join), maybe it makes sense to put
> everything into 1 index?

Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be ready for both, but intersections are clearly more complicated.
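To make the two operations concrete, here is a minimal sketch. It assumes, purely for illustration, that each query against a separate index yields a set of (document id, token position) hits; the variable names and sample data are hypothetical and not taken from the actual application.

```python
# Hypothetical hits: (doc_id, token_position) pairs from two separate indexes.
adj_hits = {(1, 0), (1, 4), (2, 7)}    # e.g. positions annotated as ADJ
brown_hits = {(1, 0), (2, 3), (3, 1)}  # e.g. positions of the word "brown"


def intersect(a, b):
    """Hits present in both result sets (same document, same position)."""
    return sorted(a & b)


def union(a, b):
    """Hits present in at least one of the result sets."""
    return sorted(a | b)


print(intersect(adj_hits, brown_hits))
print(union(adj_hits, brown_hits))
```

The intersection is only meaningful if both indexes agree on what a position is. With differing tokenizations the positions would first have to be aligned, which is one way to read why intersections are the harder case.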

> Just transform your input like "brown fox" into "ADJ:brown|<your
> payload> NOUN:fox|<other payload>"

If I understand correctly, "ADJ" and "NOUN" would be indexed as the
actual tokens, with "brown" and "fox" stored as payloads (followed by
<other payload>), right?
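A small sketch of the suggested input transformation, under stated assumptions: the helper names and the sample payloads "p1" and "p2" are hypothetical, payloads are assumed to contain no whitespace, and the sketch deliberately leaves open which part ends up as the indexed term. The "|" separator resembles the convention of Lucene's DelimitedPayloadTokenFilter, although the quoted suggestion does not name that class.

```python
def encode(tagged_tokens, delimiter="|"):
    """Turn (tag, word, payload) triples into 'TAG:word|payload' items."""
    return " ".join(
        f"{tag}:{word}{delimiter}{payload}"
        for tag, word, payload in tagged_tokens
    )


def decode(text, delimiter="|"):
    """Split whitespace-separated 'TAG:word|payload' items back into triples."""
    triples = []
    for item in text.split():
        term, _, payload = item.partition(delimiter)
        tag, _, word = term.partition(":")
        triples.append((tag, word, payload))
    return triples


encoded = encode([("ADJ", "brown", "p1"), ("NOUN", "fox", "p2")])
print(encoded)          # ADJ:brown|p1 NOUN:fox|p2
print(decode(encoded))  # [('ADJ', 'brown', 'p1'), ('NOUN', 'fox', 'p2')]
```

Whether the tag, the word, or the combined "TAG:word" string becomes the term is a separate decision made in the analysis chain; the sketch only shows that the three parts can be carried in a single input stream.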

This is a very neat approach and I have vaguely considered that. One
problem is that I aim for a very high level of flexibility: it must be
possible to add further annotations at any point, and different
tokenizations may apply. However, I will reconsider your suggestion,
possibly treating one of the tokenizations as the default and indexing
it in this way.

> Of course I'm not aware of all the details, so my solution might not
> be applicable to your project.
> Maybe you could share more details, so this won't turn into an "XY problem".
> Keep in mind: always optimize your index for the query use case,
> instead of blindly processing the input data.

Thanks for that reminder; this becomes quite difficult in my scenario,
though, since we want to allow for flexible changes in the index types,
which represent different annotations, tokenization logic, etc.

Institut für Deutsche Sprache
Projekt KorAP
Tel. +49-(0)621-43740789
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
