lucene-java-user mailing list archives

From Danil ŢORIN
Subject Re: Small Vocabulary
Date Tue, 07 Aug 2012 09:31:53 GMT
I mean "ADJ:brown" as the token and only the <payload> as the payload,
since you probably use it only for scoring/postprocessing, not for the
actual matching.

You can even write a filter that emits both tokens, "ADJ" and
"ADJ:brown", at the same position (so you'll be able to do phrase
queries), and still keep the join capability.
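
A minimal sketch of that stacked-token idea, written in plain Java with no
Lucene dependency so that it runs on its own. In a real analyzer this logic
would live in a TokenFilter, with the position set through
PositionIncrementAttribute and the payload through PayloadAttribute. The
names StackedTagTokens, Tok and expand are hypothetical, and attaching the
payload to both tokens is an assumption, not something your setup requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a token stream; not part of the Lucene API.
class StackedTagTokens {

    // One emitted token: term text, position increment, and payload.
    static final class Tok {
        final String term;
        final int posInc;      // 1 = next position, 0 = same position as the previous token
        final String payload;  // kept out of the term; meant for scoring/postprocessing only

        Tok(String term, int posInc, String payload) {
            this.term = term;
            this.posInc = posInc;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Expands input such as "ADJ:brown|p1 NOUN:fox|p2" into stacked tokens:
    // the bare tag advances the position, the "TAG:word" token shares it.
    static List<Tok> expand(String annotated) {
        List<Tok> out = new ArrayList<>();
        for (String item : annotated.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // empty input yields no tokens
            }
            int bar = item.indexOf('|');
            String term = bar < 0 ? item : item.substring(0, bar);
            String payload = bar < 0 ? "" : item.substring(bar + 1);
            int colon = term.indexOf(':');
            if (colon > 0) {
                // Assumption: the same payload is attached to both tokens.
                out.add(new Tok(term.substring(0, colon), 1, payload)); // e.g. "ADJ"
                out.add(new Tok(term, 0, payload));                     // e.g. "ADJ:brown"
            } else {
                out.add(new Tok(term, 1, payload)); // untagged input passes through
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("ADJ:brown|p1 NOUN:fox|p2"));
    }
}
```

With this layout "ADJ" and "ADJ:brown" share one position and "NOUN" and
"NOUN:fox" share the next, so a phrase query such as "ADJ NOUN" or
"ADJ:brown NOUN" can match on adjacent positions.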

On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober wrote:
> On 07.08.2012 10:20, Danil ŢORIN wrote:
> Hi Danil,
>> If you do intersection (not join), maybe it makes sense to put
>> everything into one index?
> Just a note on that: my application performs intersections and joins
> (unions) on the results, depending on the query. So the index structure
> has to be ready for both, but intersections are clearly more complicated.
>> Just transform your input like "brown fox" into "ADJ:brown|<your
>> payload> NOUN:fox|<other payload>"
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
> actual token and "brown" and "fox" as payloads (followed by <other
> payload>), right?
> This is a very neat approach and I have vaguely considered that. One
> problem is that I aim for a very high level of flexibility, meaning that
> additional annotations have to be addable at any point and different
> tokenizations apply. However, I will re-consider your suggestion,
> possibly applying one of multiple tokenizations as a default in this sense.
>> Of course I'm not aware of all the details, so my solution might not
>> be applicable to your project.
>> Maybe you could share more details, so this won't turn into an "XY problem".
>> Keep in mind: always optimize your index for the query use case,
>> instead of blindly processing the input data.
> Thanks for that reminder; this becomes quite difficult in my scenario
> though since we want to allow for flexible changes in the index types,
> representing different annotations, tokenization logics etc.
> Best,
> Carsten
> --
> Institut für Deutsche Sprache |
> Projekt KorAP                 |
> Tel. +49-(0)621-43740789      |
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
