lucene-java-user mailing list archives

From Michael McCandless
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Tue, 18 Dec 2012 11:36:38 GMT
On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober wrote:
> On 13.12.2012 12:27, Michael McCandless wrote:
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>> So for example part-of-speech is a per-Token-position attribute.
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
> This is a relatively easy example, but how would you deal with e.g.
> annotations that include multiple tokens (as in spans), such as chunks,
> or relations between tokens (and token spans), as in the coreference
> links example given by Steven above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

E.g. a synonym rule "wireless network" -> "wifi" would insert a new
token ("wifi") overlapping "wireless", i.e. at the same position.
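
To make the overlap concrete, here is a minimal plain-Java sketch (no
Lucene dependency) of how position increments resolve to absolute
positions. SketchToken and OverlapSketch are hypothetical names used
only for illustration, and the token order shown (original first, then
the injected synonym) is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: term text plus position increment.
final class SketchToken {
    final String term;
    final int positionIncrement; // 0 means "same position as the previous token"

    SketchToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

final class OverlapSketch {
    // Absolute position is the running sum of increments, starting at -1
    // so that a first token with increment 1 lands on position 0.
    static List<Integer> absolutePositions(List<SketchToken> tokens) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1;
        for (SketchToken t : tokens) {
            pos += t.positionIncrement;
            positions.add(pos);
        }
        return positions;
    }

    // "wireless network" with the synonym "wifi" injected on top of "wireless".
    static List<SketchToken> example() {
        List<SketchToken> tokens = new ArrayList<>();
        tokens.add(new SketchToken("wireless", 1));
        tokens.add(new SketchToken("wifi", 0)); // overlaps "wireless"
        tokens.add(new SketchToken("network", 1));
        return tokens;
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = example();
        List<Integer> positions = absolutePositions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i).term + " @ " + positions.get(i));
        }
    }
}
```

Note that "wifi" and "wireless" share a position, while nothing in
this stream records that "wifi" also covers "network".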

Lucene doesn't store the end of the span, but if this is really
important for your use case, you could add a payload to that wifi
token that encodes the number of positions the inserted token spans
(2 in this case), so that the information would be present in the index.
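
A minimal sketch of one possible payload encoding, in plain Java. The
4-byte big-endian layout and the names SpanPayloadSketch and
encodeSpanLength are assumptions made for illustration; in a custom
TokenFilter the resulting bytes would presumably be attached to the
token as its payload:

```java
import java.nio.ByteBuffer;

final class SpanPayloadSketch {
    // Encode the number of positions a token spans as a 4-byte
    // big-endian int (ByteBuffer's default byte order).
    static byte[] encodeSpanLength(int positions) {
        if (positions < 1) {
            throw new IllegalArgumentException("a token spans at least one position");
        }
        return ByteBuffer.allocate(4).putInt(positions).array();
    }

    public static void main(String[] args) {
        // "wifi" stands in for "wireless network", so it spans 2 positions.
        byte[] payload = encodeSpanLength(2);
        System.out.println(payload.length + " payload bytes");
    }
}
```

A single byte or a variable-length int would be more compact; a fixed
4-byte int just keeps the sketch simple.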

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...
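
As a rough illustration of the decoding side, here is a plain-Java
sketch that turns such a payload back into an end position. It assumes
the payload holds the span length as a 4-byte big-endian int (an
assumed format, not something Lucene prescribes); SpanDecodeSketch and
its methods are hypothetical names, and how the payload bytes and
start position are obtained at search time is left out:

```java
import java.nio.ByteBuffer;

final class SpanDecodeSketch {
    // Read the span length back from a payload slice. The offset/length
    // pair mirrors how payload bytes are often exposed as a slice of a
    // larger array.
    static int decodeSpanLength(byte[] payload, int offset, int length) {
        if (length != 4) {
            throw new IllegalArgumentException("expected a 4-byte payload, got " + length);
        }
        return ByteBuffer.wrap(payload, offset, length).getInt();
    }

    // Last position covered by the token (inclusive).
    static int endPosition(int startPosition, int spanLength) {
        return startPosition + spanLength - 1;
    }

    public static void main(String[] args) {
        byte[] payload = {0, 0, 0, 2}; // span length 2, as for "wifi"
        int spanLength = decodeSpanLength(payload, 0, payload.length);
        System.out.println("end position: " + endPosition(0, spanLength));
    }
}
```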

Mike McCandless
