lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Thu, 13 Dec 2012 11:27:39 GMT
On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
<Wu.Stephen@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.

So for example part-of-speech is a per-Token-position attribute.

Today the easiest way to handle this is to encode these attributes
into a Payload, which is straightforward (make a custom TokenFilter
that creates the payload).

At search time you would then use e.g. PayloadTermQuery to decode the
Payload and do something with it to alter how the query is being
scored.
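
A minimal sketch of the encode/decode step described above, in plain Java
with no Lucene dependency. PosPayload, its Pos enum, and the boost values
are hypothetical names chosen for illustration. In a real TokenFilter the
encoded bytes would be attached to the token's payload, and the decode step
would run inside whatever payload-scoring hook the query uses; that wiring
is omitted here.

```java
/** Hypothetical helper (not part of Lucene): packs a part-of-speech tag into payload bytes. */
final class PosPayload {

    /** Illustrative tag set; a real tagger would define its own. */
    enum Pos { NOUN, VERB, ADJ, ADV, OTHER }

    /** Index time: one byte per token position, holding the tag's ordinal. */
    static byte[] encode(Pos pos) {
        return new byte[] { (byte) pos.ordinal() };
    }

    /** Search time: recover the tag, falling back to OTHER on missing or unknown bytes. */
    static Pos decode(byte[] payload) {
        if (payload == null || payload.length != 1) {
            return Pos.OTHER;
        }
        int ordinal = payload[0] & 0xFF;
        Pos[] values = Pos.values();
        return ordinal < values.length ? values[ordinal] : Pos.OTHER;
    }

    /** Assumed scoring rule: double the score when the token carries the wanted tag. */
    static float boost(byte[] payload, Pos wanted) {
        return decode(payload) == wanted ? 2.0f : 1.0f;
    }
}
```

Storing the ordinal keeps the payload to a single byte per position, at the
cost of breaking existing indexes if the enum order ever changes.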

For the span-like attributes (e.g. a syntactic parse, semantically
normalized phrase) I think you'd need to do something like
SynonymFilter in your analysis, i.e. insert new tokens at the position
where the span started.  Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.
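
The limitation can be seen in a small plain-Java model of a token stream
(no Lucene dependency). SpanTokenSketch, Tok, injectSpan, and the "NP" label
are hypothetical names for illustration only. The sketch stacks a span label
on the position where the span starts, by giving it a position increment of
0, and then computes the absolute positions an index would record.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a token stream; not Lucene's actual classes. */
final class SpanTokenSketch {

    /** A token as analysis emits it: its text plus a position increment. */
    static final class Tok {
        final String text;
        final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /**
     * Adds a span label right after the token at startIndex with increment 0,
     * so the label shares that token's position.
     */
    static List<Tok> injectSpan(List<Tok> tokens, int startIndex, String label) {
        List<Tok> out = new ArrayList<>(tokens);
        out.add(startIndex + 1, new Tok(label, 0));
        return out;
    }

    /** Absolute positions: a running sum of increments, starting before the first token. */
    static List<String> positions(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            out.add(t.text + "@" + pos);
        }
        return out;
    }
}
```

For the tokens "the quick fox jumps" with an "NP" label injected at index 0,
the label lands on position 0 alongside "the". Nothing in the result says
that the span runs through "fox", which is the missing end position.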

Mike McCandless

http://blog.mikemccandless.com


