lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wu, Stephen T., Ph.D." <Wu.Step...@mayo.edu>
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Thu, 13 Dec 2012 15:19:40 GMT
That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd
use?  I thought payloads were completely token-based and could not be
attached to spans regardless.  And the SynonymFilter is really to mimic the
behavior of multiple tokens/span... (though maybe you could add the other
tokens in as "synonyms" and then skip the tokens you added...?).
Mike, is all this stuff possible if we can just index the ends of spans?

stephen


On 12/13/12 9:09 AM, "Glen Newton" <glen.newton@gmail.com> wrote:

>> Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
> 
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.
> 
> 
> -Glen
> 
> On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
>> <Wu.Stephen@mayo.edu> wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look
at,
>>>>> that would help me understand the practical issues that would need to
be
>>>>> addressed?
>>>> 
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> 
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>> 
>> So for example part-of-speech is a per-Token-position attribute.
>> 
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>> 
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>> 
>> For the span-like attributes (eg a syntactic parse, semantically
>> normalized phrase) I think you'd need to do something like
>> SynonymFilter in your analysis, i.e. insert new tokens at the position
>> where the span started.  Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message