lucene-java-user mailing list archives

From Glen Newton <glen.new...@gmail.com>
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Thu, 13 Dec 2012 15:09:11 GMT
> Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.

If this could be fixed (i.e. by indexing the _end_ of a span as well as
the start), I think everything I want to do, and everything that can
already be done very easily in GATE, would be possible using Mike's
suggested method.
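
Until then, one possible workaround is to carry the end position in the
payload of the token that is inserted at the span start. A minimal
sketch, assuming a hypothetical helper class SpanEndPayload (not part of
Lucene) and an inclusive end position stored as a 4-byte int. In Lucene
4.0 the bytes would be wrapped in a BytesRef and set through
PayloadAttribute in a custom TokenFilter; that wiring is left out here:

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not a Lucene class): packs a span's end position into payload bytes. */
final class SpanEndPayload {
    private SpanEndPayload() {}

    /** Encode the inclusive end position of a span as a 4-byte big-endian int. */
    static byte[] encode(int endPosition) {
        if (endPosition < 0) {
            throw new IllegalArgumentException("endPosition must be >= 0");
        }
        return ByteBuffer.allocate(4).putInt(endPosition).array();
    }

    /** Decode the end position written by encode(). */
    static int decode(byte[] payload) {
        if (payload == null || payload.length != 4) {
            throw new IllegalArgumentException("expected a 4-byte payload");
        }
        return ByteBuffer.wrap(payload).getInt();
    }

    /**
     * True if the span that starts at startPosition (the position Lucene
     * already records) and ends at the position stored in the payload
     * contains the given token position.
     */
    static boolean covers(int startPosition, byte[] payload, int position) {
        return position >= startPosition && position <= decode(payload);
    }
}
```

At search time the start position comes from the postings and the end
from the decoded payload, so a check like covers(start, payload, pos)
can be applied to each candidate match. This costs 4 bytes per span
token; a variable-length encoding of the span length would be smaller.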


-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
> <Wu.Stephen@mayo.edu> wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute.
>
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
>
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.
>
> For the span-like attributes (e.g. a syntactic parse, semantically
> normalized phrase) I think you'd need to do something like
> SynonymFilter in your analysis, i.e. insert new tokens at the position
> where the span started.  Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
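
For the per-token case Mike describes (part of speech in a payload), the
encode/decode side could look roughly like the sketch below. The class
PosPayload, the tag set and the 2.0 boost are all assumptions made for
illustration, not Lucene APIs; the byte array would be set as the payload
by a custom TokenFilter at index time and decoded wherever the payload is
scored for PayloadTermQuery:

```java
/** Hypothetical helper (not a Lucene class): one-byte part-of-speech payloads. */
final class PosPayload {
    private PosPayload() {}

    /** Illustrative tag set; a real tagger would define its own. */
    enum Pos { NOUN, VERB, ADJECTIVE, OTHER }

    /** Encode the tag as a single byte holding its ordinal. */
    static byte[] encode(Pos tag) {
        return new byte[] { (byte) tag.ordinal() };
    }

    /** Decode a payload; missing or unrecognized payloads map to OTHER. */
    static Pos decode(byte[] payload) {
        if (payload == null || payload.length != 1) {
            return Pos.OTHER;
        }
        int ordinal = payload[0] & 0xFF;
        Pos[] values = Pos.values();
        return ordinal < values.length ? values[ordinal] : Pos.OTHER;
    }

    /** Assumed scoring rule: boost a match whose tag equals the wanted tag. */
    static float scoreFactor(byte[] payload, Pos wanted) {
        return decode(payload) == wanted ? 2.0f : 1.0f;
    }
}
```

One byte per position keeps the postings small, at the price of limiting
the tag set to 256 values.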



-- 
-
http://zzzoot.blogspot.com/
-
