lucene-java-user mailing list archives

From SUJIT PAL <sujit....@comcast.net>
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Thu, 13 Dec 2012 22:29:41 GMT
Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did for a similar
requirement was to combine the tokens into a single "_"-delimited token and attach
the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would like to attach
payloads. I run the text through a custom tokenizer that produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the Three_Little_Pigs$payload2 down.

In my case this makes sense, i.e., I can treat the span as a single unit. I'm not sure whether
that holds for your use case.
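
To make the merge step concrete, here is a rough plain-Java sketch. It does not use any Lucene
API: the class name SpanPayloadMerger, the Span holder and the "$" separator are made up for
illustration, and it assumes the spans are sorted by start position and do not overlap. In a
real analyzer the payload would still have to be split off the token text and set on the token
(something like DelimitedPayloadTokenFilter could do that part), which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpanPayloadMerger {

    /** Hypothetical holder: token positions [start, end) plus a payload label. */
    static final class Span {
        final int start;
        final int end;
        final String payload;

        Span(int start, int end, String payload) {
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    /**
     * Joins the tokens covered by each span with '_' and appends "$payload".
     * Tokens outside any span are copied through unchanged.
     * Assumes spans are sorted by start and non-overlapping.
     */
    static List<String> merge(List<String> tokens, List<Span> spans) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (Span s : spans) {
            // Copy the tokens that precede this span.
            while (pos < s.start) {
                out.add(tokens.get(pos++));
            }
            // Collapse the span into one token and tag it with its payload.
            String joined = String.join("_", tokens.subList(s.start, s.end));
            out.add(joined + "$" + s.payload);
            pos = s.end;
        }
        // Copy whatever follows the last span.
        while (pos < tokens.size()) {
            out.add(tokens.get(pos++));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The Big Bad Wolf huffed and puffed and blew the house"
                + " of the Three Little Pigs down.";
        List<String> tokens = Arrays.asList(text.split(" "));
        List<Span> spans = Arrays.asList(
                new Span(1, 4, "payload1"),     // Big Bad Wolf
                new Span(13, 16, "payload2"));  // Three Little Pigs
        System.out.println(String.join(" ", merge(tokens, spans)));
    }
}
```

The trade-off is the one mentioned above: after merging, the span can only be matched as a
single unit, so the individual words inside it are no longer searchable on their own unless
they are also indexed separately.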

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

> Cool! Sounds great!  :-)
> 
> Any pointers to a (Lucene) example that attaches a payload to a
> start..end span that is more than one token?
> 
> thanks,
> -Glen
> 
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog <goksron@gmail.com> wrote:
>> I should not have added that note. The Opennlp patch gives a concrete
>> example of adding an annotation to text.
>> 
>> 
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>> 
>>> It is not clear this is exactly what is needed/being discussed.
>>> 
>>> From the issue:
>>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>>> the same position."
>>> 
>>> This adds it to a token, not a span. 'same position' does not suggest
>>> it also records the end position.
>>> 
>>> -Glen
>>> 
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog <goksron@gmail.com> wrote:
>>>> 
>>>> Parts-of-speech is available now, in the indexer.
>>>> 
>>>> LUCENE-2899 adds OpenNLP to the Lucene & Solr codebase. It does
>>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>>> Apache project for natural-language processing.
>>>> 
>>>> Some parts are in Solr that could be in Lucene.
>>>> 
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>> 
>>>> 
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>> 
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>>>> that would help me understand the practical issues that would need to be
>>>>>>> addressed?
>>>>>> 
>>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>>> needing to record in the postings and access at search time?
>>>>> 
>>>>> For example:
>>>>>   - part of speech of a token.
>>>>>   - syntactic parse subtree (over a span).
>>>>>   - semantically normalized phrase (to canonical text or ontological
>>>>> code).
>>>>>   - semantic group (of a span).
>>>>>   - coreference link.
>>>>> 
>>>>> stephen
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -
> 
> 



