Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of benson@basistech.com
 designates 209.85.220.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAOdYfZXkU_XuzXaQ-E_8oYyBDCYPK+AwP+7B0im_M3tdOj+iZA@mail.gmail.com>
References: 
 <CALm0H55u1SW4tncN3J3k7tJL25skVyfRwSDCJedr8fkqC=Ww0g@mail.gmail.com>
	<CAOdYfZVXp6a7qESAgYeACw7GLahSApeqXMOL1-WGyf-9A9_zgg@mail.gmail.com>
	<CALm0H577Vr5T3a=BrcsExd9mxC5bOxe-jiCZ7hVcervsm1qCQw@mail.gmail.com>
	<CAOdYfZXkU_XuzXaQ-E_8oYyBDCYPK+AwP+7B0im_M3tdOj+iZA@mail.gmail.com>
Date: Sat, 7 Sep 2013 07:44:01 -0400
Message-ID: 
 <CALm0H54B3jCGJzv+x=Oy5rwPWmpCjCzNuh8Bg4ZThhpragKbFA@mail.gmail.com>
Subject: Re: PositionLengthAttribute
From: Benson Margulies <benson@basistech.com>
To: java-user@lucene.apache.org
Content-Type: text/plain; charset=UTF-8

In Japanese, the pieces of a compound are just substrings of the input
string. In other languages, decompounding can manufacture entire tokens
from thin air. In those cases, it's something of a question how to
decide on the offsets. I think you're right, ultimately, insofar as
there's some offset range in the original that might as well be blamed
for any given component.
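
To make the bookkeeping concrete, here is a minimal sketch in plain Java
of the AB / ABCD / CD stream from the quoted example below. It does not
use Lucene at all: SketchToken, TokenGraphSketch, and their methods are
hypothetical names made up for illustration, and the character offsets
(0-2, 0-4, 2-4) are assumed, not taken from any real analyzer. The
manufactured() helper shows one possible fallback, assuming a component
with no substring of its own simply inherits the whole compound's
offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token's attributes; this is not a Lucene class.
final class SketchToken {
    final String term;
    final int posInc;      // position increment relative to the previous token
    final int posLength;   // number of positions the token spans
    final int startOffset; // assumed character offsets into the original text
    final int endOffset;

    SketchToken(String term, int posInc, int posLength, int startOffset, int endOffset) {
        this.term = term;
        this.posInc = posInc;
        this.posLength = posLength;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class TokenGraphSketch {
    // The stream from the quoted example: AB, then ABCD stacked on AB, then CD.
    static List<SketchToken> example() {
        List<SketchToken> tokens = new ArrayList<>();
        tokens.add(new SketchToken("AB", 1, 1, 0, 2));
        tokens.add(new SketchToken("ABCD", 0, 2, 0, 4));
        tokens.add(new SketchToken("CD", 1, 1, 2, 4));
        return tokens;
    }

    // Turn (posInc, posLength) pairs into "term[from->to]" position spans.
    static List<String> describe(List<SketchToken> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first token with posInc=1 lands at position 0
        for (SketchToken t : tokens) {
            position += t.posInc;
            out.add(t.term + "[" + position + "->" + (position + t.posLength) + "]");
        }
        return out;
    }

    // Assumed fallback: a component with no substring of its own in the
    // input is "blamed" on the offsets of the whole compound.
    static SketchToken manufactured(String term, SketchToken compound, int posInc) {
        return new SketchToken(term, posInc, 1, compound.startOffset, compound.endOffset);
    }

    public static void main(String[] args) {
        // Prints [AB[0->1], ABCD[0->2], CD[1->2]]
        System.out.println(describe(example()));
    }
}
```

The point of the span notation is that ABCD starts where AB starts and
ends where CD ends, which is exactly what posinc=0 plus posLength=2
encodes.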


On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir <rcmuir@gmail.com> wrote:
> On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies <benson@basistech.com> wrote:
>> On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir <rcmuir@gmail.com> wrote:
>>> it's the latter. the way it's designed to work, i think, is illustrated
>>> best in the kuromoji analyzer, where it heuristically decompounds nouns:
>>>
>>> if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
>>> these both have posinc=1.
>>> however (to compensate for the precision issue you mentioned on the other
>>> thread), it keeps the full compound as a synonym too (there are some
>>> papers benchmarking this approach for decompounding; just think of IDF
>>> etc. sorting things out).
>>> so that ABCD synonym has position increment 0, and it "sits" at the
>>> same position as the first token (AB). but it has positionLength=2,
>>> which basically keeps the information in the chain that this "synonym"
>>> spans across both AB and CD.
>>>
>>> so the output is like this: AB(posinc=1, posLength=1),
>>> ABCD(posinc=0, posLength=2), CD(posinc=1, posLength=1)
>>
>> I suppose this works best if you actually know the offsets of the
>> pieces. In disassembling German, this is not always straightforward.
>>
>
> i don't really see how it has anything to do with natural languages?
> it's just the way you represent the compound components in the
> tokenstream.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org