Date: Fri, 6 Sep 2013 08:13:11 -0400
Subject: Re: LookaheadTokenFilter
From: Benson Margulies <benson@basistech.com>
To: java-user@lucene.apache.org

On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
> > I'm trying to work through the logic of reading ahead until I've seen
> > a marker for the end of a sentence, then applying some analysis to all of the
> > tokens of the sentence, and then changing some attributes of each token to
> > reflect the results.
> >
> > The queue of tokens for a position is just a State, so there isn't an API
> > there to set any values.
> >
> > So do I need to subclass Position for myself, store the additional
> > information in there, and set the attributes as each token comes by on the
> > output side?
>
> Yes, that sounds right.  Either that or, on emitting the eventual
> Tokens, apply your logic there (because at that point, after
> restoreState, you have access to all the attr values for that token).
>
> > I would be grateful for a bit more explanation of afterPosition versus
> > incrementToken; some of the mock classes call peek from afterPosition, and
> > I expected to see peek called in incrementToken based on the javadoc.
>
> afterPosition is where your subclass can "insert" new tokens.
>
> I think (it's been a while here...) you are allowed to call peekToken
> in afterPosition; this is necessary if your logic about inserting
> additional tokens leaving a given position depends on future tokens.
>
> But: are you doing any new token insertion?  Or are you just tweaking
> the attributes of the tokens that pass through the filter?  If it's
> the latter then this class may be overkill ... you could make a simple
> TokenFilter.incrementToken that just enumerates & saves all input
> tokens, does its processing, then returns those tokens one by one,
> instead.

I'm not adding tokens yet, but I will be soon, so using this class
isn't entirely crazy. The underlying capability here includes
decompounding. I have mixed feelings about just adding all the
fragments to the token stream, as it can reduce precision, but there
isn't an obvious alternative, except perhaps to suppress the
super-common ones.
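
To make the "suppress the super-common ones" idea concrete, here is a
minimal sketch in plain Java. It is not Lucene code: the class name,
the tooCommon set, and the example fragments are all hypothetical, and
how such a set would be chosen (for example, by document frequency) is
left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: drops compound fragments that are too common to be useful. */
final class FragmentSuppressor {
    // Assumption: the caller supplies this set; nothing here computes it.
    private final Set<String> tooCommon;

    FragmentSuppressor(Set<String> tooCommon) {
        this.tooCommon = tooCommon;
    }

    /** Returns the fragments that are not in the too-common set, in their original order. */
    List<String> keep(List<String> fragments) {
        List<String> kept = new ArrayList<>();
        for (String fragment : fragments) {
            if (!tooCommon.contains(fragment)) {
                kept.add(fragment);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        FragmentSuppressor s =
            new FragmentSuppressor(new HashSet<>(Arrays.asList("haupt")));
        // Illustrative fragments only; a real decompounder decides the split.
        System.out.println(s.keep(Arrays.asList("haupt", "bahnhof"))); // prints [bahnhof]
    }
}
```

Only the surviving fragments would then be added to the token stream.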

So, to summarize, logic might be:

in incrementToken:

If positions.getMaxPos() > -1, just return nextToken(). If not, loop
calling peekToken to acquire a sentence, process the sentence, and
attach the lemmas and compound pieces to the Position subclass
objects.

in afterPosition, as each token comes 'into focus', splat the lemma
from the Position into the char term attribute, and insert new tokens
as needed for the compound components.
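
To check the control flow, here is a rough Lucene-free model of the
logic above, in plain Java. It is only a sketch: none of these names
are Lucene API. Pos stands in for the Position subclass, the pending
queue for the buffered positions, and the inserted queue for the
tokens that afterPosition would add. The analysis is a toy
(lowercasing as the "lemma", splitting on '-' as "decompounding"), and
a literal "." is assumed to be the end-of-sentence marker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of "buffer a sentence, analyze it, then emit"; all names are hypothetical. */
final class SentenceLookahead {

    /** Stand-in for a Position subclass: the buffered token plus analysis results. */
    static final class Pos {
        final String surface;
        String lemma;                                  // set by analyze()
        final List<String> parts = new ArrayList<>();  // compound components, if any
        Pos(String surface) { this.surface = surface; }
    }

    private final Iterator<String> input;
    private final Deque<Pos> pending = new ArrayDeque<>();     // analyzed, not yet emitted
    private final Deque<String> inserted = new ArrayDeque<>(); // extra tokens waiting to go out

    SentenceLookahead(Iterator<String> input) { this.input = input; }

    /** Toy analysis: lowercase as the "lemma", split on '-' as "decompounding". */
    static void analyze(List<Pos> sentence) {
        for (Pos p : sentence) {
            p.lemma = p.surface.toLowerCase(Locale.ROOT);
            if (p.lemma.contains("-")) {
                p.parts.addAll(Arrays.asList(p.lemma.split("-")));
            }
        }
    }

    /** Returns the next output term, or null at end of input. */
    String incrementToken() {
        if (!inserted.isEmpty()) {
            return inserted.poll();            // emit previously inserted compound pieces
        }
        if (pending.isEmpty()) {
            // Nothing buffered: read ahead to the sentence marker (or end of input).
            List<Pos> sentence = new ArrayList<>();
            while (input.hasNext()) {
                Pos p = new Pos(input.next());
                sentence.add(p);
                if (p.surface.equals(".")) break;
            }
            if (sentence.isEmpty()) return null;
            analyze(sentence);
            pending.addAll(sentence);
        }
        Pos p = pending.poll();
        inserted.addAll(p.parts);              // the "afterPosition" step: queue the pieces
        return p.lemma;                        // the "splat the lemma" step
    }

    /** Drains the filter into a list. */
    static List<String> run(List<String> tokens) {
        SentenceLookahead f = new SentenceLookahead(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = f.incrementToken(); t != null; t = f.incrementToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Dogs", "chase", "Tennis-Balls", ".", "Cats", ".")));
    }
}
```

In the real filter the lemma would presumably be written into the
char term attribute after the state is restored, rather than returned
as a string.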

Thanks,
benson


>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
