Subject: Re: LookaheadTokenFilter
From: Benson Margulies
To: java-user@lucene.apache.org
Date: Fri, 6 Sep 2013 08:13:11 -0400

On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless wrote:
>
> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies wrote:
> > I'm trying to work through the logic of reading ahead until I've seen
> > marker for the end of a sentence, then applying some analysis to all of the
> > tokens of the sentence, and then changing some attributes of each token to
> > reflect the results.
> >
> > The queue of tokens for a position is just a State, so there isn't an API
> > there to set any values.
> >
> > So do I need to subclass Position for myself, store the additional
> > information in there, and set the attributes as each token comes by on the
> > output side?
>
> Yes, that sounds right.  Either that or, on emitting the eventual
> Tokens, apply your logic there (because at that point, after
> restoreState, you have access to all the attr values for that token).
>
> > I would be grateful for a bit more explanation of afterPosition versus
> > incrementToken; some of the mock classes call peek from afterPosition, and
> > I expected to see peek called in incrementToken based on the javadoc.
>
> afterPosition is where your subclass can "insert" new tokens.
>
> I think (it's been a while here...) you are allowed to call peekToken
> in afterPosition; this is necessary if your logic about inserting
> additional tokens leaving a given position depends on future tokens.
>
> But: are you doing any new token insertion?  Or are you just tweaking
> the attributes of the tokens that pass through the filter?  If it's
> the latter then this class may be overkill ... you could make a simple
> TokenFilter.incrementToken that just enumerates & saves all input
> tokens, does its processing, then returns those tokens one by one,
> instead.

I'm not adding tokens yet, but I will be soon, so all of this isn't
entirely crazy. The underlying capability here includes decompounding.
(I have mixed feelings about just adding all the fragments to the token
stream, as it can reduce precision, but there isn't an obvious
alternative, except perhaps to suppress the super-common ones.)

So, to summarize, the logic might be:

In incrementToken: if positions.getMaxPos() > -1, just return
nextToken(). If not, loop calling peekToken to acquire a sentence,
process the sentence, and attach the lemmas and compound pieces to the
Position subclass objects.

In afterPosition: as each token comes 'into focus', splat the lemma from
the Position into the char term attribute, and insert new tokens as
needed for the compound components.
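The flow summarized above can be modeled in plain Java. This is only a
rough sketch under stated assumptions: it deliberately does not use
LookaheadTokenFilter, Position, peekToken, or afterPosition, and the
names Tok, Analysis, SentenceAnalyzer, Filter, and toyAnalyze, as well
as the use of "." as the sentence marker, are all hypothetical and
invented for illustration. It shows the shape of the logic: read ahead
to the end of a sentence, analyze the whole sentence, then emit each
token with its lemma as the term and its compound pieces stacked at the
same position.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Plain-Java model of the read-ahead logic. It does NOT use Lucene classes;
// every name below is hypothetical and exists only for this illustration.
final class SentenceLookaheadSketch {

    /** Stand-in for an emitted token: a term plus a position increment. */
    static final class Tok {
        final String term;
        final int posInc; // 1 = advances a position, 0 = stacked on the previous one
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/" + posInc; }
    }

    /** Stand-in for what sentence analysis attaches to one input token. */
    static final class Analysis {
        final String lemma;
        final List<String> compoundParts;
        Analysis(String lemma, List<String> compoundParts) {
            this.lemma = lemma;
            this.compoundParts = compoundParts;
        }
    }

    /** Analyzes a whole sentence, returning one Analysis per input term. */
    interface SentenceAnalyzer {
        List<Analysis> analyze(List<String> sentence);
    }

    static final class Filter {
        private final Iterator<String> input;
        private final SentenceAnalyzer analyzer;
        private final Deque<Tok> pending = new ArrayDeque<>();

        Filter(Iterator<String> input, SentenceAnalyzer analyzer) {
            this.input = input;
            this.analyzer = analyzer;
        }

        /** Models incrementToken: returns the next token, or null at end of input. */
        Tok next() {
            if (pending.isEmpty()) {
                bufferNextSentence();
            }
            return pending.poll();
        }

        // Reads ahead to the sentence marker (assumed here to be "."), analyzes
        // the sentence, then queues each lemma followed by its compound parts.
        private void bufferNextSentence() {
            List<String> sentence = new ArrayList<>();
            while (input.hasNext()) {
                String term = input.next();
                sentence.add(term);
                if (".".equals(term)) {
                    break;
                }
            }
            if (sentence.isEmpty()) {
                return;
            }
            for (Analysis a : analyzer.analyze(sentence)) {
                pending.add(new Tok(a.lemma, 1));
                for (String part : a.compoundParts) {
                    pending.add(new Tok(part, 0));
                }
            }
        }
    }

    /** Toy analyzer: lowercases each term and splits one hard-coded compound. */
    static List<Analysis> toyAnalyze(List<String> sentence) {
        List<Analysis> out = new ArrayList<>();
        for (String term : sentence) {
            String lemma = term.toLowerCase(Locale.ROOT);
            List<String> parts = lemma.equals("fussballspiel")
                    ? Arrays.asList("fussball", "spiel")
                    : Collections.<String>emptyList();
            out.add(new Analysis(lemma, parts));
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> in = Arrays.asList("Das", "Fussballspiel", ".", "Ende").iterator();
        Filter filter = new Filter(in, SentenceLookaheadSketch::toyAnalyze);
        for (Tok t = filter.next(); t != null; t = filter.next()) {
            System.out.println(t);
        }
    }
}
```

In this model the compound pieces come out with a position increment of
0, so they sit at the same position as the compound itself. In an actual
LookaheadTokenFilter subclass, the per-token results would presumably
live on the Position subclass and the stacked tokens would be inserted
from afterPosition, as described above.
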
Thanks,
benson

>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org