Message-ID: <4F3441C6.7040400@gmail.com>
Date: Thu, 09 Feb 2012 22:59:34 +0100
From: Damerian
To: java-user@lucene.apache.org
Subject: Re: Access next token in a stream
In-Reply-To: <6C78E97C707B5B4C8CC61D44F87545860D375B@SUEX10-mbx-03.ad.syr.edu>

On 9/2/2012 10:51 PM, Steven A Rowe wrote:
> Damerian,
>
> The technique I mentioned would work for you with a little tweaking: when you see consecutive capitalized tokens, then just set the CharTermAttribute to the joined tokens, and clear the previous token.
>
> Another idea: you could use ShingleFilter with min size = max size = 2, and then use a following Filter extending FilteringTokenFilter, with an accept() method that examines shingles and rejects ones that don't qualify, something like the following.
> (Notes: this is untested; I assume you will use the default shingle token separator " "; and this filter will reject all non-shingle terms, so you won't get anything but names, even if you configure ShingleFilter to emit single tokens):
>
> public final class MyNameFilter extends FilteringTokenFilter {
>   private static final Pattern NAME_PATTERN
>       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
>   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>   @Override public boolean accept() throws IOException {
>     return NAME_PATTERN.matcher(termAtt).matches();
>   }
> }
>
> Steve
>
>> -----Original Message-----
>> From: Damerian [mailto:dameriangr@gmail.com]
>> Sent: Thursday, February 09, 2012 4:15 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Access next token in a stream
>>
>> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
>>> Hi Damerian,
>>>
>>> One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end-of-stream). Your incrementToken() method could look something like:
>>>
>>> 1. Get current attributes: input.incrementToken()
>>> 2. If previous token does not exist:
>>>    2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
>>>    2b. Get current attributes: input.incrementToken()
>>> 3. Check for & store conditions that will affect previous token's attributes
>>> 4. Store current attributes as next token (see AttributeSource#cloneAttributes)
>>> 5. Copy previous token into current attributes (see AttributeSource#copyTo); the target will be "this", which is an AttributeSource.
>>> 6. Make changes based on conditions found in step #3 above
>>> 7. set previous token = next token
>>> 8. return true
>>>
>>> (Everywhere I say "token" I mean "instance of AttributeSource".)
>>>
>>> The final token in the input stream will need special handling, as will single-token input streams.
>>>
>>> Good luck,
>>> Steve
>>>
>>>> -----Original Message-----
>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>> Sent: Thursday, February 09, 2012 2:19 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Access next token in a stream
>>>>
>>>> Hello, I want to implement my custom filter. My question is quite simple,
>>>> but I cannot find a solution to it no matter how I try:
>>>>
>>>> How can I access the TermAttribute of the token following the one I
>>>> currently have in my stream?
>>>>
>>>> For example, in the phrase "My name is James Bond", if let's say I am at
>>>> the token [My], I would like to be able to check the TermAttribute of
>>>> the following token [name] and fix my position increment accordingly.
>>>>
>>>> Thank you in advance!
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> Hi Steve,
>> Thank you for your immediate reply. I will try your solution, but I feel
>> that it does not solve my case.
>> What I am trying to make is a filter that joins together two
>> terms/tokens that start with a capital letter (it is trying to find all
>> the names/surnames and make them one token). So in my aforementioned
>> example, when I examine [James], even if I store the TermAttribute in a
>> temporary token, how can I check the next one, [Bond], and join them
>> without actually emitting a token (and therefore creating a term in my
>> inverted index) that has [James] on its own?
>> Thank you again for your insight; I would really appreciate any other
>> views on the matter.
>>
>> Regards, Damerian

I think my solution is almost complete now; only one question remains. You mentioned "clear the previous token". Is there a built-in method for doing that?

In the beginning I thought that if I put my new token at the same position (position increment 0) it would "overwrite" the previous one, but all I managed to do was inject the new token alongside the old one. My method so far is this:

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Case where the previous token did NOT start with a capital letter followed by lowercase letters
        if (previousTokenCanditateMainName == false) {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = true;
                // This is the token I need to "delete".
                tempString = this.termAtt.term();
                // myToken.offsetAtt=this.offsetAtt;
                tempStartOffset = this.offsetAtt.startOffset();
                tempEndOffset = this.offsetAtt.endOffset();
                //this.nextInputStreamToken.clearAttributes();
                return true;
            } else {
                return true;
            }
        }
        // Case where the previous token WAS a proper name (starting with a capital and continuing with lowercase letters)
        else {
            if (CheckIfMainName(termAtt.term())) {
                previousTokenCanditateMainName = false;
                posIncrAtt.setPositionIncrement(0);
                String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();
                //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
                termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
                offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
                return true;
            } else {
                previousTokenCanditateMainName = false;
                return true;
            }
        }
    }

The CheckIfMainName() method is a simple custom-made method that decides whether the token fulfills the criteria.
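
One possible way to avoid emitting [James] on its own is to hold a capitalized token back until the next token has been inspected, instead of emitting it and trying to remove it afterwards. The following is only a minimal sketch of that hold-back logic on a plain list of strings, not a Lucene filter: NameJoiner, join and isNameCandidate are hypothetical names (the last one stands in for CheckIfMainName()), the separator value is assumed, and position increments and offsets are left out. In a real TokenFilter the held-back token would presumably be buffered with AttributeSource's captureState()/restoreState() rather than a String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NameJoiner {
    // Assumed separator; the value of TOKEN_SEPARATOR is not shown in the original code.
    private static final String TOKEN_SEPARATOR = " ";

    // Hypothetical stand-in for CheckIfMainName(): one capital letter
    // followed by one or more lowercase letters.
    static boolean isNameCandidate(String term) {
        return term.matches("\\p{Lu}\\p{Ll}+");
    }

    // Joins each pair of adjacent name candidates into a single token.
    // A candidate is held back, not emitted, until the next token is seen.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String pending = null; // held-back candidate, not yet emitted
        for (String term : tokens) {
            if (pending == null) {
                if (isNameCandidate(term)) {
                    pending = term;   // hold back, emit nothing yet
                } else {
                    out.add(term);
                }
            } else if (isNameCandidate(term)) {
                out.add(pending + TOKEN_SEPARATOR + term); // emit only the joined token
                pending = null;
            } else {
                out.add(pending);     // no partner followed: emit it alone
                out.add(term);
                pending = null;
            }
        }
        if (pending != null) {
            out.add(pending);         // end of stream: flush the held-back token
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [My, name, is, James Bond]
        System.out.println(join(Arrays.asList("My", "name", "is", "James", "Bond")));
    }
}
```

With this approach there is nothing to "clear": [James] never reaches the output unless no second name follows it. The end-of-stream flush corresponds to the special handling of the final token mentioned in the quoted steps.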
Once again, thank you very much for your help and for the time you spent helping me.

Regards,
/Damerian
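
As a quick check of the shingle-based suggestion quoted above, the NAME_PATTERN regular expression can be tried on its own with java.util.regex, without Lucene. This sketch only shows which two-word shingles of the example phrase the pattern would accept; the class name ShingleCheck and the helper isName are made up for illustration.

```java
import java.util.regex.Pattern;

public class ShingleCheck {
    // Same expression as in the quoted MyNameFilter: two or more
    // whitespace-separated words, each starting with an uppercase letter.
    private static final Pattern NAME_PATTERN =
            Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // True if the whole shingle matches the name pattern.
    static boolean isName(CharSequence shingle) {
        return NAME_PATTERN.matcher(shingle).matches();
    }

    public static void main(String[] args) {
        String[] shingles = {"My name", "name is", "is James", "James Bond", "Bond"};
        for (String s : shingles) {
            System.out.println(s + " -> " + isName(s));
        }
    }
}
```

Of these inputs only "James Bond" is accepted. The single token "Bond" is rejected because the pattern requires at least two words, which is consistent with the note that non-shingle terms are dropped.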