lucene-java-user mailing list archives

From Damerian <dameria...@gmail.com>
Subject Re: Access next token in a stream
Date Thu, 09 Feb 2012 22:14:54 GMT
On 9/2/2012 11:12 PM, Steven A Rowe wrote:
> Damerian,
>
> When I said "clear the previous token", I was referring to the pseudo-code I gave in
> my first response to you.  There is no built-in method to do that.  If you want to conditionally
> output tokens, you should store AttributeSource clones, as in my pseudo-code.
>
> Steve
>
>> -----Original Message-----
>> From: Damerian [mailto:dameriangr@gmail.com]
>> Sent: Thursday, February 09, 2012 5:00 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Access next token in a stream
>>
>> On 9/2/2012 10:51 PM, Steven A Rowe wrote:
>>> Damerian,
>>>
>>> The technique I mentioned would work for you with a little tweaking:
>>> when you see consecutive capitalized tokens, then just set the
>>> CharTermAttribute to the joined tokens, and clear the previous token.
>>>
>>> Another idea: you could use ShingleFilter with min size = max size = 2,
>>> and then use a following Filter extending FilteringTokenFilter, with an
>>> accept() method that examines shingles and rejects ones that don't
>>> qualify, something like the following.  (Notes: this is untested; I assume
>>> you will use the default shingle token separator " "; and this filter will
>>> reject all non-shingle terms, so you won't get anything but names, even if
>>> you configure ShingleFilter to emit single tokens):
>>>
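>>> public final class MyNameFilter extends FilteringTokenFilter {
>>>     private static final Pattern NAME_PATTERN
>>>         = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
>>>     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>
>>>     // Assumes the Lucene 3.x constructor signature; later releases differ.
>>>     public MyNameFilter(TokenStream input) {
>>>       super(true, input);
>>>     }
>>>
>>>     // Accept only terms made of two or more whitespace-separated capitalized words.
>>>     @Override public boolean accept() throws IOException {
>>>       return NAME_PATTERN.matcher(termAtt).matches();
>>>     }
>>> }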
>>>
>>> Steve
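
The shingle-then-filter idea above can be tried out without Lucene. The following is a minimal plain-Java sketch, not a TokenFilter: the class name `ShingleNameDemo` and the helpers `bigrams` and `names` are hypothetical, and the hand-built two-token shingles only imitate what ShingleFilter with min size = max size = 2 and the " " separator is assumed to produce. The regular expression is the one from the filter above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ShingleNameDemo {
    // Same pattern as in MyNameFilter: two or more capitalized words,
    // each separated by a single whitespace character.
    static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // Builds two-token shingles joined by a single space (hypothetical helper
    // standing in for ShingleFilter output).
    static List<String> bigrams(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }

    // Keeps only the shingles that match NAME_PATTERN in full.
    static List<String> names(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String shingle : bigrams(tokens)) {
            if (NAME_PATTERN.matcher(shingle).matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [James Bond]
        System.out.println(names(Arrays.asList("My", "name", "is", "James", "Bond")));
    }
}
```

Note that with three capitalized tokens in a row, such as "Sir James Bond", this approach keeps two overlapping shingles ("Sir James" and "James Bond") rather than one three-word name.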
>>>
>>>> -----Original Message-----
>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>> Sent: Thursday, February 09, 2012 4:15 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Access next token in a stream
>>>>
>>>> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
>>>>> Hi Damerian,
>>>>>
>>>>> One way to handle your scenario is to hold on to the previous token, and
>>>>> only emit a token after you reach at least the second token (or at
>>>>> end-of-stream).  Your incrementToken() method could look something like:
>>>>>
>>>>> 1. Get current attributes: input.incrementToken()
>>>>> 2. If previous token does not exist:
>>>>>      2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
>>>>>      2b. Get current attributes: input.incrementToken()
>>>>> 3. Check for & store conditions that will affect previous token's attributes
>>>>> 4. Store current attributes as next token (see AttributeSource#cloneAttributes)
>>>>> 5. Copy previous token into current attributes (see AttributeSource#copyTo);
>>>>>    the target will be "this", which is an AttributeSource.
>>>>> 6. Make changes based on conditions found in step #3 above
>>>>> 7. set previous token = next token
>>>>> 8. return true
>>>>>
>>>>> (Everywhere I say "token" I mean "instance of AttributeSource".)
>>>>>
>>>>> The final token in the input stream will need special handling, as will
>>>>> single-token input streams.
>>>>>
>>>>> Good luck,
>>>>> Steve
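
The hold-back idea in the steps above can be illustrated outside Lucene. The sketch below is a plain-Java analogue working on a list of strings, not an implementation of incrementToken(): the class `NameJoiner`, the method `joinNames`, and the `isCapitalized` criterion are hypothetical names chosen for illustration. It buffers one token, emits it only after seeing its successor, and flushes the buffer at end of input, which covers the final-token and single-token cases mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NameJoiner {
    // Hypothetical criterion: the token starts with an uppercase letter.
    static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.codePointAt(0));
    }

    // Joins each pair of consecutive capitalized tokens into a single token.
    static List<String> joinNames(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String previous = null; // buffered token that has not been emitted yet
        for (String current : tokens) {
            if (previous == null) {
                // Nothing buffered: hold the token back until we see the next one.
                previous = current;
            } else if (isCapitalized(previous) && isCapitalized(current)) {
                // Emit the joined token only; the buffered one is never emitted alone.
                out.add(previous + " " + current);
                previous = null;
            } else {
                // No join: emit the buffered token and buffer the current one.
                out.add(previous);
                previous = current;
            }
        }
        if (previous != null) {
            // End of input: flush whatever is still buffered.
            out.add(previous);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [My, name, is, James Bond]
        System.out.println(joinNames(Arrays.asList("My", "name", "is", "James", "Bond")));
    }
}
```

In a real TokenFilter the buffered item would be a stored copy of the attribute state rather than a String, as the steps above describe.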
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>>>> Sent: Thursday, February 09, 2012 2:19 PM
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Access next token in a stream
>>>>>>
>>>>>> Hello, I want to implement my custom filter. My question is quite simple,
>>>>>> but I cannot find a solution to it no matter how I try:
>>>>>>
>>>>>> How can I access the TermAttribute of the token after the one I
>>>>>> currently have in my stream?
>>>>>>
>>>>>> For example, in the phrase "My name is James Bond", if let's say I am at
>>>>>> the token [My], I would like to be able to check the TermAttribute of
>>>>>> the following token [name] and fix my position increment accordingly.
>>>>>>
>>>>>> Thank you in advance!
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> Hi Steve,
>>>> Thank you for your immediate reply. I will try your solution, but I feel
>>>> that it does not solve my case.
>>>> What I am trying to make is a filter that joins together two
>>>> terms/tokens that start with a capital letter (it is trying to find all
>>>> the names/surnames and make them one token). So in my aforementioned
>>>> example, when I examine [James], even if I store the TermAttribute in a
>>>> temporary token, how can I check the next one, [Bond], and join them
>>>> without actually emitting a token (and therefore creating a term in my
>>>> inverted index) that has [James] on its own?
>>>> Thank you again for your insight; I would really appreciate any other
>>>> views on the matter.
>>>>
>>>> Regards, Damerian
>>>>
>>>>
>> I think my solution is almost complete now. Only one question: you mentioned
>> "clear the previous token". Is there a built-in method for doing that?
>> In the beginning I thought that if I put my new token at the same
>> position (position increment 0) it would "overwrite" the previous one, but all I
>> managed to do was inject an extra token. My method that does that so far is
>> this:
>>
>> @Override
>> public boolean incrementToken() throws IOException {
>>     if (!input.incrementToken()) {
>>         return false;
>>     }
>>     // Case where the previous token did NOT start with a capital letter
>>     // followed by lowercase letters
>>     if (previousTokenCanditateMainName == false) {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = true;
>>             tempString = this.termAtt.term();   /* This is the token I need to "delete" */
>>             // myToken.offsetAtt=this.offsetAtt;
>>             tempStartOffset = this.offsetAtt.startOffset();
>>             tempEndOffset = this.offsetAtt.endOffset();
>>             //this.nextInputStreamToken.clearAttributes();
>>
>>             return true;
>>         } else {
>>             return true;
>>         }
>>     }
>>     // Case where the previous token WAS a proper name (starting with a
>>     // capital letter and continuing with lowercase letters)
>>     else {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = false;
>>             posIncrAtt.setPositionIncrement(0);
>>             String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();
>>
>>             //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
>>             termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
>>             offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
>>             return true;
>>         } else {
>>             previousTokenCanditateMainName = false;
>>             return true;
>>         }
>>     }
>> }
>>
>> The CheckIfMainName() method is a simple custom method that decides
>> whether the token fulfills the criteria.
>>
>> Once again, thank you very much for your help and for the time that you
>> spent helping me.
>>
>> regards
>> /Damerian
>>
Steve, one last thank you! I gained valuable knowledge tonight!

/Damerian
