lucene-java-user mailing list archives

From Damerian <dameria...@gmail.com>
Subject Re: Access next token in a stream
Date Thu, 09 Feb 2012 21:59:34 GMT
On 9/2/2012 10:51 PM, Steven A Rowe wrote:
> Damerian,
>
> The technique I mentioned would work for you with a little tweaking: when you see consecutive
> capitalized tokens, then just set the CharTermAttribute to the joined tokens, and clear the
> previous token.
>
> Another idea: you could use ShingleFilter with min size = max size = 2, and then use
> a following Filter extending FilteringTokenFilter, with an accept() method that examines shingles
> and rejects ones that don't qualify, something like the following.  (Notes: this is untested;
> I assume you will use the default shingle token separator " "; and this filter will reject
> all non-shingle terms, so you won't get anything but names, even if you configure ShingleFilter
> to emit single tokens):
>
> public final class MyNameFilter extends FilteringTokenFilter {
>    private static final Pattern NAME_PATTERN
>        = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
>    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>    @Override public boolean accept() throws IOException {
>      return NAME_PATTERN.matcher(termAtt).matches();
>    }
> }
>
> Steve
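
To see which shingles the quoted NAME_PATTERN accepts, the expression can be exercised outside Lucene. The following is a minimal sketch using only java.util.regex; the class name NamePatternDemo and the helper isName are made up for illustration and are not part of the quoted filter.

```java
import java.util.List;
import java.util.regex.Pattern;

class NamePatternDemo {
    // Same expression as NAME_PATTERN in the quoted filter.
    static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // True when the shingle has at least two whitespace-separated parts
    // and each part starts with an uppercase letter.
    static boolean isName(CharSequence shingle) {
        return NAME_PATTERN.matcher(shingle).matches();
    }

    public static void main(String[] args) {
        // Shingles of size 2 from "My name is James Bond", plus one single token.
        for (String s : List.of("My name", "name is", "is James", "James Bond", "Bond")) {
            System.out.println(s + " -> " + isName(s));
        }
    }
}
```

Of these inputs only "James Bond" matches. A single token such as "Bond" is rejected because the pattern requires at least one separator followed by a second capitalized part, which is why the quoted note says the filter passes nothing but names.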
>
>> -----Original Message-----
>> From: Damerian [mailto:dameriangr@gmail.com]
>> Sent: Thursday, February 09, 2012 4:15 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Access next token in a stream
>>
>> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
>>> Hi Damerian,
>>>
>>> One way to handle your scenario is to hold on to the previous token, and
>>> only emit a token after you reach at least the second token (or at
>>> end-of-stream).  Your incrementToken() method could look something like:
>>> 1. Get current attributes: input.incrementToken()
>>> 2. If previous token does not exist:
>>>      2a. Store current attributes as previous token (see
>>>          AttributeSource#cloneAttributes)
>>>      2b. Get current attributes: input.incrementToken()
>>> 3. Check for & store conditions that will affect previous token's
>>>    attributes
>>> 4. Store current attributes as next token (see
>>>    AttributeSource#cloneAttributes)
>>> 5. Copy previous token into current attributes (see
>>>    AttributeSource#copyTo);
>>>    the target will be "this", which is an AttributeSource.
>>> 6. Make changes based on conditions found in step #3 above
>>> 7. set previous token = next token
>>> 8. return true
>>>
>>> (Everywhere I say "token" I mean "instance of AttributeSource".)
>>>
>>> The final token in the input stream will need special handling, as will
>>> single-token input streams.
>>> Good luck,
>>> Steve
>>>
>>>> -----Original Message-----
>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>> Sent: Thursday, February 09, 2012 2:19 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Access next token in a stream
>>>>
>>>> Hello, I want to implement my custom filter. My question is quite simple,
>>>> but I cannot find a solution to it no matter how I try:
>>>>
>>>> How can I access the TermAttribute of the token after the one I
>>>> currently have in my stream?
>>>>
>>>> For example, in the phrase "My name is James Bond", if, let's say, I am at
>>>> the token [My], I would like to be able to check the TermAttribute of
>>>> the following token [name] and fix my position increment accordingly.
>>>>
>>>> Thank you in advance!
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> Hi Steve,
>> Thank you for your immediate reply. I will try your solution, but I feel
>> that it does not solve my case.
>> What I am trying to make is a filter that joins together two
>> terms/tokens that start with a capital letter (it is trying to find all
>> the Names/Surnames and make them one token). So in my aforementioned
>> example, when I examine [James], even if I store the TermAttribute in a
>> temporary token, how can I check the next one, [Bond], and join them
>> without actually emitting [James] on its own (and therefore creating a
>> term for it in my inverted index)?
>> Thank you again for your insight, and I would really appreciate any other
>> views on the matter.
>>
>> Regards, Damerian
>>
>>
I think my solution is almost complete now. Only one question: you mentioned
"clear the previous token." Is there a built-in method for doing that?
In the beginning I thought that if I gave my new token a position
increment of 0 it would "overwrite" the previous one, but all I managed
was to inject the new token alongside it. My method so far is this:

@Override
     public boolean incrementToken() throws IOException {
         if (!input.incrementToken()) {
             return false;
         }
         // Case where the previous token did NOT start with a capital
         // letter followed by lowercase letters
         if (previousTokenCanditateMainName == false) {
             if (CheckIfMainName(termAtt.term())) {
                 previousTokenCanditateMainName = true;
                 // This is the token I need to "delete"
                 tempString = this.termAtt.term();
                 // myToken.offsetAtt=this.offsetAtt;
                 tempStartOffset = this.offsetAtt.startOffset();
                 tempEndOffset = this.offsetAtt.endOffset();
                 //this.nextInputStreamToken.clearAttributes();

                 return true;
             } else {
                 return true;
             }
         } // Case where the previous token WAS a proper name (starting
           // with a capital and continuing with lowercase letters)
         else {
             if (CheckIfMainName(termAtt.term())) {
                 previousTokenCanditateMainName = false;
                 posIncrAtt.setPositionIncrement(0);
                 String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();

                 //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
                 termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
                 offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
                 return true;
             } else {
                 previousTokenCanditateMainName = false;
                 return true;
             }
         }

     }

The CheckIfMainName() method is a simple custom-made method that decides
whether the token fulfills the criteria.
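
One way to avoid having to "delete" [James] at all is to hold the candidate back instead of returning it, which is the buffering idea from earlier in the thread. The sketch below shows only that control flow on a plain list of strings; it is not Lucene code. The names NameJoinSketch, joinNames and isMainName are hypothetical, and isMainName is only a guess at what CheckIfMainName does (one capital letter followed by lowercase letters).

```java
import java.util.ArrayList;
import java.util.List;

class NameJoinSketch {
    // Hypothetical stand-in for CheckIfMainName:
    // one uppercase letter followed by one or more lowercase letters.
    static boolean isMainName(String term) {
        return term.matches("\\p{Lu}\\p{Ll}+");
    }

    // Joins each pair of consecutive name-like terms into a single term.
    // A candidate is held back rather than emitted, so it never appears
    // on its own when a second name-like term follows it.
    static List<String> joinNames(List<String> terms, String separator) {
        List<String> out = new ArrayList<>();
        String held = null; // candidate first name, not yet emitted
        for (String term : terms) {
            if (held == null) {
                if (isMainName(term)) {
                    held = term;                  // wait for the next term before deciding
                } else {
                    out.add(term);
                }
            } else if (isMainName(term)) {
                out.add(held + separator + term); // emit only the joined term
                held = null;
            } else {
                out.add(held);                    // no partner: release the held term unchanged
                out.add(term);
                held = null;
            }
        }
        if (held != null) {
            out.add(held);                        // end of stream: flush a trailing candidate
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinNames(List.of("My", "name", "is", "James", "Bond"), " "));
    }
}
```

For the example phrase this prints [My, name, is, James Bond], with no separate [James] entry. In an actual TokenFilter the same shape would presumably mean calling input.incrementToken() a second time inside one incrementToken() call and saving the held token's attributes in between (for instance with AttributeSource#captureState and restoreState, or cloneAttributes and copyTo as suggested above), since the non-partner case has to return two tokens over two calls.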

Once again, thank you very much for your help and for the time you
have spent helping me.

regards
/Damerian
