On 9/2/2012 at 11:12 PM, Steven A Rowe wrote:
> Damerian,
>
> When I said "clear the previous token", I was referring to the pseudo-code I gave in
> my first response to you. There is no built-in method to do that. If you want to conditionally
> output tokens, you should store AttributeSource clones, as in my pseudo-code.
>
> Steve
>
>> -----Original Message-----
>> From: Damerian [mailto:dameriangr@gmail.com]
>> Sent: Thursday, February 09, 2012 5:00 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Access next token in a stream
>>
>> On 9/2/2012 at 10:51 PM, Steven A Rowe wrote:
>>> Damerian,
>>>
>>> The technique I mentioned would work for you with a little tweaking:
>>> when you see consecutive capitalized tokens, then just set the
>>> CharTermAttribute to the joined tokens, and clear the previous token.
>>>
>>> Another idea: you could use ShingleFilter with min size = max size = 2,
>>> and then use a following Filter extending FilteringTokenFilter, with an
>>> accept() method that examines shingles and rejects ones that don't
>>> qualify, something like the following. (Notes: this is untested; I assume
>>> you will use the default shingle token separator " "; and this filter will
>>> reject all non-shingle terms, so you won't get anything but names, even if
>>> you configure ShingleFilter to emit single tokens):
>>>
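>>> public final class MyNameFilter extends FilteringTokenFilter {
>>>   // One capitalized word followed by one or more capitalized words
>>>   private static final Pattern NAME_PATTERN
>>>       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
>>>   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>
>>>   // Constructor added for completeness; assumes the Lucene 3.x
>>>   // FilteringTokenFilter(boolean, TokenStream) signature
>>>   public MyNameFilter(boolean enablePositionIncrements, TokenStream input) {
>>>     super(enablePositionIncrements, input);
>>>   }
>>>
>>>   @Override public boolean accept() throws IOException {
>>>     return NAME_PATTERN.matcher(termAtt).matches();
>>>   }
>>> }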
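
[Note] To see what the shingle-plus-regex idea would keep, here is a small library-independent sketch. It does not use Lucene: the bigrams() helper is a hypothetical stand-in for ShingleFilter with min size = max size = 2 and the default " " separator, and the class and method names are illustrative only. The pattern is the one from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ShingleNameDemo {
    // Same pattern as in the quoted MyNameFilter
    private static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // Hypothetical stand-in for ShingleFilter (size 2, separator " ")
    static List<String> bigrams(String[] tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            shingles.add(tokens[i] + " " + tokens[i + 1]);
        }
        return shingles;
    }

    // Keeps only the shingles in which every word starts with an upper-case letter
    static List<String> names(String text) {
        List<String> kept = new ArrayList<>();
        for (String shingle : bigrams(text.split("\\s+"))) {
            if (NAME_PATTERN.matcher(shingle).matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(names("My name is James Bond")); // prints [James Bond]
    }
}
```

Note that a run of three capitalized words yields two overlapping shingles, so "New York City" would produce both "New York" and "York City" under this approach.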
>>>
>>> Steve
>>>
>>>> -----Original Message-----
>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>> Sent: Thursday, February 09, 2012 4:15 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Access next token in a stream
>>>>
>>>> On 9/2/2012 at 8:54 PM, Steven A Rowe wrote:
>>>>> Hi Damerian,
>>>>>
>>>>> One way to handle your scenario is to hold on to the previous token, and
>>>>> only emit a token after you reach at least the second token (or at
>>>>> end-of-stream). Your incrementToken() method could look something like:
>>>>>
>>>>> 1. Get current attributes: input.incrementToken()
>>>>> 2. If previous token does not exist:
>>>>>    2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
>>>>>    2b. Get current attributes: input.incrementToken()
>>>>> 3. Check for & store conditions that will affect previous token's attributes
>>>>> 4. Store current attributes as next token (see AttributeSource#cloneAttributes)
>>>>> 5. Copy previous token into current attributes (see AttributeSource#copyTo);
>>>>>    the target will be "this", which is an AttributeSource.
>>>>> 6. Make changes based on conditions found in step #3 above
>>>>> 7. Set previous token = next token
>>>>> 8. Return true
>>>>>
>>>>> (Everywhere I say "token" I mean "instance of AttributeSource".)
>>>>>
>>>>> The final token in the input stream will need special handling, as will
>>>>> single-token input streams.
>>>>>
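
[Note] The hold-back-one-token idea described in the steps above can be modeled without Lucene. The sketch below is an assumption-laden illustration, not the Lucene implementation: plain strings stand in for AttributeSource clones, an Iterator stands in for the TokenStream, and the isCapitalized() criterion and all class and method names are hypothetical. It joins at most two consecutive capitalized tokens and covers the end-of-stream and single-token cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Library-independent model of a one-token lookahead filter (not Lucene code). */
final class LookaheadNameJoiner implements Iterator<String> {
    private final Iterator<String> input;
    private String previous; // held-back token; null when nothing is buffered

    LookaheadNameJoiner(Iterator<String> input) {
        this.input = input;
    }

    // Hypothetical criterion: the first character is an upper-case letter
    private static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    @Override
    public boolean hasNext() {
        return previous != null || input.hasNext();
    }

    @Override
    public String next() {
        if (previous == null) {
            previous = input.next(); // throws NoSuchElementException when exhausted
        }
        String out = previous;
        previous = null;
        if (input.hasNext()) {
            String current = input.next();
            if (isCapitalized(out) && isCapitalized(current)) {
                return out + " " + current; // joined; neither part is emitted alone
            }
            previous = current; // hold the lookahead token for the next call
        }
        return out; // also reached for the final token and single-token input
    }

    static List<String> joinAll(List<String> tokens) {
        List<String> result = new ArrayList<>();
        new LookaheadNameJoiner(tokens.iterator()).forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        // prints [My, name, is, James Bond]
        System.out.println(joinAll(Arrays.asList("My", "name", "is", "James", "Bond")));
    }
}
```

In a real TokenFilter the buffered token would carry all attributes (term, offsets, position increment) rather than just the text, which is why the steps above store and restore whole AttributeSource clones.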
>>>>> Good luck,
>>>>> Steve
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Damerian [mailto:dameriangr@gmail.com]
>>>>>> Sent: Thursday, February 09, 2012 2:19 PM
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Access next token in a stream
>>>>>>
>>>>>> Hello, I want to implement my own custom filter. My question is quite simple,
>>>>>> but I cannot find a solution to it no matter how I try:
>>>>>>
>>>>>> How can I access the TermAttribute of the token after the one I
>>>>>> currently have in my stream?
>>>>>>
>>>>>> For example, in the phrase "My name is James Bond", if, let's say, I am at
>>>>>> the token [My], I would like to be able to check the TermAttribute of
>>>>>> the following token [name] and fix my position increment accordingly.
>>>>>>
>>>>>> Thank you in advance!
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> Hi Steve,
>>>> Thank you for your immediate reply. I will try your solution, but I feel
>>>> that it does not solve my case.
>>>> What I am trying to make is a filter that joins together two
>>>> terms/tokens that start with a capital letter (it tries to find all
>>>> the names/surnames and make each pair one token). So in my aforementioned
>>>> example, when I examine [James], even if I store the TermAttribute in a
>>>> temporary token, how can I check the next one, [Bond], and join them
>>>> without actually emitting [James] on its own (and therefore creating a
>>>> term for it in my inverted index)?
>>>> Thank you again for your insight; I would really appreciate any other
>>>> views on the matter.
>>>>
>>>> Regards, Damerian
>>>>
>>>>
>> I think my solution is almost complete now. Only one question: you mentioned
>> "clear the previous token". Is there a built-in method for doing that?
>> In the beginning I thought that if I gave my new token a position
>> increment of 0 it would "overwrite" the previous one, but all I
>> managed was to inject an additional token. My method that does that so far is
>> this:
>>
>> @Override
>> public boolean incrementToken() throws IOException {
>>     if (!input.incrementToken()) {
>>         return false;
>>     }
>>     // Case where the previous token did NOT start with a capital letter
>>     // followed by lower-case letters
>>     if (previousTokenCanditateMainName == false) {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = true;
>>             // This is the token I need to "delete"
>>             tempString = this.termAtt.term();
>>             // myToken.offsetAtt = this.offsetAtt;
>>             tempStartOffset = this.offsetAtt.startOffset();
>>             tempEndOffset = this.offsetAtt.endOffset();
>>             // this.nextInputStreamToken.clearAttributes();
>>
>>             return true;
>>         } else {
>>             return true;
>>         }
>>     } // Case where the previous token WAS a proper name (starting
>>       // with a capital and continuing with lower-case letters)
>>     else {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = false;
>>             posIncrAtt.setPositionIncrement(0);
>>             String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();
>>
>>             // termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
>>             termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
>>             offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
>>             return true;
>>         } else {
>>             previousTokenCanditateMainName = false;
>>             return true;
>>         }
>>     }
>> }
>>
>> The CheckIfMainName() method is a simple custom-made method that decides
>> whether the token fulfills the criteria.
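
[Note] The body of CheckIfMainName() is not shown in the thread. Going only by the comments in the method above ("starting with a capital and continuing with lower-case letters"), one possible version might look like the sketch below. The regex, the class name, and the decision to require at least one lower-case letter are all assumptions, not the poster's actual code.

```java
import java.util.regex.Pattern;

final class MainNameCheck {
    // Assumed criterion: one upper-case letter followed by one or more lower-case letters
    private static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}\\p{Ll}+");

    static boolean checkIfMainName(String term) {
        return term != null && CAPITALIZED.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(checkIfMainName("James")); // prints true
        System.out.println(checkIfMainName("name"));  // prints false
    }
}
```

Because the pattern uses Unicode letter categories rather than [A-Z][a-z], it is not limited to ASCII letters.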
>>
>> Once again, thank you very much for your help and the time that you
>> spent helping me.
>>
>> regards
>> /Damerian
>>
Steve, one last thank you! I gained valuable knowledge tonight!
/Damerian