lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Access next token in a stream
Date Thu, 09 Feb 2012 21:51:33 GMT
Damerian,

The technique I mentioned would work for you with a little tweaking: when you see consecutive
capitalized tokens, set the CharTermAttribute to the joined tokens and clear the previous
token, so that the first token is never emitted on its own.
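
A minimal sketch of that tweak, written in plain Java over a list of strings rather than as a real Lucene TokenFilter. The names CapitalizedJoiner, isCapitalized and joinNames are hypothetical, and in an actual filter the held-back token would be captured attribute state, not a String. The sketch holds each capitalized token back until the next token shows whether the two belong together:

```java
import java.util.ArrayList;
import java.util.List;

public class CapitalizedJoiner {
  // True if the token starts with an uppercase letter.
  static boolean isCapitalized(String token) {
    return !token.isEmpty() && Character.isUpperCase(token.codePointAt(0));
  }

  // Joins each run of consecutive capitalized tokens into a single token.
  static List<String> joinNames(List<String> tokens) {
    List<String> out = new ArrayList<>();
    String pending = null; // capitalized token(s) held back, not yet emitted
    for (String token : tokens) {
      if (isCapitalized(token)) {
        // Extend the held-back run instead of emitting the token on its own.
        pending = (pending == null) ? token : pending + " " + token;
      } else {
        if (pending != null) {
          out.add(pending); // the run ended, so emit it now
          pending = null;
        }
        out.add(token);
      }
    }
    if (pending != null) {
      out.add(pending); // end of stream: flush whatever is still held back
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [My, name, is, James Bond]
    System.out.println(joinNames(List.of("My", "name", "is", "James", "Bond")));
  }
}
```

Note that this sketch joins runs of any length, and a lone capitalized token such as "My" is emitted unchanged. The final flush is the end-of-stream special case.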

Another idea: you could use ShingleFilter with min size = max size = 2, and then use a following
Filter extending FilteringTokenFilter, with an accept() method that examines shingles and
rejects ones that don't qualify, something like the following.  (Notes: this is untested;
I assume you will use the default shingle token separator " "; and this filter will reject
all non-shingle terms, so you won't get anything but names, even if you configure ShingleFilter
to emit single tokens):

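import java.io.IOException;
import java.util.regex.Pattern;
// Plus the Lucene imports for FilteringTokenFilter, TokenStream and CharTermAttribute;
// their packages differ between Lucene versions.

public final class MyNameFilter extends FilteringTokenFilter {
  // Two or more whitespace-separated words, each starting with an uppercase letter.
  private static final Pattern NAME_PATTERN
      = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  // Constructor arguments vary by version (3.x also takes enablePositionIncrements).
  public MyNameFilter(TokenStream input) { super(input); }
  @Override public boolean accept() throws IOException {
    return NAME_PATTERN.matcher(termAtt).matches();
  }
}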
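
To see what the shingle-then-filter idea produces, here is a small sketch in plain Java that imitates both stages on a list of strings. It does not use Lucene: ShingleNameSketch, bigramShingles and names are hypothetical names, and the shingling loop is only a stand-in for ShingleFilter with min size = max size = 2 and the " " separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ShingleNameSketch {
  // Same pattern as in the filter above.
  static final Pattern NAME_PATTERN =
      Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

  // Builds every 2-token shingle, joined with the separator " ".
  static List<String> bigramShingles(List<String> tokens) {
    List<String> shingles = new ArrayList<>();
    for (int i = 0; i + 1 < tokens.size(); i++) {
      shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
    }
    return shingles;
  }

  // Keeps only the shingles that match NAME_PATTERN in full.
  static List<String> names(List<String> tokens) {
    List<String> kept = new ArrayList<>();
    for (String shingle : bigramShingles(tokens)) {
      if (NAME_PATTERN.matcher(shingle).matches()) {
        kept.add(shingle);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // prints [James Bond]
    System.out.println(names(List.of("My", "name", "is", "James", "Bond")));
  }
}
```

In this sketch, three consecutive capitalized tokens yield two overlapping shingles: "James Earl Jones" gives "James Earl" and "Earl Jones".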

Steve

> -----Original Message-----
> From: Damerian [mailto:dameriangr@gmail.com]
> Sent: Thursday, February 09, 2012 4:15 PM
> To: java-user@lucene.apache.org
> Subject: Re: Access next token in a stream
> 
> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
> > Hi Damerian,
> >
> > One way to handle your scenario is to hold on to the previous token, and
> only emit a token after you reach at least the second token (or at end-of-
> stream).  Your incrementToken() method could look something like:
> >
> > 1. Get current attributes: input.incrementToken()
> > 2. If previous token does not exist:
> >    2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
> >    2b. Get current attributes: input.incrementToken()
> > 3. Check for & store conditions that will affect previous token's attributes
> > 4. Store current attributes as next token (see AttributeSource#cloneAttributes)
> > 5. Copy previous token into current attributes (see AttributeSource#copyTo);
> >    the target will be "this", which is an AttributeSource.
> > 6. Make changes based on conditions found in step #3 above
> > 7. Set previous token = next token
> > 8. Return true
> >
> > (Everywhere I say "token" I mean "instance of AttributeSource".)
> >
> > The final token in the input stream will need special handling, as will
> single-token input streams.
> >
> > Good luck,
> > Steve
> >
> >> -----Original Message-----
> >> From: Damerian [mailto:dameriangr@gmail.com]
> >> Sent: Thursday, February 09, 2012 2:19 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Access next token in a stream
> >>
> >> Hello, I want to implement my own custom filter. My question is quite
> >> simple, but I cannot find a solution to it no matter how I try:
> >>
> >> How can I access the TermAttribute of the token after the one I
> >> currently have in my stream?
> >>
> >> For example, in the phrase "My name is James Bond", if I am at the
> >> token [My], I would like to be able to check the TermAttribute of
> >> the following token [name] and fix my position increment accordingly.
> >>
> >> Thank you in advance!
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> Hi Steve,
> Thank you for your immediate reply. I will try your solution, but I feel
> that it does not solve my case.
> What I am trying to make is a filter that joins together two
> terms/tokens that start with a capital letter (it tries to find all
> the names/surnames and make each one a single token). So in my
> aforementioned example, when I examine [James], even if I store the
> TermAttribute in a temporary token, how can I check the next one, [Bond],
> and join them without actually emitting [James] on its own (and therefore
> creating a term for it in my inverted index)?
> Thank you again for your insight; I would really appreciate any other
> views on the matter.
> 
> Regards, Damerian
> 
> 
