lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: How to deal with Token in the new TS API
Date Sun, 22 Nov 2009 14:42:50 GMT
> Because that'd mean I'll check for abbreviations for every token. Which is
> a
> big performance loss. That way, I can just check abbr if I encountered a
> "."
> (not even all end-of-sentence tokens).

OK, than simply copy the term to a String and store it. The cost is the same
like cloning/copying. If you find the ".", use the String and look it up.

> Why can't State offer a "getAttribute" like AttributeSource?

Because State is optimized for fast restore. In previous 2.9 versions State
was itself an AttributeSource instance, but the capture/store was very, very
slow.

If you want to check an State, you would have need to iterate over all
attributes and find the correct one, which is also slow. The best is to
simply clone the term text as a string. You must create new objects in all
cases, even with clone/copy.

Uwe

> Shai
> 
> On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> > If you just want to lookup if "Mr" is an abbreviation, why not look it
> up
> > when you handle that token and set a boolean variable in the TS
> > (lastTokenWasAbbreviation). When you process the ".", remove it if the
> > Boolean is set.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Shai Erera [mailto:serera@gmail.com]
> > > Sent: Sunday, November 22, 2009 3:28 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: How to deal with Token in the new TS API
> > >
> > > What I've done is:
> > >
> > > State state = in.captureState();
> > > ...
> > > // Upon new call to incrementToken().
> > > State tmp = in.captureState();
> > > in.restoreState(state);
> > > // check if termAttribute is an abbreviation.
> > > If not : in.restoreState(tmp);
> > >
> > > But seems a lot of capturing/restoring to me ... how expensive is
> that?
> > >
> > > Shai
> > >
> > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera <serera@gmail.com> wrote:
> > >
> > > > Perhaps I misunderstand something. The current use case I'm trying
> to
> > > solve
> > > > is - I have an abbreviations TokenFilter which reads a token and
> stores
> > > it.
> > > > If the next token is end-of-sentence, it checks whether the previous
> > one
> > > is
> > > > in the abbreviations list, and discards the end-of-sentence token. I
> > > need to
> > > > store the first token somewhere so I can reference it.
> > > >
> > > > Example: "hello mr. shai"
> > > > First token = hello -> store it and return
> > > > Second token = mr -> store it and return
> > > > Third token = "." -> check if "mr" is an abbreviation, if so don't
> > > return
> > > > ".".
> > > > Fourth token = "shai" -> store it and return.
> > > > ...
> > > >
> > > > How do I store "mr" (or any of the others)? It was easy w/ copyTo.
> If I
> > > > captureState, I get a State, but I can't query it for a
> TermAttribute.
> > > Any
> > > > ideas?
> > > >
> > > > Shai
> > > >
> > > >
> > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler <uwe@thetaphi.de>
> > wrote:
> > > >
> > > >> Use captureState and save the state somewhere. You can restore the
> > > state
> > > >> with restoreState to the TokenStream. CachingTokenFilter does this.
> > > >>
> > > >> So the new API uses the State object to put away tokens for later
> > > >> reference.
> > > >>
> > > >> -----
> > > >> Uwe Schindler
> > > >> H.-H.-Meier-Allee 63, D-28213 Bremen
> > > >> http://www.thetaphi.de
> > > >> eMail: uwe@thetaphi.de
> > > >>
> > > >> > -----Original Message-----
> > > >> > From: Shai Erera [mailto:serera@gmail.com]
> > > >> > Sent: Sunday, November 22, 2009 2:29 PM
> > > >> > To: java-user@lucene.apache.org
> > > >> > Subject: Re: How to deal with Token in the new TS API
> > > >> >
> > > >> > ok so from what I understand, I should stop working w/ Token,
and
> > > move
> > > >> to
> > > >> > working w/ the Attributes.
> > > >> >
> > > >> > addAttribute indeed does not work. Even though it does not
> through
> > an
> > > >> > exception, if I call in.addAttribute(Token.class), I get a new
> > > instance
> > > >> of
> > > >> > Token and not the once that was added by in. So this is even
more
> > > severe
> > > >> > than just not blocking this option.
> > > >> >
> > > >> > I thought I can move to use addAttributeImpl, but that won't
help
> > me,
> > > >> > because I won't be able to call getAttribute(Token.class).
> > > >> >
> > > >> > So this leaves me w/ just working w/ the interfaces.
> > > >> >
> > > >> > What do I need to do in order to clone an attribute? Previously
I
> > > used
> > > >> > token.copyTo(target). How I can do it now if I don't have copyTo
> on
> > > the
> > > >> > interfaces, and/or clone?
> > > >> >
> > > >> > Shai
> > > >> >
> > > >> > On Sun, Nov 22, 2009 at 2:58 PM, Uwe Schindler <uwe@thetaphi.de>
> > > wrote:
> > > >> >
> > > >> > > > But I do use addAttribute(Token.class), so I don't
understand
> > why
> > > >> you
> > > >> > say
> > > >> > > > it's not possible. And I completely don't understand
why the
> new
> > > API
> > > >> > > > allows
> > > >> > > > me to just work w/ interfaces and not impls ... A while
ago I
> > got
> > > >> the
> > > >> > > > impression that we're trying to get rid of interfaces
because
> > > >> they're
> > > >> > not
> > > >> > > > easy to maintain back-compat with ...
> > > >> > >
> > > >> > > AddAttribute(Token.class) should throw an Exception, but
it
> > doesn't
> > > >> > (it's a
> > > >> > > bug in 3.0). addAttribute should only affect interfaces,
it
> also
> > > >> accepts
> > > >> > > Token, because the AttributeFactory accepts it - bang.
> > > >> > >
> > > >> > > Sorry, but you can only pass attribute class literals to
> > > >> > > addAttribute/getAttribute/hasAttribute and so on.
> > > >> > >
> > > >> > > Sorry.
> > > >> > >
> > > >> > > Uwe
> > > >> > >
> > > >> > >
> > > >> > >
> > -------------------------------------------------------------------
> > > --
> > > >> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> > > For additional commands, e-mail: java-user-
> help@lucene.apache.org
> > > >> > >
> > > >> > >
> > > >>
> > > >>
> > > >> -------------------------------------------------------------------
> --
> > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>
> > > >>
> > > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message