lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: How to deal with Token in the new TS API
Date Sun, 22 Nov 2009 19:30:07 GMT
To call clear, you can always downcast to AttributeImpl. But you need to
know, that it may clear also other attributes (like if it is a Token). So
setting termLength to 0 is the fastest approach, if you only need the term
att.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Shai Erera [mailto:serera@gmail.com]
> Sent: Sunday, November 22, 2009 8:26 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to deal with Token in the new TS API
> 
> Ok I see you fixed it at the same time I sent the email :).
> 
> I think I get it ... so far.
> 
> So far I had to cache just TermAttribute. I think it'll get messy when
> I'll
> need to cache more, like Type and PositionIncrement. But I haven't reached
> those yet. Perhaps instead of creating many types of clones, I'll create
> Token and populate it w/ what I need, just for convenience ...
> 
> Thanks,
> Shai
> 
> On Sun, Nov 22, 2009 at 9:23 PM, Shai Erera <serera@gmail.com> wrote:
> 
> > I assume termAtt is the input's TermAttribute, right? Therefore it has
> no
> > copyTo ...
> >
> > What I've done so far is create a TermAttribute like you proposed (fixed
> > from my previous TermAttributeImpl):
> >
> > TermAttribute clone = (TermAttribute)
> >
> input.getAttributeFactory().createAttributeInstance(TermAttribute.class);
> >
> > and to clear() it I just clone.setTermLength(0);
> >
> > This at least works for me.
> >
> > Shai
> >
> >
> > On Sun, Nov 22, 2009 at 9:14 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> >> Another idea, what you can also do is, create an AttributeSource
> instance
> >> in
> >> your TokenStream one time using the AttributeSource.cloneAttributes()
> >> call.
> >> You can use this copy of the attributes in parallel and maybe update
> the
> >> TermAttribute there and so on. If you want to look at the last token,
> just
> >> look into the copied attributesource. The calls to
> >> addAttribute/getAttribute
> >> of this source can be done after cloning.
> >>
> >> Class Initializer:
> >> private final AttributeSource lastState = cloneAttributes();
> >> private final TermAttribute lastTermAtt =
> >> lastState.addAttribute(TermAttribute.class);
> >>
> >> incrementToken:
> >>
> >> if (input.incrementToken()) {
> >>        if (lastTermAtt.checkSomethingAsYouProposed) {
> >>                blubber...
> >>        }
> >>        termAtt.copyTo(lastTermAtt); // save current state
> >>        return true;
> >> } else return false;
> >>
> >>
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >> > -----Original Message-----
> >> > From: Uwe Schindler [mailto:uwe@thetaphi.de]
> >> > Sent: Sunday, November 22, 2009 8:03 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: RE: How to deal with Token in the new TS API
> >> >
> >> > I said, you *could* if it would be exposed. But the State is a holder
> >> > class
> >> > without functionality. Because the internals are impl dependent,
> maybe
> >> we
> >> > will add such thing in future. But: If the state contains a real map,
> it
> >> > would be slow, because each captureState call would need to fill the
> >> map,
> >> > which is slow. And: If you use the Token as AttImpl, the state will
> only
> >> > contain one entry. You cannot control which attribute is implemented
> by
> >> > what
> >> > impl, so the map approach would never work correct.
> >> >
> >> >
> >> >
> >> > You can allocate a TermAttributeImpl and copyTo, but you should
> create
> >> the
> >> > instance using the same factory as the tokenstream uses:
> >> >
> >> >
> >> >
> >> > TermAttribute copy = (TermAttribute)
> >> > getAttributeFactory().createAttributeInstance(TermAttribute.class);
> >> >
> >> >
> >> >
> >> > By that you guarantee, that both are from the same implementation
> type.
> >> >
> >> >
> >> >
> >> > -----
> >> >
> >> > Uwe Schindler
> >> >
> >> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >
> >> > http://www.thetaphi.de
> >> >
> >> > eMail: uwe@thetaphi.de
> >> >
> >> >
> >> >
> >> > > -----Original Message-----
> >> >
> >> > > From: Shai Erera [mailto:serera@gmail.com]
> >> >
> >> > > Sent: Sunday, November 22, 2009 7:53 PM
> >> >
> >> > > To: java-user@lucene.apache.org
> >> >
> >> > > Subject: Re: How to deal with Token in the new TS API
> >> >
> >> > >
> >> >
> >> > > Yes I can clone the term itself by instantiating a
> TermAttributeImpl,
> >> >
> >> > > which
> >> >
> >> > > is better than storing the String, because the latter always
> allocates
> >> >
> >> > > char[], while the former will reuse the char[] if it's big enough.
> >> >
> >> > >
> >> >
> >> > > What if State included a HashMap of all attributes, in addition to
> its
> >> >
> >> > > "linked-list" structure?
> >> >
> >> > >
> >> >
> >> > > Anyway, you mention that I can iterate on all Attributes of a
> State,
> >> but
> >> >
> >> > > it's not clear to me how to do it, since I don't see any relevant
> >> method
> >> >
> >> > > in
> >> >
> >> > > its API. Am I missing something?
> >> >
> >> > >
> >> >
> >> > > Shai
> >> >
> >> > >
> >> >
> >> > > On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe@thetaphi.de>
> >> wrote:
> >> >
> >> > >
> >> >
> >> > > > > Because that'd mean I'll check for abbreviations for every
> token.
> >> >
> >> > > Which
> >> >
> >> > > > is
> >> >
> >> > > > > a
> >> >
> >> > > > > big performance loss. That way, I can just check abbr if
I
> >> > encountered
> >> >
> >> > > a
> >> >
> >> > > > > "."
> >> >
> >> > > > > (not even all end-of-sentence tokens).
> >> >
> >> > > >
> >> >
> >> > > > OK, than simply copy the term to a String and store it. The cost
> is
> >> > the
> >> >
> >> > > > same
> >> >
> >> > > > like cloning/copying. If you find the ".", use the String and
> look
> >> it
> >> >
> >> > > up.
> >> >
> >> > > >
> >> >
> >> > > > > Why can't State offer a "getAttribute" like AttributeSource?
> >> >
> >> > > >
> >> >
> >> > > > Because State is optimized for fast restore. In previous 2.9
> >> versions
> >> >
> >> > > State
> >> >
> >> > > > was itself an AttributeSource instance, but the capture/store
was
> >> > very,
> >> >
> >> > > > very
> >> >
> >> > > > slow.
> >> >
> >> > > >
> >> >
> >> > > > If you want to check an State, you would have need to iterate
> over
> >> all
> >> >
> >> > > > attributes and find the correct one, which is also slow. The
best
> is
> >> > to
> >> >
> >> > > > simply clone the term text as a string. You must create new
> objects
> >> in
> >> >
> >> > > all
> >> >
> >> > > > cases, even with clone/copy.
> >> >
> >> > > >
> >> >
> >> > > > Uwe
> >> >
> >> > > >
> >> >
> >> > > > > Shai
> >> >
> >> > > > >
> >> >
> >> > > > > On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler
> <uwe@thetaphi.de>
> >> >
> >> > > wrote:
> >> >
> >> > > > >
> >> >
> >> > > > > > If you just want to lookup if "Mr" is an abbreviation,
why
> not
> >> > look
> >> >
> >> > > it
> >> >
> >> > > > > up
> >> >
> >> > > > > > when you handle that token and set a boolean variable
in the
> TS
> >> >
> >> > > > > > (lastTokenWasAbbreviation). When you process the ".",
remove
> it
> >> if
> >> >
> >> > > the
> >> >
> >> > > > > > Boolean is set.
> >> >
> >> > > > > >
> >> >
> >> > > > > > Uwe
> >> >
> >> > > > > >
> >> >
> >> > > > > > -----
> >> >
> >> > > > > > Uwe Schindler
> >> >
> >> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >
> >> > > > > > http://www.thetaphi.de
> >> >
> >> > > > > > eMail: uwe@thetaphi.de
> >> >
> >> > > > > >
> >> >
> >> > > > > >
> >> >
> >> > > > > > > -----Original Message-----
> >> >
> >> > > > > > > From: Shai Erera [mailto:serera@gmail.com]
> >> >
> >> > > > > > > Sent: Sunday, November 22, 2009 3:28 PM
> >> >
> >> > > > > > > To: java-user@lucene.apache.org
> >> >
> >> > > > > > > Subject: Re: How to deal with Token in the new
TS API
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > What I've done is:
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > State state = in.captureState();
> >> >
> >> > > > > > > ...
> >> >
> >> > > > > > > // Upon new call to incrementToken().
> >> >
> >> > > > > > > State tmp = in.captureState();
> >> >
> >> > > > > > > in.restoreState(state);
> >> >
> >> > > > > > > // check if termAttribute is an abbreviation.
> >> >
> >> > > > > > > If not : in.restoreState(tmp);
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > But seems a lot of capturing/restoring to me ...
how
> expensive
> >> > is
> >> >
> >> > > > > that?
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > Shai
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera
> <serera@gmail.com
> >> >
> >> >
> >> > > > wrote:
> >> >
> >> > > > > > >
> >> >
> >> > > > > > > > Perhaps I misunderstand something. The current
use case
> I'm
> >> >
> >> > > trying
> >> >
> >> > > > > to
> >> >
> >> > > > > > > solve
> >> >
> >> > > > > > > > is - I have an abbreviations TokenFilter
which reads a
> token
> >> > and
> >> >
> >> > > > > stores
> >> >
> >> > > > > > > it.
> >> >
> >> > > > > > > > If the next token is end-of-sentence, it
checks whether
> the
> >> >
> >> > > > previous
> >> >
> >> > > > > > one
> >> >
> >> > > > > > > is
> >> >
> >> > > > > > > > in the abbreviations list, and discards the
end-of-
> sentence
> >> >
> >> > > token.
> >> >
> >> > > > I
> >> >
> >> > > > > > > need to
> >> >
> >> > > > > > > > store the first token somewhere so I can
reference it.
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > > Example: "hello mr. shai"
> >> >
> >> > > > > > > > First token = hello -> store it and return
> >> >
> >> > > > > > > > Second token = mr -> store it and return
> >> >
> >> > > > > > > > Third token = "." -> check if "mr" is
an abbreviation, if
> so
> >> >
> >> > > don't
> >> >
> >> > > > > > > return
> >> >
> >> > > > > > > > ".".
> >> >
> >> > > > > > > > Fourth token = "shai" -> store it and
return.
> >> >
> >> > > > > > > > ...
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > > How do I store "mr" (or any of the others)?
It was easy
> w/
> >> >
> >> > > copyTo.
> >> >
> >> > > > > If I
> >> >
> >> > > > > > > > captureState, I get a State, but I can't
query it for a
> >> >
> >> > > > > TermAttribute.
> >> >
> >> > > > > > > Any
> >> >
> >> > > > > > > > ideas?
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > > Shai
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler
> >> > <uwe@thetaphi.de>
> >> >
> >> > > > > > wrote:
> >> >
> >> > > > > > > >
> >> >
> >> > > > > > > >> Use captureState and save the state somewhere.
You can
> >> > restore
> >> >
> >> > > the
> >> >
> >> > > > > > > state
> >> >
> >> > > > > > > >> with restoreState to the TokenStream.
CachingTokenFilter
> >> does
> >> >
> >> > > > this.
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >> So the new API uses the State object
to put away tokens
> for
> >> >
> >> > > later
> >> >
> >> > > > > > > >> reference.
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >> -----
> >> >
> >> > > > > > > >> Uwe Schindler
> >> >
> >> > > > > > > >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> >
> >> > > > > > > >> http://www.thetaphi.de
> >> >
> >> > > > > > > >> eMail: uwe@thetaphi.de
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >> > -----Original Message-----
> >> >
> >> > > > > > > >> > From: Shai Erera [mailto:serera@gmail.com]
> >> >
> >> > > > > > > >> > Sent: Sunday, November 22, 2009
2:29 PM
> >> >
> >> > > > > > > >> > To: java-user@lucene.apache.org
> >> >
> >> > > > > > > >> > Subject: Re: How to deal with Token
in the new TS API
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > ok so from what I understand, I
should stop working w/
> >> > Token,
> >> >
> >> > > > and
> >> >
> >> > > > > > > move
> >> >
> >> > > > > > > >> to
> >> >
> >> > > > > > > >> > working w/ the Attributes.
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > addAttribute indeed does not work.
Even though it does
> >> not
> >> >
> >> > > > > through
> >> >
> >> > > > > > an
> >> >
> >> > > > > > > >> > exception, if I call in.addAttribute(Token.class),
I
> get
> >> a
> >> >
> >> > > new
> >> >
> >> > > > > > > instance
> >> >
> >> > > > > > > >> of
> >> >
> >> > > > > > > >> > Token and not the once that was
added by in. So this
> is
> >> > even
> >> >
> >> > > > more
> >> >
> >> > > > > > > severe
> >> >
> >> > > > > > > >> > than just not blocking this option.
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > I thought I can move to use addAttributeImpl,
but that
> >> > won't
> >> >
> >> > > > help
> >> >
> >> > > > > > me,
> >> >
> >> > > > > > > >> > because I won't be able to call
> >> getAttribute(Token.class).
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > So this leaves me w/ just working
w/ the interfaces.
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > What do I need to do in order to
clone an attribute?
> >> >
> >> > > Previously
> >> >
> >> > > > I
> >> >
> >> > > > > > > used
> >> >
> >> > > > > > > >> > token.copyTo(target). How I can
do it now if I don't
> have
> >> >
> >> > > copyTo
> >> >
> >> > > > > on
> >> >
> >> > > > > > > the
> >> >
> >> > > > > > > >> > interfaces, and/or clone?
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > Shai
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > On Sun, Nov 22, 2009 at 2:58 PM,
Uwe Schindler
> >> >
> >> > > <uwe@thetaphi.de
> >> >
> >> > > > >
> >> >
> >> > > > > > > wrote:
> >> >
> >> > > > > > > >> >
> >> >
> >> > > > > > > >> > > > But I do use addAttribute(Token.class),
so I don't
> >> >
> >> > > > understand
> >> >
> >> > > > > > why
> >> >
> >> > > > > > > >> you
> >> >
> >> > > > > > > >> > say
> >> >
> >> > > > > > > >> > > > it's not possible. And
I completely don't
> understand
> >> > why
> >> >
> >> > > the
> >> >
> >> > > > > new
> >> >
> >> > > > > > > API
> >> >
> >> > > > > > > >> > > > allows
> >> >
> >> > > > > > > >> > > > me to just work w/ interfaces
and not impls ... A
> >> while
> >> >
> >> > > ago
> >> >
> >> > > > I
> >> >
> >> > > > > > got
> >> >
> >> > > > > > > >> the
> >> >
> >> > > > > > > >> > > > impression that we're
trying to get rid of
> interfaces
> >> >
> >> > > > because
> >> >
> >> > > > > > > >> they're
> >> >
> >> > > > > > > >> > not
> >> >
> >> > > > > > > >> > > > easy to maintain back-compat
with ...
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > > AddAttribute(Token.class) should
throw an Exception,
> >> but
> >> > it
> >> >
> >> > > > > > doesn't
> >> >
> >> > > > > > > >> > (it's a
> >> >
> >> > > > > > > >> > > bug in 3.0). addAttribute should
only affect
> >> interfaces,
> >> > it
> >> >
> >> > > > > also
> >> >
> >> > > > > > > >> accepts
> >> >
> >> > > > > > > >> > > Token, because the AttributeFactory
accepts it -
> bang.
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > > Sorry, but you can only pass
attribute class
> literals
> >> to
> >> >
> >> > > > > > > >> > > addAttribute/getAttribute/hasAttribute
and so on.
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > > Sorry.
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > > Uwe
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > >
> >> ------------------------------------------------------------------
> >> > -
> >> >
> >> > > > > > > --
> >> >
> >> > > > > > > >> > > To unsubscribe, e-mail:
> >> >
> >> > > > java-user-unsubscribe@lucene.apache.org
> >> >
> >> > > > > > > >> > > For additional commands, e-mail:
java-user-
> >> >
> >> > > > > help@lucene.apache.org
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >> > >
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >>
> >> >
> >> > > > -----------------------------------------------------------------
> --
> >> >
> >> > > > > --
> >> >
> >> > > > > > > >> To unsubscribe, e-mail: java-user-
> >> > unsubscribe@lucene.apache.org
> >> >
> >> > > > > > > >> For additional commands, e-mail: java-user-
> >> >
> >> > > help@lucene.apache.org
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >>
> >> >
> >> > > > > > > >
> >> >
> >> > > > > >
> >> >
> >> > > > > >
> >> >
> >> > > > > >
> >> ------------------------------------------------------------------
> >> > --
> >> >
> >> > > -
> >> >
> >> > > > > > To unsubscribe, e-mail: java-user-
> unsubscribe@lucene.apache.org
> >> >
> >> > > > > > For additional commands, e-mail:
> >> java-user-help@lucene.apache.org
> >> >
> >> > > > > >
> >> >
> >> > > > > >
> >> >
> >> > > >
> >> >
> >> > > >
> >> >
> >> > > >
> >> ---------------------------------------------------------------------
> >> >
> >> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >
> >> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> > > >
> >> >
> >> > > >
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message