lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: How to deal with Token in the new TS API
Date Sun, 22 Nov 2009 19:14:46 GMT
Did you mean something like:

TermAttributeImpl termBuf = (TermAttributeImpl)
input.getAttributeFactory().createAttributeInstance(TermAttribute.class);

I need to use the methods on TermAttributeImpl like clear() ...

Shai
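
The buffer-reuse argument in the quoted discussion below (a TermAttributeImpl can reuse its char[] when it is big enough, while storing a String allocates every time) can be illustrated with a minimal, Lucene-free sketch. `ReusableTermBuffer` and all of its methods are hypothetical names invented for this illustration; they are not Lucene classes and do not reflect how TermAttributeImpl is actually implemented.

```java
// Hypothetical stand-in for a term holder; NOT a Lucene class.
final class ReusableTermBuffer {
    private char[] buffer = new char[16];
    private int length = 0;
    private int allocations = 1; // counts char[] allocations, for illustration only

    /** Copies term text in, growing the array only when it is too small. */
    void copyFrom(char[] src, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[Math.max(len, buffer.length * 2)];
            allocations++;
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    /** Resets the term to empty but keeps the array for reuse. */
    void clear() {
        length = 0;
    }

    int length() {
        return length;
    }

    int allocations() {
        return allocations;
    }

    /** Allocates a new String; call this only when a lookup really needs one. */
    String term() {
        return new String(buffer, 0, length);
    }
}
```

Copying "hello" and then "mr" into the same instance performs no new array allocation in this sketch, whereas keeping each term as a String would allocate once per token.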

On Sun, Nov 22, 2009 at 9:03 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> I said you *could*, if it were exposed. But State is a holder class
> without functionality. Because its internals are implementation-dependent,
> maybe we will add such a thing in the future. But: if the state contained a
> real map, it would be slow, because each captureState call would need to
> fill the map. And: if you use Token as the AttributeImpl, the state will
> contain only one entry. You cannot control which attribute is implemented
> by which impl, so the map approach would never work correctly.
>
> You can allocate a TermAttributeImpl and use copyTo, but you should create
> the instance using the same factory that the TokenStream uses:
>
> TermAttribute copy = (TermAttribute)
> getAttributeFactory().createAttributeInstance(TermAttribute.class);
>
> That guarantees that both are of the same implementation type.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Shai Erera [mailto:serera@gmail.com]
> > Sent: Sunday, November 22, 2009 7:53 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: How to deal with Token in the new TS API
> >
> > Yes, I can clone the term itself by instantiating a TermAttributeImpl,
> > which is better than storing the String, because the latter always
> > allocates a char[], while the former will reuse its char[] if it's big
> > enough.
> >
> > What if State included a HashMap of all attributes, in addition to its
> > "linked-list" structure?
> >
> > Anyway, you mention that I can iterate over all Attributes of a State, but
> > it's not clear to me how to do it, since I don't see any relevant method
> > in its API. Am I missing something?
> >
> > Shai
> >
> > On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> > > > Because that'd mean I'll check for abbreviations for every token,
> > > > which is a big performance loss. That way, I can just check for an
> > > > abbreviation if I encountered a "." (not even all end-of-sentence
> > > > tokens).
> > >
> > > OK, then simply copy the term to a String and store it. The cost is the
> > > same as cloning/copying. If you find the ".", use the String and look it
> > > up.
> > >
> > > > Why can't State offer a "getAttribute" like AttributeSource?
> > >
> > > Because State is optimized for fast restore. In previous 2.9 versions
> > > State was itself an AttributeSource instance, but the capture/store was
> > > very, very slow.
> > >
> > > If you wanted to inspect a State, you would need to iterate over all
> > > attributes and find the correct one, which is also slow. The best option
> > > is to simply copy the term text as a String. You must create new objects
> > > in all cases, even with clone/copy.
> > >
> > > Uwe
> > >
> > > > Shai
> > > >
> > > > On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > >
> > > > > If you just want to look up whether "Mr" is an abbreviation, why not
> > > > > look it up when you handle that token and set a boolean variable in
> > > > > the TS (lastTokenWasAbbreviation)? When you process the ".", remove
> > > > > it if the boolean is set.
> > > > >
> > > > > Uwe
> > > > >
> > > > > -----
> > > > > Uwe Schindler
> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > http://www.thetaphi.de
> > > > > eMail: uwe@thetaphi.de
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Shai Erera [mailto:serera@gmail.com]
> > > > > > Sent: Sunday, November 22, 2009 3:28 PM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: How to deal with Token in the new TS API
> > > > > >
> > > > > > What I've done is:
> > > > > >
> > > > > > State state = in.captureState();
> > > > > > ...
> > > > > > // Upon the next call to incrementToken():
> > > > > > State tmp = in.captureState();
> > > > > > in.restoreState(state);
> > > > > > // Check whether termAttribute is an abbreviation.
> > > > > > // If it is not: in.restoreState(tmp);
> > > > > >
> > > > > > But that seems like a lot of capturing/restoring to me ... how
> > > > > > expensive is that?
> > > > > >
> > > > > > Shai
> > > > > >
> > > > > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera <serera@gmail.com> wrote:
> > > > > >
> > > > > > > Perhaps I misunderstand something. The current use case I'm trying
> > > > > > > to solve is - I have an abbreviations TokenFilter which reads a
> > > > > > > token and stores it. If the next token is end-of-sentence, it checks
> > > > > > > whether the previous one is in the abbreviations list, and discards
> > > > > > > the end-of-sentence token. I need to store the first token somewhere
> > > > > > > so I can reference it.
> > > > > > >
> > > > > > > Example: "hello mr. shai"
> > > > > > > First token = hello -> store it and return
> > > > > > > Second token = mr -> store it and return
> > > > > > > Third token = "." -> check if "mr" is an abbreviation, if so don't
> > > > > > > return ".".
> > > > > > > Fourth token = "shai" -> store it and return.
> > > > > > > ...
> > > > > > >
> > > > > > > How do I store "mr" (or any of the others)? It was easy w/ copyTo.
> > > > > > > If I captureState, I get a State, but I can't query it for a
> > > > > > > TermAttribute. Any ideas?
> > > > > > >
> > > > > > > Shai
> > > > > > >
> > > > > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > >
> > > > > > > >> Use captureState and save the state somewhere. You can restore the
> > > > > > > >> state with restoreState to the TokenStream. CachingTokenFilter does
> > > > > > > >> this.
> > > > > > > >>
> > > > > > > >> So the new API uses the State object to put away tokens for later
> > > > > > > >> reference.
> > > > > > > >>
> > > > > > > >> -----
> > > > > > > >> Uwe Schindler
> > > > > > > >> H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > >> http://www.thetaphi.de
> > > > > > > >> eMail: uwe@thetaphi.de
> > > > > > > >>
> > > > > > > >> > -----Original Message-----
> > > > > > > >> > From: Shai Erera [mailto:serera@gmail.com]
> > > > > > > >> > Sent: Sunday, November 22, 2009 2:29 PM
> > > > > > > >> > To: java-user@lucene.apache.org
> > > > > > > >> > Subject: Re: How to deal with Token in the new TS API
> > > > > > > >> >
> > > > > > > >> > ok so from what I understand, I should stop working w/ Token, and
> > > > > > > >> > move to working w/ the Attributes.
> > > > > > > >> >
> > > > > > > >> > addAttribute indeed does not work. Even though it does not throw an
> > > > > > > >> > exception, if I call in.addAttribute(Token.class), I get a new
> > > > > > > >> > instance of Token and not the one that was added by in. So this is
> > > > > > > >> > even more severe than just not blocking this option.
> > > > > > > >> >
> > > > > > > >> > I thought I could move to using addAttributeImpl, but that won't
> > > > > > > >> > help me, because I won't be able to call getAttribute(Token.class).
> > > > > > > >> >
> > > > > > > >> > So this leaves me w/ just working w/ the interfaces.
> > > > > > > >> >
> > > > > > > >> > What do I need to do in order to clone an attribute? Previously I
> > > > > > > >> > used token.copyTo(target). How can I do it now if I don't have
> > > > > > > >> > copyTo and/or clone on the interfaces?
> > > > > > > >> >
> > > > > > > >> > Shai
> > > > > > > >> >
> > > > > > > >> > On Sun, Nov 22, 2009 at 2:58 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > > >> >
> > > > > > > >> > > > But I do use addAttribute(Token.class), so I don't understand
> > > > > > > >> > > > why you say it's not possible. And I completely don't understand
> > > > > > > >> > > > why the new API allows me to just work w/ interfaces and not
> > > > > > > >> > > > impls ... A while ago I got the impression that we're trying to
> > > > > > > >> > > > get rid of interfaces because they're not easy to maintain
> > > > > > > >> > > > back-compat with ...
> > > > > > > >> > >
> > > > > > > >> > > addAttribute(Token.class) should throw an exception, but it doesn't
> > > > > > > >> > > (it's a bug in 3.0). addAttribute should only accept interfaces; it
> > > > > > > >> > > also accepts Token, because the AttributeFactory accepts it - bang.
> > > > > > > >> > >
> > > > > > > >> > > Sorry, but you can only pass attribute class literals to
> > > > > > > >> > > addAttribute/getAttribute/hasAttribute and so on.
> > > > > > > >> > >
> > > > > > > >> > > Sorry.
> > > > > > > >> > >
> > > > > > > >> > > Uwe
> > > > > > > >> > >
> > > > > > >> > >
>
> > > > > -------------------------------------------------------------------
>
> > > > > > --
>
> > > > > > >> > > To unsubscribe, e-mail:
>
> > > java-user-unsubscribe@lucene.apache.org
>
> > > > > > >> > > For additional commands, e-mail: java-user-
>
> > > > help@lucene.apache.org
>
> > > > > > >> > >
>
> > > > > > >> > >
>
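
The flag-based approach suggested in the thread above (look each token up as it passes, remember the result in a boolean, and drop a following "." when the flag is set) can be sketched without Lucene as follows. This is only a model of the control flow over plain strings: `AbbreviationFlagSketch`, `accept` and `filter` are hypothetical names, and a real TokenFilter would do the same bookkeeping inside incrementToken() against the term attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Lucene-free model of the flag approach; all names here are hypothetical.
final class AbbreviationFlagSketch {
    private final Set<String> abbreviations;
    private boolean lastTokenWasAbbreviation = false;

    AbbreviationFlagSketch(Set<String> abbreviations) {
        this.abbreviations = abbreviations;
    }

    /** Returns true if the token should be emitted, false if it is dropped. */
    boolean accept(String token) {
        if (token.equals(".")) {
            // Drop the "." only when the previous token was an abbreviation.
            boolean drop = lastTokenWasAbbreviation;
            lastTokenWasAbbreviation = false;
            return !drop;
        }
        // One set lookup per token; no previous token needs to be stored.
        lastTokenWasAbbreviation = abbreviations.contains(token);
        return true;
    }

    /** Runs the filter over a whole token list and collects what is emitted. */
    static List<String> filter(List<String> tokens, Set<String> abbreviations) {
        AbbreviationFlagSketch sketch = new AbbreviationFlagSketch(abbreviations);
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (sketch.accept(token)) {
                out.add(token);
            }
        }
        return out;
    }
}
```

With the abbreviation set {"mr"}, the tokens "hello", "mr", ".", "shai" come out as "hello", "mr", "shai". Note the trade-off raised in the thread: this variant performs a lookup for every token, whereas storing the previous term defers the lookup until a "." is actually seen.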
