lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: How to deal with Token in the new TS API
Date Sun, 22 Nov 2009 19:03:03 GMT
I said you *could*, if it were exposed. But State is a holder class
without functionality. Because its internals are implementation dependent,
maybe we will add such a thing in the future. But if the State contained a
real map, capturing would be slow, because each captureState call would
need to fill the map. And if you use Token as the AttributeImpl, the State
will contain only one entry. You cannot control which attribute is
implemented by which impl, so the map approach would never work correctly.
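
To illustrate the "holder without functionality" point, here is a minimal, self-contained sketch in plain Java. MiniImpl, MiniTerm, MiniOffset, MiniState and MiniSource are hypothetical stand-ins invented for this example; they are not the Lucene classes, and the real internals may differ. The sketch only shows the idea: capturing copies each implementation object into a linked chain, restoring walks that chain in the same order, and nothing is keyed by attribute type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an attribute implementation (not Lucene's AttributeImpl). */
abstract class MiniImpl {
    abstract MiniImpl copy();
    abstract void copyTo(MiniImpl target);
}

final class MiniTerm extends MiniImpl {
    String text = "";
    @Override MiniImpl copy() { MiniTerm c = new MiniTerm(); c.text = text; return c; }
    @Override void copyTo(MiniImpl target) { ((MiniTerm) target).text = text; }
}

final class MiniOffset extends MiniImpl {
    int start, end;
    @Override MiniImpl copy() { MiniOffset c = new MiniOffset(); c.start = start; c.end = end; return c; }
    @Override void copyTo(MiniImpl target) {
        MiniOffset t = (MiniOffset) target;
        t.start = start;
        t.end = end;
    }
}

/** Pure holder: one copied implementation object plus a link to the next node. */
final class MiniState {
    MiniImpl impl;
    MiniState next;
}

final class MiniSource {
    private final List<MiniImpl> impls = new ArrayList<>();

    <T extends MiniImpl> T add(T impl) { impls.add(impl); return impl; }

    /** Copies every implementation object into a chain; no map is built. */
    MiniState captureState() {
        MiniState head = null, tail = null;
        for (MiniImpl impl : impls) {
            MiniState node = new MiniState();
            node.impl = impl.copy();
            if (head == null) head = node; else tail.next = node;
            tail = node;
        }
        return head;
    }

    /** Walks the chain in the same order and copies the values back. */
    void restoreState(MiniState state) {
        int i = 0;
        for (MiniState s = state; s != null; s = s.next) {
            s.impl.copyTo(impls.get(i++));
        }
    }
}
```

If a single implementation object backed every attribute (as Token does), the chain in this model would have exactly one node, so there would be nothing per-attribute to look up.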

 

You can allocate a TermAttributeImpl and use copyTo, but you should create the
instance using the same factory that the TokenStream uses:

 

TermAttribute copy = (TermAttribute)
getAttributeFactory().createAttributeInstance(TermAttribute.class);

 

That way you guarantee that both are of the same implementation type.
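
A small self-contained sketch of the same idea in plain Java, for readers without Lucene at hand. SimpleTermImpl, SimpleTermFactory and SimpleStream are hypothetical names made up for this example and only model the behaviour discussed in this thread: the saved copy is created by the stream's own factory, and copyTo reuses the target's char[] when it is big enough.

```java
/** Hypothetical stand-in for a term attribute implementation (not TermAttributeImpl). */
final class SimpleTermImpl {
    char[] buffer = new char[8];
    int length;

    void setTerm(String text) {
        ensureCapacity(text.length());
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
    }

    /** Copies this term into target, reusing target's char[] when it is big enough. */
    void copyTo(SimpleTermImpl target) {
        target.ensureCapacity(length);
        System.arraycopy(buffer, 0, target.buffer, 0, length);
        target.length = length;
    }

    String term() { return new String(buffer, 0, length); }

    private void ensureCapacity(int needed) {
        if (buffer.length < needed) buffer = new char[needed];
    }
}

/** Hypothetical factory; the stream's attribute and the saved copy both come from it. */
interface SimpleTermFactory {
    SimpleTermImpl create();
}

final class SimpleStream {
    private final SimpleTermFactory factory;
    final SimpleTermImpl term;

    SimpleStream(SimpleTermFactory factory) {
        this.factory = factory;
        this.term = factory.create();
    }

    SimpleTermFactory getFactory() { return factory; }
}
```

In this model a filter would call stream.getFactory().create() once, keep the result as a field, and call stream.term.copyTo(copy) for each token it wants to remember, so no new char[] is allocated as long as the terms fit.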

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

> -----Original Message-----

> From: Shai Erera [mailto:serera@gmail.com]

> Sent: Sunday, November 22, 2009 7:53 PM

> To: java-user@lucene.apache.org

> Subject: Re: How to deal with Token in the new TS API

> 

> Yes I can clone the term itself by instantiating a TermAttributeImpl, which
> is better than storing the String, because the latter always allocates
> char[], while the former will reuse the char[] if it's big enough.
>
> What if State included a HashMap of all attributes, in addition to its
> "linked-list" structure?
>
> Anyway, you mention that I can iterate on all Attributes of a State, but
> it's not clear to me how to do it, since I don't see any relevant method in
> its API. Am I missing something?

> 

> Shai

> 

> On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> 

> > > Because that'd mean I'll check for abbreviations for every token. Which
> > > is a big performance loss. That way, I can just check abbr if I
> > > encountered a "." (not even all end-of-sentence tokens).

> >

> > OK, then simply copy the term to a String and store it. The cost is the
> > same as cloning/copying. If you find the ".", use the String and look it up.
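
A minimal sketch of this approach in plain Java, working on a list of strings instead of a real TokenStream. LazyAbbreviationDotRemover and its filter method are hypothetical names for illustration only. The previous term is remembered as a String, and the abbreviation set is consulted only when a "." arrives. In a real filter the term text would first have to be copied out of the reused term buffer, which is the allocation cost mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: drops a "." that directly follows a known abbreviation. */
final class LazyAbbreviationDotRemover {
    static List<String> filter(List<String> tokens, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        String previousTerm = null;              // stored copy of the last emitted term
        for (String token : tokens) {
            // The set lookup happens only when the current token is ".".
            if (token.equals(".") && previousTerm != null
                    && abbreviations.contains(previousTerm)) {
                previousTerm = null;             // discard the "." and forget the term
                continue;
            }
            previousTerm = token;
            out.add(token);
        }
        return out;
    }
}
```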

> >

> > > Why can't State offer a "getAttribute" like AttributeSource?

> >

> > Because State is optimized for fast restore. In previous 2.9 versions, State
> > was itself an AttributeSource instance, but the capture/store was very,
> > very slow.
> >
> > If you want to check a State, you would need to iterate over all
> > attributes and find the correct one, which is also slow. The best is to
> > simply clone the term text as a string. You must create new objects in all
> > cases, even with clone/copy.

> >

> > Uwe

> >

> > > Shai

> > >

> > > On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > >
> > > > If you just want to look up whether "Mr" is an abbreviation, why not look
> > > > it up when you handle that token and set a boolean variable in the TS
> > > > (lastTokenWasAbbreviation)? When you process the ".", remove it if the
> > > > boolean is set.
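
The flag approach can be sketched in plain Java on a list of strings rather than a real TokenStream. AbbreviationDotRemover is a hypothetical class written for this illustration; only the lastTokenWasAbbreviation name comes from the message above. Note that the abbreviation set is consulted for every token that is kept, which is the cost objected to in the reply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: remembers only a boolean about the previous token. */
final class AbbreviationDotRemover {
    private final Set<String> abbreviations;

    AbbreviationDotRemover(Set<String> abbreviations) {
        this.abbreviations = abbreviations;
    }

    List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean lastTokenWasAbbreviation = false;
        for (String token : tokens) {
            if (token.equals(".") && lastTokenWasAbbreviation) {
                lastTokenWasAbbreviation = false;   // drop the "." after an abbreviation
                continue;
            }
            // Looked up for every kept token, not only when a "." is seen.
            lastTokenWasAbbreviation = abbreviations.contains(token);
            out.add(token);
        }
        return out;
    }
}
```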

> > > >

> > > > Uwe

> > > >

> > > > -----

> > > > Uwe Schindler

> > > > H.-H.-Meier-Allee 63, D-28213 Bremen

> > > > http://www.thetaphi.de

> > > > eMail: uwe@thetaphi.de

> > > >

> > > >

> > > > > -----Original Message-----

> > > > > From: Shai Erera [mailto:serera@gmail.com]

> > > > > Sent: Sunday, November 22, 2009 3:28 PM

> > > > > To: java-user@lucene.apache.org

> > > > > Subject: Re: How to deal with Token in the new TS API

> > > > >

> > > > > What I've done is:

> > > > >

> > > > > State state = in.captureState();

> > > > > ...

> > > > > // Upon new call to incrementToken().

> > > > > State tmp = in.captureState();

> > > > > in.restoreState(state);

> > > > > // check if termAttribute is an abbreviation.

> > > > > If not : in.restoreState(tmp);

> > > > >

> > > > > > But seems a lot of capturing/restoring to me ... how expensive is that?

> > > > >

> > > > > Shai

> > > > >

> > > > > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera <serera@gmail.com> wrote:
> > > > > >
> > > > > > > Perhaps I misunderstand something. The current use case I'm trying to
> > > > > > > solve is - I have an abbreviations TokenFilter which reads a token and
> > > > > > > stores it. If the next token is end-of-sentence, it checks whether the
> > > > > > > previous one is in the abbreviations list, and discards the
> > > > > > > end-of-sentence token. I need to store the first token somewhere so I
> > > > > > > can reference it.
> > > > > > >
> > > > > > > Example: "hello mr. shai"
> > > > > > > First token = hello -> store it and return
> > > > > > > Second token = mr -> store it and return
> > > > > > > Third token = "." -> check if "mr" is an abbreviation, if so don't
> > > > > > > return ".".
> > > > > > > Fourth token = "shai" -> store it and return.
> > > > > > > ...
> > > > > > >
> > > > > > > How do I store "mr" (or any of the others)? It was easy w/ copyTo. If I
> > > > > > > captureState, I get a State, but I can't query it for a TermAttribute.
> > > > > > > Any ideas?

> > > > > >

> > > > > > Shai

> > > > > >

> > > > > >

> > > > > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > >
> > > > > > >> Use captureState and save the state somewhere. You can restore the state
> > > > > > >> with restoreState to the TokenStream. CachingTokenFilter does this.
> > > > > > >>
> > > > > > >> So the new API uses the State object to put away tokens for later
> > > > > > >> reference.

> > > > > >>

> > > > > >> -----

> > > > > >> Uwe Schindler

> > > > > >> H.-H.-Meier-Allee 63, D-28213 Bremen

> > > > > >> http://www.thetaphi.de

> > > > > >> eMail: uwe@thetaphi.de

> > > > > >>

> > > > > >> > -----Original Message-----

> > > > > >> > From: Shai Erera [mailto:serera@gmail.com]

> > > > > >> > Sent: Sunday, November 22, 2009 2:29 PM

> > > > > >> > To: java-user@lucene.apache.org

> > > > > >> > Subject: Re: How to deal with Token in the new TS API

> > > > > >> >

> > > > > > >> > ok so from what I understand, I should stop working w/ Token, and move
> > > > > > >> > to working w/ the Attributes.
> > > > > > >> >
> > > > > > >> > addAttribute indeed does not work. Even though it does not throw an
> > > > > > >> > exception, if I call in.addAttribute(Token.class), I get a new instance
> > > > > > >> > of Token and not the one that was added by in. So this is even more
> > > > > > >> > severe than just not blocking this option.
> > > > > > >> >
> > > > > > >> > I thought I can move to use addAttributeImpl, but that won't help me,
> > > > > > >> > because I won't be able to call getAttribute(Token.class).
> > > > > > >> >
> > > > > > >> > So this leaves me w/ just working w/ the interfaces.
> > > > > > >> >
> > > > > > >> > What do I need to do in order to clone an attribute? Previously I used
> > > > > > >> > token.copyTo(target). How can I do it now if I don't have copyTo on the
> > > > > > >> > interfaces, and/or clone?

> > > > > >> >

> > > > > >> > Shai

> > > > > >> >

> > > > > > >> > On Sun, Nov 22, 2009 at 2:58 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > > > > > >> >
> > > > > > >> > > > But I do use addAttribute(Token.class), so I don't understand why you
> > > > > > >> > > > say it's not possible. And I completely don't understand why the new
> > > > > > >> > > > API allows me to just work w/ interfaces and not impls ... A while ago
> > > > > > >> > > > I got the impression that we're trying to get rid of interfaces
> > > > > > >> > > > because they're not easy to maintain back-compat with ...

> > > > > >> > >

> > > > > > >> > > addAttribute(Token.class) should throw an Exception, but it doesn't
> > > > > > >> > > (it's a bug in 3.0). addAttribute should only affect interfaces, it also
> > > > > > >> > > accepts Token, because the AttributeFactory accepts it - bang.
> > > > > > >> > >
> > > > > > >> > > Sorry, but you can only pass attribute class literals to
> > > > > > >> > > addAttribute/getAttribute/hasAttribute and so on.

> > > > > >> > >

> > > > > >> > > Sorry.

> > > > > >> > >

> > > > > >> > > Uwe

> > > > > >> > >

> > > > > >> > >

> > > > > >> > >

> > > > > > >> > > ---------------------------------------------------------------------
> > > > > > >> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > >> > > For additional commands, e-mail: java-user-help@lucene.apache.org

> > > > > >> > >

> > > > > >> > >

> > > > > >>

> > > > > >>

> > > > > >>


> > > > > >>

> > > > > >>

> > > > > >

> > > >

> > > >


> > > >

> > > >

> >

> >


> >

> >

