lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: How to deal with Token in the new TS API
Date Sun, 22 Nov 2009 12:30:20 GMT
Thanks Uwe for the response, however that doesn't get me anywhere. I already
know that Token is added once, and that after I add Token I cannot add more
of them. And I understand the reason for the double printing.

I want to add Token.class, and then work w/ Token. Not TermAttribute,
PosIncrAttribute, OffsetAttribute, PayloadAttribute and TypeAttribute (these
are the five attributes I'm using from Token). Why can't the code add Token
to the attributes map? If all of these are anyway mapped to the same
instance, what problems will it cause?

What I'll do for now is call addAttribute(Token.class) which will return me
a Token. But, per the other thread, this behavior is buggy IMO, because I'd
then rely on the input TS to support Token, which may not be the case ...
So perhaps I can move to check whether all the attributes that I care about
are there. But this just complicates the code. If Token was added to the
Attributes map, I wouldn't need to do this juggling ...
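
To make the behavior concrete, here is a minimal sketch of how I understand it. It is a toy model, not Lucene code: AttributeMapSketch, MiniSource, addImpl and tokenOrNull are hypothetical names, and the nested Token/TermAttribute/OffsetAttribute types are stand-ins for the real classes. The assumption is that the source keys its map by the attribute interfaces an implementation provides and never by the implementation class itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: every type here is a hypothetical stand-in, not a Lucene class.
class AttributeMapSketch {
    interface Attribute {}
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}

    // Stand-in for a Token that implements several attribute interfaces.
    static final class Token implements TermAttribute, OffsetAttribute {}

    static final class MiniSource {
        // Keyed by attribute interface; the impl class is never used as a key.
        private final Map<Class<?>, Attribute> byInterface = new LinkedHashMap<>();

        // Register one impl instance under each attribute interface it directly implements.
        void addImpl(Attribute impl) {
            for (Class<?> iface : impl.getClass().getInterfaces()) {
                if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
                    byInterface.putIfAbsent(iface, impl);
                }
            }
        }

        boolean hasAttribute(Class<? extends Attribute> type) {
            return byInterface.containsKey(type);
        }

        <A extends Attribute> A getAttribute(Class<A> type) {
            return type.cast(byInterface.get(type));
        }
    }

    // The "juggling": check the interfaces I care about, then see whether
    // they are all backed by one and the same Token instance.
    static Token tokenOrNull(MiniSource src) {
        if (!src.hasAttribute(TermAttribute.class) || !src.hasAttribute(OffsetAttribute.class)) {
            return null;
        }
        Attribute term = src.getAttribute(TermAttribute.class);
        Attribute offset = src.getAttribute(OffsetAttribute.class);
        return (term == offset && term instanceof Token) ? (Token) term : null;
    }

    public static void main(String[] args) {
        MiniSource src = new MiniSource();
        src.addImpl(new Token());
        System.out.println(src.hasAttribute(Token.class));         // false
        System.out.println(src.hasAttribute(TermAttribute.class)); // true
        System.out.println(tokenOrNull(src) != null);              // true
    }
}
```

In this model the consumer can only recover the Token through one of its interfaces, which is the extra checking described above.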

Shai

On Sun, Nov 22, 2009 at 2:10 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> > To add to my previous email, If I do the following:
> >
> > StringReader sr = new StringReader("hello world");
> > TokenStream ts = new WhitespaceTokenizer(Token.TOKEN_ATTRIBUTE_FACTORY,
> > sr);
> >
> > for (Iterator<Class<? extends Attribute>> iter =
> > ts.getAttributeClassesIterator(); iter.hasNext();) {
> >   Class< ? extends Attribute> type = iter.next();
> >   System.out.println(type);
> > }
> >
> > TermAttribute ta = ts.getAttribute(TermAttribute.class);
> > OffsetAttribute oa = ts.getAttribute(OffsetAttribute.class);
> >
> > while (ts.incrementToken()) {
> >   System.out.println(ta + " " + oa);
> > }
> >
> > Then it prints:
> >
> > interface org.apache.lucene.analysis.tokenattributes.TermAttribute
> > interface org.apache.lucene.analysis.tokenattributes.TypeAttribute
> > interface org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute
> > interface org.apache.lucene.analysis.tokenattributes.FlagsAttribute
> > interface org.apache.lucene.analysis.tokenattributes.OffsetAttribute
> > interface org.apache.lucene.analysis.tokenattributes.PayloadAttribute
> > (hello,0,5) (hello,0,5)
> > (world,6,11) (world,6,11)
>
> That is correct, because you are iterating the attribute instances.
>
> > Reason for all the attributes - I use Token.TOKEN_ATTRIBUTE_FACTORY.
> > WhitespaceTokenizer, through CharTokenizer, adds just Term and Offset
> > attributes. However, TokenAttributeFactory's createAttributeInstance code
> > adds Token itself every time. That's because the code:
> >
> > return attClass.isAssignableFrom(Token.class) ? new Token() :
> > delegate.createAttributeInstance(attClass);
> >
> > always returns new Token(), since every Token can be assigned to
> > TermAttribute or OffsetAttribute. Shouldn't it be the other way around?
>
> No, that is exactly correct. If you add a TermAttribute to the TS and use
> the Token attribute factory, it *must* add all implemented attributes. This
> is why you cannot rely on (unused) attributes not already being in the TS.
> And by the way, Token is only added once to the TS; all 6 attributes (after
> a call to addAttribute) will return the same instance!
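
The difference between the two directions of isAssignableFrom discussed here can be checked with plain Java. This is a sketch with hypothetical stand-in types (AssignabilitySketch, CustomAttribute and the nested Token are not Lucene classes); only Class.isAssignableFrom itself is real API.

```java
// Stand-in types only; nothing here is taken from Lucene.
class AssignabilitySketch {
    interface Attribute {}
    interface TermAttribute extends Attribute {}
    interface OffsetAttribute extends Attribute {}
    interface CustomAttribute extends Attribute {} // hypothetical: Token does not implement it

    static class Token implements TermAttribute, OffsetAttribute {}

    // Direction quoted in the thread: true when a Token instance can be used as attClass.
    static boolean tokenCanServeAs(Class<?> attClass) {
        return attClass.isAssignableFrom(Token.class);
    }

    // Proposed reversal: true only when attClass is Token itself or a subclass of it.
    static boolean isTokenOrSubclass(Class<?> attClass) {
        return Token.class.isAssignableFrom(attClass);
    }

    public static void main(String[] args) {
        System.out.println(tokenCanServeAs(TermAttribute.class));    // true
        System.out.println(tokenCanServeAs(CustomAttribute.class));  // false
        System.out.println(isTokenOrSubclass(TermAttribute.class));  // false
        System.out.println(isTokenOrSubclass(Token.class));          // true
    }
}
```

With the first direction, a request for any interface Token implements is answered with a Token; with the reversed one, a request for TermAttribute would fall through to the delegate.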
>
> > I.e., we want to add Tokens, not classes Token implements. So I think it
> > should be Token.class.isAssignableFrom(attCls), and so only sub-classes of
> > Token will get added by this factory, otherwise it'll call the delegate?
>
> The AttributeSource only allows *one* instance per impl, so if you add one
> Token you cannot add more of them. The other way round, the TS will then
> automatically have all the attributes that Token implements.
>
> > Reason for the double printing ... the actual instance that gets added to
> > the map is of Token. Therefore regardless if I call
> > getAttribute(TermAttribute) or getAttribute(OffsetAttribute), I get the
> > Token instance. And when I print it, it calls Token.toString().
>
> The double printing cannot be removed. The simplest is to use
> TokenStream.toString() instead; it will present you a full snapshot as a
> string. This is exactly why Attribute does not implement
> toString(). The println works because javac casts to (Object).
>
> > It's strange ... I can't "addA(Token) -- hasA(Token)" but I can
> > "addA(Token) -- hasA(Term) -- getA(Term) -- cast to Token" ...
> >
> > I don't know if this is a bug or not, but it's strange.
> >
> > Shai
> >
> > On Sun, Nov 22, 2009 at 1:12 PM, Shai Erera <serera@gmail.com> wrote:
> >
> > > Hi
> > >
> > > I started to migrate my Analyzers, Tokenizer, TokenStreams and
> > > TokenFilters to the new API. Since the entire set of classes handled
> > > Token before, I decided to not change it for now, and was happy to
> > > discover that Token extends AttributeImpl, which makes the migration
> > > easier.
> > >
> > > So I started w/ my Tokenizer. I had a "private final Token token =
> > > addAttribute(Token.class);" line. I got startled when I received
> > > "java.lang.IllegalArgumentException: Could not find implementing class
> > > for org.apache.lucene.analysis.Token". I checked my classpath, tried to
> > > run from eclipse and cmd-line, nada. I then checked the source code, and
> > > discovered that the default attribute factory adds an "Impl" to the
> > > class name. So:
> > >
> > > 1) Phew ... nothing's wrong w/ my classpath.
> > > 2) Mental note - read the documentation more closely: in package.html
> > > it's said that if you implement an Attribute, make sure to add Impl to
> > > its class name, or otherwise you'll need to provide your own
> > > AttributeFactory.
> > > 3) But, why is the exception so vague? If Lucene adds "Impl" to the
> > > class name that I pass, shouldn't it also say that "... class for
> > > ....NameImpl"? That way, I'd see TokenImpl and immediately figure out
> > > that I should read the documentation.
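
As a rough illustration of point 3, a name-based lookup could report the derived class name in its error. This is only a sketch of the idea, assuming the lookup simply appends "Impl" to the name it is given; ImplLookupSketch, findImplClass and the nested types are hypothetical and are not the Lucene implementation.

```java
// Hypothetical sketch of an "append Impl" lookup with a more specific error message.
class ImplLookupSketch {
    interface Attribute {}
    interface TermAttribute extends Attribute {}

    // Found by the naming convention: TermAttribute + "Impl".
    static class TermAttributeImpl implements TermAttribute {}

    // No "TokenImpl" class exists, so a lookup for Token fails.
    static class Token implements TermAttribute {}

    static Class<?> findImplClass(Class<?> attClass) {
        String implName = attClass.getName() + "Impl";
        try {
            return Class.forName(implName, true, attClass.getClassLoader());
        } catch (ClassNotFoundException e) {
            // Naming the class that was actually searched for hints at the convention.
            throw new IllegalArgumentException(
                "Could not find implementing class " + implName
                    + " for " + attClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findImplClass(TermAttribute.class).getSimpleName());
        try {
            findImplClass(Token.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Here the failure for Token mentions a name ending in TokenImpl, which is the hint the original message lacks.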
> > >
> > > I then went on to read about AttributeFactory, and was wondering in the
> > > process why the hell do I need to implement one which is marked EXPERT
> > > whereas I use a "basic" Lucene class, when I discovered that Token
> > > includes a TokenAttributeFactory. So:
> > >
> > > 1) Good! I don't need to implement an AttributeFactory.
> > > 2) Why isn't it mentioned in the documentation? If Token was kept for
> > > easy migration from pre-2.9 API, I'd expect this to appear very clearly
> > > in package.html. Something like "if you're migrating from pre-2.9 API
> > > and would like to keep using Token, MAKE SURE TO CALL
> > > super(Token.TOKEN_ATTRIBUTE_FACTORY) IN YOUR TOKENIZER". Something like
> > > that, maybe with less upper-casing.
> > >
> > > I went on and moved the addAttribute line to inside the ctor, after I
> > > call super(...). But then something else hit me. In my TokenFilters I
> > > call input.hasAttribute(Token.class) to ensure the input TS will process
> > > Token. I was surprised to find out this method returns 'false'.
> > > Debug-tracing the code I discovered that when I call addAttribute, all
> > > the Attribute classes Token implements are added to the map, but not
> > > Token itself. So:
> > >
> > > 1) Hmmm ... not so easy to migrate my Token-based API to the new API ...
> > > 2) I assume getAttribute(Token.class) won't work either ... so what
> > > benefit did I get from calling addAttribute(Token.class) in the first
> > > place? Now I need, in my consumer API, to rebuild a Token on every
> > > incrementToken call?
> > > 3) Isn't that a crime? I added X and called has(X) and got false ...
> > > again documentation could help, but I get a sense that this is buggy
> > > behavior.
> > >
> > > Before you answer that I can call getAttribute(TermAttribute.class),
> > > remember that I started this email as a user that wants to migrate to a
> > > new API, and the documentation says I can use Token for easier
> > > migration. So using all the other attributes is a less preferred option
> > > now, especially as I'm not going to introduce, at the moment, new
> > > attributes, but just continue to work with the 'default' ones.
> > >
> > > Any help will be appreciated. I really hope I'm missing something
> > > basic ...
> > >
> > > Shai
> > >