lucene-general mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Custom TokenStream + custom Attributes
Date Tue, 31 May 2016 19:38:20 GMT
Hi Michal,

Please repost on the lucene-user list (java-user@lucene.apache.org).  general@l.a.o has
fewer subscribers, and it’s not focussed on Lucene usage questions.

More info: <http://lucene.apache.org/core/discussion.html#java-user-list-java-userlucene>

--
Steve
www.lucidworks.com

> On May 31, 2016, at 9:58 AM, Michal Krajňanský <michal.krajnansky@gmail.com> wrote:
> 
> Dear Lucene users,
> 
> I have implemented a custom tokenizer (derived from TokenStream).
> 
> In addition to the standard Lucene attributes (PositionIncrementAttribute,
> OffsetAttribute), I need to pass an extra attribute that represents the
> token's position in the tokenized sentence counted in words rather than in
> characters, which is what one usually passes through OffsetAttribute.
> (I need both.)
> 
> Is there a way of achieving this?
> 
> I tried to implement my own Attribute class (a new interface plus an
> implementing class). The code compiles fine, but at runtime I get a
> class cast exception.
> 
> Thank you a lot in advance,
> 
> 
> MK
> 
> 
> 
> FYI the code looks like this:
> 
> /**
> *
> */
> package com.newstin.nlp.analysis;
> 
> import java.io.IOException;
> import java.util.Iterator;
> import java.util.List;
> 
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> 
> /**
> * @author michal
> */
> public class TermsListTokenizer extends TokenStream
> {
>    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
> 
>    private final Iterator<Term> termIterator;
>    private int lastTermPos;
> 
>    public TermsListTokenizer(List<Term> terms)
>    {
>        termIterator = terms.iterator();
>        lastTermPos = -1;
>    }
> 
>    @Override
>    public boolean incrementToken() throws IOException
>    {
>        clearAttributes();
> 
>        // TODO: check: compute the positions right for term variants !!!
>        if (termIterator.hasNext()) {
>            Term term = termIterator.next();
> 
>            termAtt.append(term.getTerm());
>            // need to also save position in the number of words
>            offsetAtt.setOffset(term.getStart(), term.getEnd());
>            posIncrAtt.setPositionIncrement(term.getWordIndex() - lastTermPos);
>            lastTermPos = term.getWordIndex();
>            return true;
>        }
> 
>        return false;
>    }
> }
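
A note on the arithmetic in incrementToken() above: lastTermPos starts at -1
and each increment is term.getWordIndex() - lastTermPos, so a running sum of
the position increments gives back the word index on the consumer side. The
sketch below shows this with plain integers and no Lucene classes. The class
and method names (PositionIncrementSketch, toIncrements, toWordIndexes) are
made up for illustration. It assumes the terms arrive sorted by word index,
because setPositionIncrement rejects negative values. This does not replace a
custom attribute if the word index has to travel with each token, but it may
be enough when the consumer already reads PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementSketch {

    // Mirrors the arithmetic in incrementToken(): each increment is the
    // distance from the previous term's word index, starting from -1.
    public static List<Integer> toIncrements(int[] wordIndexes) {
        List<Integer> increments = new ArrayList<>();
        int lastTermPos = -1;
        for (int wordIndex : wordIndexes) {
            increments.add(wordIndex - lastTermPos);
            lastTermPos = wordIndex;
        }
        return increments;
    }

    // Consumer side: a running sum of the increments, again starting
    // from -1, recovers the original word index of every token.
    public static List<Integer> toWordIndexes(List<Integer> increments) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical input: two term variants share word index 1,
        // and word index 2 produced no term at all.
        int[] wordIndexes = {0, 1, 1, 3};
        List<Integer> increments = toIncrements(wordIndexes);
        System.out.println(increments);                // prints [1, 1, 0, 2]
        System.out.println(toWordIndexes(increments)); // prints [0, 1, 1, 3]
    }
}
```

An increment of 0 marks a variant stacked on the same word, and an increment
greater than 1 marks a gap where a word produced no term.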

