lucene-solr-user mailing list archives

From "Yuri Jan" <vao...@gmail.com>
Subject Re: Different tokenizing algorithms for the same stream
Date Fri, 07 Nov 2008 16:23:18 GMT
I'm subclassing my own tokenizer.
I'm not sure, though, whether I can rely on this tokenizer being used
sequentially for this field.
I'm going to use it with different fields, and I don't want the member
variable to carry over when tokenizing different fields, or even the same
field on different docs.

In other words - can I assume that as long as I don't reach the input Reader
end of stream my tokenizer will be used only with this specific stream?
If the answer is yes, adding a status member variable is indeed my solution.
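
To make the idea concrete, here is a minimal sketch of the "status member variable" approach. It is deliberately not written against the Lucene API: PhaseSwitchingTokenizer, switchAfter, emitted and the two splitting rules are all made-up names and choices for illustration. The only assumption it encodes is the one asked about above, namely that one instance serves one Reader until end of stream; the reset method is there in case an instance does get handed a new stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for a custom tokenizer (not a Lucene class).
 * It keeps per-stream state in a member variable and switches the
 * splitting rule after the first `switchAfter` tokens.
 */
final class PhaseSwitchingTokenizer {
    private Reader input;
    private final int switchAfter; // the "X" from the original question
    private int emitted;           // status member: tokens delivered for the current stream

    PhaseSwitchingTokenizer(Reader input, int switchAfter) {
        this.input = input;
        this.switchAfter = switchAfter;
    }

    /** Returns the next token, or null once the stream is exhausted. */
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            char ch = (char) c;
            if (isTokenChar(ch)) {
                sb.append(ch);
            } else if (sb.length() > 0) {
                break; // a delimiter ends the current token
            }
            // delimiters seen before any token character are skipped
        }
        if (sb.length() == 0) {
            return null;
        }
        emitted++;
        return sb.toString();
    }

    /** Phase 1: split on whitespace only. Phase 2: keep only letters and digits. */
    private boolean isTokenChar(char ch) {
        if (emitted < switchAfter) {
            return !Character.isWhitespace(ch);
        }
        return Character.isLetterOrDigit(ch);
    }

    /** Must be called if the same instance is reused for another stream. */
    void reset(Reader newInput) {
        this.input = newInput;
        this.emitted = 0;
    }

    public static void main(String[] args) throws IOException {
        PhaseSwitchingTokenizer t =
                new PhaseSwitchingTokenizer(new StringReader("a-b c-d e-f g-h"), 2);
        List<String> out = new ArrayList<>();
        for (String tok = t.next(); tok != null; tok = t.next()) {
            out.add(tok);
        }
        System.out.println(out); // prints [a-b, c-d, e, f, g, h]
    }
}
```

If the instance were reused for a second stream without calling reset, the counter would still be past X and the second stream would be tokenized entirely with the second rule, which is exactly the situation I want to rule out.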

Thanks,
-J

On Fri, Nov 7, 2008 at 10:51 AM, Jérôme Etévé <jerome.eteve@gmail.com> wrote:

> Hi,
>
>  you have to keep track of the character position yourself in your
> custom Tokenizer.
>
>  See org.apache.lucene.analysis.CharTokenizer for a starting example.
>
>  Cheers,
>
>  J.
>
>
> On Fri, Nov 7, 2008 at 3:33 PM, Yoav Caspi <yoavca@gmail.com> wrote:
> > Thanks, Jerome.
> >
> > My problem is that in Token next(Token result) there is no information
> > about the location inside the stream.
> > I can read characters from the input Reader, but couldn't find a way to
> > know if it's the beginning of the input or not.
> >
> > -J
> >
> > On Fri, Nov 7, 2008 at 6:13 AM, Jérôme Etévé <jerome.eteve@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >>  I think you could implement your personalized tokenizer so that it
> >> changes its behaviour after it has delivered X tokens.
> >>
> >> This assumes a new tokenizer instance is built by the factory for
> >> every string analyzed, which I believe is true.
> >>
> >> Can this be confirmed ?
> >>
> >> Cheers !
> >>
> >> Jerome.
> >>
> >>
> >> On Thu, Nov 6, 2008 at 11:08 PM, Yuri Jan <vaoyca@gmail.com> wrote:
> >> > Hello all,
> >> >
> >> > I'm trying to implement a tokenizer that will behave differently on
> >> > different parts of the incoming stream.
> >> > For example, for the first X words in the stream I would like to use
> >> > one tokenizing algorithm, while for the rest of the stream a different
> >> > tokenizing algorithm will be used.
> >> >
> >> > What is the best way to implement that?
> >> > Where should I store this stream-related data?
> >> >
> >> > Thanks,
> >> > Yuri
> >> >
> >>
> >>
> >>
> >> --
> >> Jerome Eteve.
> >>
> >> Chat with me live at http://www.eteve.net
> >>
> >> jerome@eteve.net
> >
> >
>
>
>
> --
> Jerome Eteve.
>
> Chat with me live at http://www.eteve.net
>
> jerome@eteve.net
