lucene-solr-user mailing list archives

From "Jérôme Etévé" <jerome.et...@gmail.com>
Subject Different tokenizing algorithms for the same stream
Date Fri, 07 Nov 2008 15:51:24 GMT
Hi,

  you have to keep track of the character position yourself in your
custom Tokenizer.

  See org.apache.lucene.analysis.CharTokenizer for a starting example.
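
  A minimal sketch of that idea in plain Java, for illustration only. It
  does not use the Lucene API: the class PositionAwareTokenizer, its
  methods, and the two splitting rules are all hypothetical. It only
  borrows the idea of an isTokenChar test. It assumes, as suggested
  further down the thread, that a fresh instance is created for every
  string analyzed, so the counters can be plain instance fields.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical stand-alone sketch; this is NOT Lucene's Tokenizer API. */
class PositionAwareTokenizer {
    private final Reader input;
    private final int switchAfter;   // the "X" from the question
    private int offset = 0;          // characters consumed from the Reader so far
    private int tokensEmitted = 0;   // tokens returned so far
    private int tokenStart = -1;     // start offset of the last returned token

    PositionAwareTokenizer(Reader input, int switchAfter) {
        this.input = input;
        this.switchAfter = switchAfter;
    }

    /**
     * Rule 1 (first X tokens): only letters and digits belong to a token.
     * Rule 2 (afterwards): anything that is not whitespace belongs to a token.
     * Both rules are placeholders chosen for the example.
     */
    private boolean isTokenChar(int c) {
        return tokensEmitted < switchAfter
                ? Character.isLetterOrDigit(c)
                : !Character.isWhitespace(c);
    }

    /** Returns the next token, or null at end of input. */
    String next() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                offset++;                       // we count positions ourselves
                if (isTokenChar(c)) {
                    if (sb.length() == 0) {
                        tokenStart = offset - 1;
                    }
                    sb.append((char) c);
                } else if (sb.length() > 0) {
                    break;                      // delimiter ends the token and is consumed
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sb.length() == 0) {
            return null;
        }
        tokensEmitted++;                        // may flip the rule for the next call
        return sb.toString();
    }

    /** Start offset of the last token returned by next(); 0 means start of input. */
    int lastTokenStart() {
        return tokenStart;
    }
}
```

  With the input "a-b c-d e-f" and X = 2, this sketch returns a, b, c-d,
  e-f, starting at offsets 0, 2, 4 and 8. Note that the delimiter which
  ends a token is consumed, so it is never re-examined under the second
  rule; a real implementation may want to handle that boundary more
  carefully.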

  Cheers,

  J.


On Fri, Nov 7, 2008 at 3:33 PM, Yoav Caspi <yoavca@gmail.com> wrote:
> Thanks, Jerome.
>
> My problem is that in Token next(Token result) there is no information about
> the location inside the stream.
> I can read characters from the input Reader, but couldn't find a way to know
> if it's the beginning of the input or not.
>
> -J
>
> On Fri, Nov 7, 2008 at 6:13 AM, Jérôme Etévé <jerome.eteve@gmail.com> wrote:
>>
>> Hi,
>>
>>  I think you could implement your custom tokenizer so that it
>> changes its behaviour after it has delivered X tokens.
>>
>> This implies that a new tokenizer instance is built by the factory
>> for every string analyzed, which I believe is true.
>>
>> Can this be confirmed?
>>
>> Cheers!
>>
>> Jerome.
>>
>>
>> On Thu, Nov 6, 2008 at 11:08 PM, Yuri Jan <vaoyca@gmail.com> wrote:
>> > Hello all,
>> >
>> > I'm trying to implement a tokenizer that will behave differently on
>> > different parts of the incoming stream.
>> > For example, for the first X words in the stream I would like to use one
>> > tokenizing algorithm, while for the rest of the stream a different
>> > tokenizing algorithm will be used.
>> >
>> > What is the best way to implement that?
>> > Where should I store this stream-related data?
>> >
>> > Thanks,
>> > Yuri
>> >
>>
>>
>>
>> --
>> Jerome Eteve.
>>
>> Chat with me live at http://www.eteve.net
>>
>> jerome@eteve.net
>
>



--
Jerome Eteve.

Chat with me live at http://www.eteve.net

jerome@eteve.net


