lucene-java-user mailing list archives

From: Simon Willnauer <simon.willna...@googlemail.com>
Subject: Re: Splitting word tokens - other languages
Date: Sat, 19 Feb 2011 23:23:37 GMT
Hey,

I am not an expert on this, but I think you should look into
CJKAnalyzer / CJKTokenizer.
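
For example, something along these lines (a minimal sketch, assuming the
Lucene 3.0.x contrib analyzers are on the classpath; the field name and the
sample text are placeholders) produces overlapping character bigrams for
Hangul and other CJK input:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class CjkDemo {
  public static void main(String[] args) throws Exception {
    // CJKAnalyzer tokenizes CJK text (including Hangul) into overlapping
    // character bigrams; Latin text passes through as whole words.
    CJKAnalyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);

    // "body" and the sample string are placeholders for illustration only.
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("안녕하세요 루씬"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
    ts.close();
  }
}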

simon

On Thu, Feb 17, 2011 at 8:05 PM, CassUser CassUser <cassuser@gmail.com> wrote:
> Hey all,
>
> I'm somewhat new to Lucene; I used it some time ago for a parser we
> wrote to tokenize documents into word grams.
>
> The approach I took was simple:
>
> 1. Extended the Lucene Analyzer class.
> 2. In the tokenStream method, used ShingleMatrixFilter, passing in the
> standard tokenizer and the shingle min/max/splitter settings (a rough
> sketch follows the quoted message below).
>
> This worked pretty well for us.  Now we would like to tokenize
> Hangul/Korean into word grams.
>
> I'm curious whether others have done something similar and would share
> their experience.  Any pointers to get started with this would be great.
>
> Thanks.
>
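
For reference, the approach described in the quoted message might look roughly
like the following against Lucene 3.0.x (the shingle min/max sizes and the
spacer character are illustrative values, not the original poster's settings):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Rough sketch of a custom Analyzer that turns word tokens into word grams
// (shingles). The constants below are illustrative, not the original settings.
public class WordGramAnalyzer extends Analyzer {
  private static final int MIN_SHINGLE = 2;    // smallest gram size
  private static final int MAX_SHINGLE = 3;    // largest gram size
  private static final Character SPACER = '_'; // joins words within a shingle

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // StandardTokenizer splits the input into word tokens; ShingleMatrixFilter
    // then emits overlapping grams of MIN_SHINGLE..MAX_SHINGLE tokens.
    TokenStream source = new StandardTokenizer(Version.LUCENE_30, reader);
    return new ShingleMatrixFilter(source, MIN_SHINGLE, MAX_SHINGLE, SPACER);
  }
}

Swapping StandardTokenizer for CJKTokenizer in tokenStream would be one way to
feed Hangul bigrams into the same shingle filter, per the suggestion above.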

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

