lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From CassUser CassUser <>
Subject Splitting word tokens - other languages
Date Thu, 17 Feb 2011 19:05:45 GMT
Hey all,

I'm somewhat new to Lucene.  Meaning I used it some time ago for a parser we
wrote to tokenize a document into word grams.

the approach I took was simple as follows:

1. extended the lucene Analyzer
2. In the tokenStream method use ShingleMatrixFilter.  Passed in the
standard tokenizer, and shingle min/max/splitter.

This worked pretty well for us.  Now we would like to tokenize hangul/korean
into word grams.

I'm curious others have done something similar and would share their
experience.  Any pointers to get started with this would be great.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message