lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Date Thu, 02 Oct 2014 14:01:52 GMT
Paul,

You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which improves on UAX#29 Word Break rule conformance for some languages, and also finds token boundaries when the writing system (a.k.a. script) changes.  It is intended to be extensible per script.

The root break iterator used by DefaultICUTokenizerConfig also ignores punctuation.  You can
find its grammar at:

    lucene/analysis/icu/src/data/uax29/Default.rbbi
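To illustrate the script-change behaviour described above (this is not the ICUTokenizer API itself, just a plain-Java sketch of the idea): split a string wherever the Unicode script changes, dropping punctuation and whitespace. The real ICU break rules are considerably more nuanced, e.g. around COMMON/INHERITED characters.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptBoundaryDemo {
    // Split text wherever the Unicode script changes, skipping
    // non-letter/digit code points. COMMON-script characters (e.g. digits)
    // are merged into the neighbouring run for simplicity.
    static List<String> splitOnScriptChange(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript prev = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (!Character.isLetterOrDigit(cp)) {
                // punctuation/whitespace ends the current token
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                prev = null;
            } else {
                if (prev != null && script != prev
                        && script != Character.UnicodeScript.COMMON) {
                    // script changed mid-run: emit a token boundary
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                current.appendCodePoint(cp);
                if (script != Character.UnicodeScript.COMMON) {
                    prev = script;
                }
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // mixed Latin and Han text splits at the script boundaries
        System.out.println(splitOnScriptChange("foo日本語bar"));
    }
}
```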

Steve

On Oct 1, 2014, at 4:22 PM, Paul Taylor <paul_t100@fastmail.fm> wrote:

> On 01/10/2014 18:42, Steve Rowe wrote:
>> Paul,
>> 
>> Boilerplate upgrade recommendation: consider using the most recent Lucene release
>> (4.10.1) - it’s the most stable, performant, and featureful release available, and many
>> bugs have been fixed since the 4.1 release.
> Yeah, sure, I did try this and hit a load of errors, but I certainly will do so.
>> FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean,
>> Thai, and other languages that don’t use whitespace to denote word boundaries, except those
>> around punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified
>> Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
> So for Chinese, Japanese, Korean, Thai, etc., it's just identifying that the chars are
> from said language, and then we can do something clever with them in subsequent filters
> such as CJKBigramFilter, right?
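For illustration, the overlapping-bigram ("shingling") idea behind CJKBigramFilter can be sketched in plain Java. This is an assumption-laden sketch of what the filter produces for a run of CJK characters, not the filter's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramDemo {
    // Emit overlapping bigrams from a run of CJK characters -- the same
    // shingling idea CJKBigramFilter applies to Han/Hiragana/etc. tokens.
    static List<String> bigrams(String cjkRun) {
        List<String> out = new ArrayList<>();
        if (cjkRun.codePointCount(0, cjkRun.length()) < 2) {
            out.add(cjkRun);  // a single character passes through as-is
            return out;
        }
        int i = 0;
        while (true) {
            int next = cjkRun.offsetByCodePoints(i, 1);
            if (next >= cjkRun.length()) {
                break;  // no second character left to pair with
            }
            int end = cjkRun.offsetByCodePoints(next, 1);
            out.add(cjkRun.substring(i, end));  // two-character shingle
            i = next;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("日本語"));  // [日本, 本語]
    }
}
```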
> My big trouble is that my code is meant to deal with any language, and I don't know what
> language it is in except by looking at the characters themselves, AND I also have to deal
> with input that contains symbols, odd punctuation, etc.
>> It is possible to construct a tokenizer based on pure Java code - there are
>> several examples of this in Lucene 4.1; see e.g. PatternTokenizer, and CharTokenizer and its
>> subclasses WhitespaceTokenizer and LetterTokenizer.
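In that spirit, a minimal pure-Java tokenizer following CharTokenizer's contract (a single isTokenChar predicate plus optional per-character normalization) might look like the sketch below. The class and method names here are illustrative, and nothing in it depends on Lucene:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCharTokenizer {
    // Analogue of Lucene's CharTokenizer contract: one predicate decides
    // which code points belong inside a token.
    static boolean isTokenChar(int cp) {
        return Character.isLetterOrDigit(cp);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                // normalize while tokenizing, like CharTokenizer.normalize()
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);  // handles supplementary characters
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, wörld! 42"));
    }
}
```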
>> 
> Ah yes, I discovered this today. What I would really like is a version of the JFlex-based
> StandardTokenizer, but written in pure Java, making it easier to tweak; however, I'm a little
> concerned that if I naively write it from scratch I may create something that doesn't perform
> very well.
> 
> Paul
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 



