Date: Wed, 01 Oct 2014 21:22:50 +0100
From: Paul Taylor <paul_t100@fastmail.fm>
To: java-user@lucene.apache.org
Subject: Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
>
> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it's the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.
Yeah, sure. I did try this and hit a load of errors, but I certainly will
do so.
> FYI, StandardTokenizer doesn't find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don't use whitespace to denote word boundaries, except those around punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.
So for Chinese, Japanese, Korean, Thai etc. it's just identifying that the
chars are from said language, and then we can do something clever with
them in subsequent filters such as CJKBigramFilter, right?
My big trouble is that my code is meant to deal with any language, and I
don't know what language the text is in except by looking at the characters
themselves. I also have to deal with stuff that contains symbols, funny
punctuation etc.
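
To make "looking at the characters themselves" concrete, here is a minimal sketch in plain Java. It assumes that counting letters per Unicode script (via java.lang.Character.UnicodeScript) is a good enough signal. The class ScriptGuess and its methods are hypothetical names for illustration and are not part of Lucene.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: guesses the dominant script of a string. */
final class ScriptGuess {

    /** Returns the most frequent Unicode script among the letters of text, or COMMON if it has no letters. */
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        // Walk by code point so supplementary characters are not split in half.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, symbols and whitespace
            }
            counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum);
        }
        Character.UnicodeScript best = Character.UnicodeScript.COMMON;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** True for the scripts of the languages named in the quoted message (an assumed, incomplete list). */
    static boolean isCjkOrThai(Character.UnicodeScript script) {
        switch (script) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
            case THAI:
                return true;
            default:
                return false;
        }
    }
}
```

A caller could use the result to pick a filter chain per field value. Mixed-script text would need a per-token rather than a per-string decision, which this sketch does not attempt.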
> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.
>
Ah yes, I discovered this today. What I would really like is a version of
the JFlex StandardTokenizer written in pure Java, which would make it easier
to tweak, but I'm a little concerned that if I naively write it from
scratch I may create something that doesn't perform very well.
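
For a sense of scale, a naive pure-Java splitter can be a single linear pass with no regular expressions, as in the sketch below. It is only an illustration under simple assumptions: a token is a maximal run of letters or digits, and everything else is a separator. It does not reproduce the Unicode word-break rules (UAX #29) that StandardTokenizer follows, and SimpleWordSplitter is a hypothetical stand-alone class that does not extend any Lucene type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; it does not use or extend any Lucene class. */
final class SimpleWordSplitter {

    /** Decides whether a code point belongs inside a token; change this to keep symbols. */
    static boolean isTokenChar(int cp) {
        return Character.isLetterOrDigit(cp);
    }

    /** Splits text into maximal runs of token characters in one pass. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if outside a token
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                if (start < 0) {
                    start = i; // a new token begins here
                }
            } else if (start >= 0) {
                tokens.add(text.substring(start, i)); // token ended just before i
                start = -1;
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // flush a token that runs to the end
        }
        return tokens;
    }
}
```

The isTokenChar hook mirrors the idea behind the CharTokenizer subclasses mentioned in the quoted message, so the tweakable part stays in one small method. Whether this performs acceptably on real data would have to be measured.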

Paul
