From: Paul Taylor
Date: Wed, 01 Oct 2014 21:22:50 +0100
To: java-user@lucene.apache.org
Subject: Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
>
> Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it's the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release.

Yeah sure, I did try this and hit a load of errors, but I certainly will do so.

> FYI, StandardTokenizer doesn't find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don't use whitespace to denote word boundaries, except those around punctuation.
> Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese and Japanese: the smartcn and kuromoji analysis modules, respectively.

So for Chinese, Japanese, Korean, Thai etc. it's just identifying that the chars are from said language, and then we can do something clever with them in subsequent filters such as CJKBigramFilter, right? My big trouble is that my code is meant to deal with any language, and I don't know what language the text is in except by looking at the characters themselves, AND I also have to deal with stuff that contains symbols, funny punctuation etc.

> It is possible to construct a tokenizer just based on pure java code - there are several examples of this in Lucene 4.1, see e.g. PatternTokenizer, and CharTokenizer and its subclasses WhitespaceTokenizer and LetterTokenizer.

Ah yes, I discovered this today. What I would really like is a version of the JFlex StandardTokenizer but written in pure Java, making it easier to tweak, but I'm a little concerned that if I naively write it from scratch I may create something that doesn't perform very well.

Paul
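
To make the "pure Java" idea above a bit more concrete, here is a minimal sketch of a tokenizer that decides everything by looking at the characters themselves. It is not Lucene code and does not use any Lucene API: the class name ScriptAwareTokenizer, the method names, and the choice of scripts are all assumptions made for illustration. It breaks tokens at punctuation, symbols, and whitespace, and also wherever the text switches between a script that is written without spaces (Han, Hiragana, Katakana, Hangul, Thai here) and anything else, so that a later stage could treat those runs differently, for example by bigramming them.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not part of Lucene. */
public class ScriptAwareTokenizer {

    /** Illustrative, incomplete set of scripts flagged as "unsegmented". */
    static boolean isUnsegmentedScript(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL
            || script == UnicodeScript.THAI;
    }

    /** Letters, digits, and combining marks stay inside a token; everything else ends it. */
    static boolean isTokenChar(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) {
            return true;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsUnsegmented = false;

        // Iterate by code point so characters outside the BMP are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (!isTokenChar(cp)) {
                // Punctuation, symbols, and whitespace are dropped and end the current token.
                flush(tokens, current);
                continue;
            }
            boolean unsegmented = isUnsegmentedScript(cp);
            // A switch between unsegmented and other text also ends the current token.
            if (current.length() > 0 && unsegmented != currentIsUnsegmented) {
                flush(tokens, current);
            }
            current.appendCodePoint(cp);
            currentIsUnsegmented = unsegmented;
        }
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // The escaped characters are three Han ideographs.
        System.out.println(tokenize("Lucene 4.1, \u6771\u4eac\u90fd rocks!"));
    }
}
```

This is far cruder than the JFlex-generated StandardTokenizer, which follows the Unicode word-break rules; for instance, the sketch splits "4.1" into two tokens. It makes a single pass over the input with one reusable buffer, though a real Lucene Tokenizer would have to read from a Reader and fill token attributes rather than build a list of strings.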