From: Phil Whelan
To: java-user@lucene.apache.org
Date: Tue, 4 Aug 2009 08:19:06 -0700
Subject: Re: Searching doubt

On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
> 2) Use a dictionary (real dictionary), and search it for every substring,
> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
> This needs some fine tuning, like checking if the rest is also a word and if
> the full string is also a word, so that you don't break up meaningful words.
> You'll need to get a dictionary for that.

I do not have a solution to this, but it strikes me as very similar to the
way you traverse Japanese text to break it into words, since Japanese has no
spaces. Is there a Japanese tokenizer and, if so, does it handle this? If it
does, you could replace the Japanese dictionary with an English dictionary.

Just a random thought I had that might or might not help.
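
For what it's worth, the dictionary idea quoted above can be sketched roughly as follows. This is only a minimal illustration, assuming the dictionary fits in an in-memory Set; DictionarySplitter and its split method are hypothetical names and not part of Lucene. It leaves a string alone if it is already a word, tries the longest matching prefix first, and only accepts a split when the rest of the string also breaks into words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene class.
final class DictionarySplitter {

    private DictionarySplitter() {}

    // Returns the dictionary words that make up text, or text itself
    // (as a single element) when no complete split exists.
    static List<String> split(String text, Set<String> dictionary) {
        // Do not break up a string that is already a meaningful word.
        if (dictionary.contains(text)) {
            return Collections.singletonList(text);
        }
        List<String> words = new ArrayList<>();
        boolean[] deadEnd = new boolean[text.length() + 1];
        if (splitFrom(text, 0, dictionary, words, deadEnd)) {
            return words;
        }
        return Collections.singletonList(text);
    }

    // Tries every prefix starting at 'start', longest first, and backtracks
    // when the remainder cannot be split into dictionary words.
    private static boolean splitFrom(String text, int start, Set<String> dictionary,
                                     List<String> words, boolean[] deadEnd) {
        if (start == text.length()) {
            return true; // consumed the whole string
        }
        if (deadEnd[start]) {
            return false; // already known to be unsplittable from here
        }
        for (int end = text.length(); end > start; end--) {
            String candidate = text.substring(start, end);
            if (dictionary.contains(candidate)) {
                words.add(candidate);
                if (splitFrom(text, end, dictionary, words, deadEnd)) {
                    return true;
                }
                words.remove(words.size() - 1); // undo and try a shorter prefix
            }
        }
        deadEnd[start] = true;
        return false;
    }
}
```

With a dictionary containing "a", "about" and "us", calling DictionarySplitter.split("aboutus", dictionary) yields the two words "about" and "us", while "about" on its own is returned unsplit. A real word list would need the fine tuning mentioned in the quoted text, since many short words ("a", "an", "to") produce unwanted splits.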

Phil