Date: Tue, 4 Aug 2009 18:42:51 +0300
Subject: Re: Searching doubt
From: Shai Erera
To: java-user@lucene.apache.org

Interesting ... I don't have access to a Japanese dictionary, so I just
extract bi-grams. But I guess that in this case, if one can access an
English dictionary (are you aware of an "open-source" one, or a free one,
BTW?), one can use the method you mention.

Still, doing this for every token you meet is extremely expensive (for
Japanese it's all you can do, but this case is rather special), so I'd
first make sure I can pinpoint the very small number of candidate tokens
that should be processed like that.

Shai

On Tue, Aug 4, 2009 at 6:37 PM, Phil Whelan wrote:
> On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote:
> > Hi Darren,
> >
> > The question was how, given a string "aboutus" in a document, you can
> > return that document as a result for the query "about us" (note the
> > space). So we're mostly discussing how to detect and then break the
> > word "aboutus" into two words.
>
> When traversing Japanese text you have to use an algorithm similar to
> searching a maze (keep left and retrace your steps). It's possible to
> go a long way along a sentence before you find that the tokens you've
> already picked out are invalid. Rough example...
>
> thereallibrary
> there allibrary
> there all i brary (fail)
> the reallibrary
> the real library
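
For what it's worth, the "keep left and retrace your steps" walk in the quoted example can be sketched in a few lines. This is only a rough illustration, not anything Lucene ships: the function name `segment`, the toy word set `WORDS`, and the choice to try the longest prefix first are all assumptions made for the example. A real setup would load a proper dictionary and, as noted above, apply this only to the few tokens that look like run-together words.

```python
# Toy dictionary, assumed for illustration only; a real one would be far larger.
WORDS = {"the", "there", "real", "all", "i", "library", "about", "us"}


def segment(text, dictionary):
    """Split text into dictionary words by backtracking.

    Returns the first complete split found, or None if there is none.
    """
    dead_ends = set()  # start offsets already known to lead nowhere

    def walk(start):
        if start == len(text):
            return []  # consumed the whole string: success
        if start in dead_ends:
            return None  # already retraced our steps from here once
        # Try the longest candidate word first, then shorter ones.
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                rest = walk(end)
                if rest is not None:
                    return [word] + rest
        dead_ends.add(start)  # nothing worked from this offset
        return None

    return walk(0)


print(segment("thereallibrary", WORDS))  # ['the', 'real', 'library']
print(segment("aboutus", WORDS))         # ['about', 'us']
print(segment("xyz", WORDS))             # None
```

On "thereallibrary" this first commits to "there", gets as far as "all" and "i", fails on "brary", and only then backs up to "the", which is the same path as the rough example. Remembering the dead-end offsets keeps a single token from being re-explored over and over, but running this on every token in an index would still be costly.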