Date: Tue, 4 Aug 2009 11:27:43 -0400 (EDT)
Subject: Re: Searching doubt
From: darren@ontrenet.com
To: java-user@lucene.apache.org

Just catching this thread, but if I understand what is being asked, I can
share how I do multi-word phrase matching. If that's not what's wanted,
pardons!

OK, I load an entire dictionary into a Lucene index, phrases and all. When
I'm scanning some text, I look up one word at a time in this dictionary
index, matching the word _at the beginning_ of the indexed field only. This
returns all words/phrases beginning with the word I searched for. I then
scan the rest of the input text and compare it against those results,
keeping the longest phrase that actually matches the input. That then
becomes a meaningful token.

Input text: "The President of the United States lives in the White House"

Tokens: "The" "President of the United States" "lives" "in" "the" "White House"

Term: "President"
Result: "President of a Company" "President" "President of the United States"

Take the longest match.

HTH,
Darren

> On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
>> 2) Use a dictionary (real dictionary), and search it for every
>> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a match,
>> split it there. This needs some fine tuning, like checking if the rest
>> is also a word and if the full string is also a word, so that you don't
>> break up meaningful words. You'll need to get a dictionary for that.
>
> I do not have a solution to this, but it strikes me as very similar to
> the way you traverse Japanese to break words, since that has no
> spaces. Is there a Japanese tokenizer and, if so, does it handle this?
> If so, you could replace the Japanese dictionary with an English
> dictionary. Just a random thought I had that might or might not help.
>
> Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
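
For anyone reading the archive, here is a rough sketch of the longest-match idea Darren describes in the message above. It is not his code. To keep it self-contained, a plain in-memory map keyed by first word stands in for the Lucene dictionary index and its "starts with this word" lookup; that stand-in, the class name LongestPhraseMatcher, the whitespace splitting, and the case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class LongestPhraseMatcher {

    // Assumed stand-in for the dictionary index:
    // lower-cased first word -> every dictionary entry starting with that word.
    private final Map<String, List<String[]>> byFirstWord = new HashMap<>();

    public LongestPhraseMatcher(List<String> dictionary) {
        for (String entry : dictionary) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank dictionary entries
            }
            String[] words = trimmed.split("\\s+");
            byFirstWord
                .computeIfAbsent(words[0].toLowerCase(), k -> new ArrayList<>())
                .add(words);
        }
    }

    // Splits text on whitespace, then merges runs of words that match the
    // longest dictionary phrase starting at the current word.
    public List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        String[] input = trimmed.split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length) {
            int best = 1; // fall back to the single word when no phrase matches
            List<String[]> candidates = byFirstWord.getOrDefault(
                input[i].toLowerCase(), Collections.<String[]>emptyList());
            for (String[] candidate : candidates) {
                if (candidate.length > best && matchesAt(input, i, candidate)) {
                    best = candidate.length;
                }
            }
            tokens.add(String.join(" ", Arrays.copyOfRange(input, i, i + best)));
            i += best; // skip past the words consumed by the phrase
        }
        return tokens;
    }

    // True when every word of candidate equals the input words from start on.
    private static boolean matchesAt(String[] input, int start, String[] candidate) {
        if (start + candidate.length > input.length) {
            return false;
        }
        for (int j = 0; j < candidate.length; j++) {
            if (!input[start + j].equalsIgnoreCase(candidate[j])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        LongestPhraseMatcher matcher = new LongestPhraseMatcher(Arrays.asList(
            "President of a Company", "President",
            "President of the United States", "White House"));
        System.out.println(matcher.tokenize(
            "The President of the United States lives in the White House"));
        // prints [The, President of the United States, lives, in, the, White House]
    }
}
```

Note that "President of a Company" comes back from the lookup for "President" but is rejected because the following input words differ, so only "President of the United States" survives as the longest real match. With an actual index, the map lookup would be replaced by whatever query returns the entries beginning with the current word.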