Date: Tue, 4 Aug 2009 11:33:46 -0400 (EDT)
Subject: Re: Searching doubt
From: darren@ontrenet.com
To: java-user@lucene.apache.org

Ahhhh, ok. Interesting problem there as well. I'll think on that one some
too!

cheers.

> Hi Darren,
>
> The question was how, given a string "aboutus" in a document, you can
> return that document as a result to the query "about us" (note the space).
> So we're mostly discussing how to detect and then break the word "aboutus"
> into two words.
>
> What you wrote though seems interesting as well, only I think not related
> to Harig's original question. Maybe he'll be interested in that too though.
>
> Shai
>
> On Tue, Aug 4, 2009 at 6:27 PM, wrote:
>
>> Just catching this thread, but if I understand what is being asked I can
>> share how I do multi-word phrase matching. If that's not what's wanted,
>> pardons!
>>
>> Ok, I load an entire dictionary into a Lucene index, phrases and all.
>>
>> When I'm scanning some text, I do lookups in this dictionary index using
>> one word at a time, with the word _at the beginning_ of the indexed field
>> only. This returns all words/phrases beginning with the word I searched
>> for.
>>
>> I then scan the rest of the input text and compare it to the longest
>> matching phrase in my Lucene results. That then becomes a meaningful
>> token.
>>
>> Input text:
>> "The President of the United States lives in the White House"
>>
>> Tokens:
>> "The"
>> "President of the United States"
>> "lives"
>> "in"
>> "the"
>> "White House"
>>
>> Term: "President"
>> Result:
>> "President of a Company"
>> "President"
>> "President of the United States"
>>
>> Take the longest match.
>>
>> HTH,
>> Darren
>>
>> > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
>> >> 2) Use a dictionary (real dictionary), and search it for every
>> >> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a match,
>> >> split it there. This needs some fine tuning, like checking if the rest
>> >> is also a word and if the full string is also a word, so that you
>> >> don't break up meaningful words. You'll need to get a dictionary for
>> >> that.
>> >
>> > I do not have a solution to this, but it strikes me as very similar to
>> > the way you traverse Japanese to break words, since that has no
>> > spaces. Is there a Japanese tokenizer and, if so, does it handle this?
>> > If so, you could replace the Japanese dictionary with an English
>> > dictionary. Just a random thought I had that might / might not help.
>> >
>> > Phil
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
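
The longest-match phrase tokenization described in the quoted message above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain in-memory dict stands in for the Lucene dictionary index, matching is case-insensitive on whitespace-separated words, and the names build_phrase_index and tokenize are hypothetical, not part of Lucene or of Darren's code.

```python
from typing import Dict, Iterable, List


def build_phrase_index(phrases: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each lowercased first word to the dictionary entries starting with it.

    Assumption: this dict stands in for the Lucene lookup that matches a word
    at the beginning of the indexed field.
    """
    index: Dict[str, List[List[str]]] = {}
    for phrase in phrases:
        words = phrase.split()
        if words:
            index.setdefault(words[0].lower(), []).append(words)
    return index


def tokenize(text: str, index: Dict[str, List[List[str]]]) -> List[str]:
    """Greedy tokenization: at each word, take the longest matching phrase."""
    words = text.split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        best = 1  # fall back to the single word when no phrase matches
        for candidate in index.get(words[i].lower(), []):
            n = len(candidate)
            window = [w.lower() for w in words[i:i + n]]
            if n > best and window == [c.lower() for c in candidate]:
                best = n
        tokens.append(" ".join(words[i:i + best]))
        i += best
    return tokens


dictionary = [
    "President",
    "President of a Company",
    "President of the United States",
    "White House",
]
index = build_phrase_index(dictionary)
print(tokenize("The President of the United States lives in the White House", index))
```

With the example dictionary, the lookup for "President" returns three candidates and only the longest one that actually matches the following input words is kept, which mirrors the "take the longest match" step.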
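
Shai's dictionary-based suggestion for breaking "aboutus" into "about us" could look something like the sketch below. It assumes a small in-memory set of known words in place of a real dictionary, only tries to split into exactly two words, and uses a hypothetical function name split_compound; the fine tuning mentioned in the thread (ranking several possible splits, for instance) is not attempted.

```python
from typing import List, Set


def split_compound(word: str, dictionary: Set[str]) -> List[str]:
    """Split `word` into two dictionary words, or return it unchanged.

    A word that is itself in the dictionary is never split, so meaningful
    words are not broken up.
    """
    if word in dictionary:
        return [word]
    # Try every prefix in turn: "a", "ab", "abo", ... as in the thread.
    for cut in range(1, len(word)):
        prefix, rest = word[:cut], word[cut:]
        # Only split when the remainder is also a known word.
        if prefix in dictionary and rest in dictionary:
            return [prefix, rest]
    return [word]


words = {"a", "about", "us", "contact"}
print(split_compound("aboutus", words))
print(split_compound("about", words))
```

Here the prefix "a" is found first, but "boutus" is not a word, so the scan continues until "about" + "us" both match.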