Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Date: Tue, 4 Aug 2009 11:33:46 -0400 (EDT)
Subject: Re: Searching doubt
From: darren@ontrenet.com
To: java-user@lucene.apache.org
Cc: java-user@lucene.apache.org

Ahhhh, ok. Interesting problem there as well.

I'll think on that one some too!

cheers.

> Hi Darren,
>
> The question was how, given a string "aboutus" in a document, you can
> return that document as a result for the query "about us" (note the
> space). So we're mostly discussing how to detect and then break the word
> "aboutus" into two words.
>
> What you wrote seems interesting as well, though I don't think it is
> related to Harig's original question. Maybe he'll be interested in that
> too.
>
> Shai
>
> On Tue, Aug 4, 2009 at 6:27 PM, <darren@ontrenet.com> wrote:
>
>> Just catching this thread, but if I understand what is being asked I can
>> share how I do multi-word phrase matching. If that's not what's wanted,
>> pardons!
>>
>> Ok, I load an entire dictionary into a Lucene index, phrases and all.
>>
>> When I'm scanning some text, I do lookups in this dictionary index using
>> one word at a time with the word _at the beginning_ of the indexed field
>> only. This returns all words/phrases beginning with the word I searched
>> for.
>>
>> I then scan the rest of the input text and compare it to the longest
>> matching phrase in my Lucene results. That then becomes a meaningful
>> token.
>>
>> Input text:
>> "The President of the United States lives in the White House"
>>
>> Tokens:
>> "The"
>> "President of the United States"
>> "lives"
>> "in"
>> "the"
>> "White House"
>>
>> Term: "President"
>> Result:
>> "President of a Company"
>> "President"
>> "President of the United States"
>>
>> Take the longest match.
>>
>> HTH,
>> Darren
>>
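
The longest-match idea quoted above could be sketched roughly as follows.
This is only an illustration under stated assumptions: a plain in-memory
map keyed by first word stands in for the Lucene dictionary index, and the
class and method names (LongestPhraseTokenizer, tokenize, matchesAt) are
made up for the example. Matching is case-insensitive and splits on
whitespace only, which the original description does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: an in-memory map replaces the Lucene dictionary index.
final class LongestPhraseTokenizer {

    // Dictionary entries grouped by their (lower-cased) first word.
    private final Map<String, List<String[]>> byFirstWord = new HashMap<>();

    LongestPhraseTokenizer(Collection<String> dictionary) {
        for (String entry : dictionary) {
            String[] words = entry.trim().toLowerCase(Locale.ROOT).split("\\s+");
            byFirstWord.computeIfAbsent(words[0], k -> new ArrayList<>()).add(words);
        }
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return tokens;
        }
        String[] words = text.trim().split("\\s+");
        int i = 0;
        while (i < words.length) {
            // Look up every entry that begins with the current word.
            List<String[]> candidates = byFirstWord.getOrDefault(
                    words[i].toLowerCase(Locale.ROOT), Collections.emptyList());
            int best = 1; // fall back to the single word when nothing matches
            for (String[] candidate : candidates) {
                if (candidate.length > best && matchesAt(words, i, candidate)) {
                    best = candidate.length; // keep the longest match
                }
            }
            tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + best)));
            i += best;
        }
        return tokens;
    }

    // True when the candidate phrase equals the input words starting at 'start'.
    private static boolean matchesAt(String[] words, int start, String[] candidate) {
        if (start + candidate.length > words.length) {
            return false;
        }
        for (int k = 0; k < candidate.length; k++) {
            if (!words[start + k].toLowerCase(Locale.ROOT).equals(candidate[k])) {
                return false;
            }
        }
        return true;
    }
}
```

With a dictionary holding "President", "President of a Company",
"President of the United States" and "White House", tokenizing the example
sentence should give the six tokens listed above. Against a real index the
per-word map lookup would become a query for entries starting with that
word.
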
>>
>>
>> > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera<serera@gmail.com> wrote:
>> >> 2) Use a dictionary (real dictionary), and search it for every
>> >> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a
>> >> match, split it there. This needs some fine tuning, like checking
>> >> if the rest is also a word and if the full string is also a word,
>> >> so that you don't break up meaningful words. You'll need to get a
>> >> dictionary for that.
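
A minimal sketch of the dictionary split described above, under some
assumptions of my own: the dictionary is a plain Set of lower-cased words,
the class and method names (CompoundSplitter, split) are invented for the
example, and only two-way splits are attempted. It also tries the longest
prefix first rather than the shortest, so that "about" + "us" wins over a
split starting at "a".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: splits a run-together token into two dictionary words.
final class CompoundSplitter {

    // 'dictionary' is assumed to contain lower-cased words.
    static List<String> split(String token, Set<String> dictionary) {
        String lower = token.toLowerCase(Locale.ROOT);

        // The full string is itself a word: do not break it up.
        if (dictionary.contains(lower)) {
            return Collections.singletonList(token);
        }

        // Try every split point, longest prefix first.
        for (int cut = lower.length() - 1; cut >= 1; cut--) {
            String head = lower.substring(0, cut);
            String tail = lower.substring(cut);
            // Split only when both halves are dictionary words.
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return Arrays.asList(head, tail); // parts are returned lower-cased
            }
        }

        // No valid split found: keep the token as it is.
        return Collections.singletonList(token);
    }
}
```

Given a dictionary containing "about" and "us", split("aboutus", dictionary)
returns the two words, which could then be indexed alongside the original
token so that a query for "about us" finds the document. A word such as
"together" stays whole as long as it is in the dictionary, even if "to",
"get" and "her" are as well.
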
>> >
>> > I do not have a solution to this, but it strikes me as very similar to
>> > the way you traverse Japanese to break words, since that has no
>> > spaces. Is there a Japanese tokenizer and, if so, does it handle this?
>> > If so, you could replace the Japanese dictionary with an English
>> > dictionary. Just a random thought that might / might not help.
>> >
>> > Phil
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>>
>>
>

