Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <9cafbc680908040837t5e113320p1f8a050848530610@mail.gmail.com>
References: <24802552.post@talk.nabble.com>
	 <786fde50908032354m656b2260u6f1fa44feee1987c@mail.gmail.com>
	 <24803560.post@talk.nabble.com>
	 <786fde50908040128p67c2ca6en3c23ad7550280a1b@mail.gmail.com>
	 <24805609.post@talk.nabble.com>
	 <786fde50908040356v15db635buf3063d5b7a45a5f1@mail.gmail.com>
	 <9cafbc680908040819v17d76a27u7c256b7065c815bf@mail.gmail.com>
	 <39397.38.103.17.250.1249399663.squirrel@webmail7.pair.com>
	 <786fde50908040831x7fda78dcnf7b5136ad54c924e@mail.gmail.com>
	 <9cafbc680908040837t5e113320p1f8a050848530610@mail.gmail.com>
Date: Tue, 4 Aug 2009 18:42:51 +0300
Message-ID: <786fde50908040842m2e4a6034k8a61027265f65b48@mail.gmail.com>
Subject: Re: Searching doubt
From: Shai Erera <serera@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e6d63f8a43ba33047052bd22

--0016e6d63f8a43ba33047052bd22
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Interesting ... I don't have access to a Japanese dictionary, so I just
extract bi-grams. But I guess that in this case, if one can access an
English dictionary (are you aware of an open-source or free one, BTW?),
one can use the method you mention.
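
To make the bi-gram fallback concrete, here is a minimal sketch, assuming
"bi-grams" means overlapping two-character windows over the text. The class
and method names are made up for illustration; this is not a Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class BigramExtractor {
    // Returns every overlapping pair of adjacent chars in the input.
    // Works on UTF-16 chars, so surrogate pairs are not handled.
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }
}
```

No dictionary is needed for this, which is why it is the fallback when none
is available.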

But still, doing this for every token you meet is extremely expensive (for
Japanese that is all you can do, but this case is rather special), so I'd
first make sure I can pinpoint the very small number of possible tokens I
should process like that.
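
One possible way to do that pinpointing, sketched under my own assumptions:
only try to split a token if it is not itself a dictionary word and it is
long enough to contain two words. The criteria, the minLength parameter and
the class name are all hypothetical, not something Lucene provides.

```java
import java.util.Set;

// Hypothetical pre-filter: decides whether a token is worth the
// expensive dictionary-based splitting at all.
final class SplitCandidateFilter {
    private final Set<String> dictionary;
    private final int minLength; // assumed threshold, tune for your data

    SplitCandidateFilter(Set<String> dictionary, int minLength) {
        this.dictionary = dictionary;
        this.minLength = minLength;
    }

    // A token is a candidate only if it is unknown to the dictionary
    // and at least minLength characters long.
    boolean isCandidate(String token) {
        return token.length() >= minLength && !dictionary.contains(token);
    }
}
```

With a dictionary containing "about" and "us" and minLength 4, "aboutus"
would be a candidate while "about" would be skipped.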

Shai

On Tue, Aug 4, 2009 at 6:37 PM, Phil Whelan <phil123@gmail.com> wrote:

> On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera<serera@gmail.com> wrote:
> > Hi Darren,
> >
> > The question was, how given a string "aboutus" in a document, you can
> return
> > that document as a result to the query "about us" (note the space). So
> we're
> > mostly discussing how to detect and then break the word "aboutus" to two
> > words.
>
> When traversing Japanese text you have to use an algorithm similar to
> searching a maze (keep left and retrace your steps). It's possible to
> go a long way along a sentence before you find that the tokens you've
> already picked out are invalid. Rough example...
>
> thereallibrary
> there allibrary
> there all i brary (fail)
> the reallibrary
> the real library

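The maze-style search in the quoted example could be sketched roughly as
follows. This is my own illustration, not Phil's code and not a Lucene API:
it assumes an in-memory Set<String> dictionary, tries the longest candidate
first, and retraces when it hits a dead end. The class name, method names
and the dead-end memo are my additions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary-based splitter using depth-first backtracking.
final class WordSplitter {
    private final Set<String> dictionary;

    WordSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns the first full segmentation found, or an empty list if none.
    List<String> split(String text) {
        List<String> words = new ArrayList<>();
        // failed[i] marks positions from which no segmentation exists,
        // so the same dead end is never explored twice.
        boolean[] failed = new boolean[text.length() + 1];
        return splitFrom(text, 0, words, failed) ? words : new ArrayList<>();
    }

    private boolean splitFrom(String text, int start,
                              List<String> words, boolean[] failed) {
        if (start == text.length()) {
            return true; // consumed the whole string
        }
        if (failed[start]) {
            return false;
        }
        // Try the longest candidate first, then progressively shorter ones.
        for (int end = text.length(); end > start; end--) {
            String candidate = text.substring(start, end);
            if (dictionary.contains(candidate)) {
                words.add(candidate);
                if (splitFrom(text, end, words, failed)) {
                    return true;
                }
                words.remove(words.size() - 1); // retrace this step
            }
        }
        failed[start] = true;
        return false;
    }
}
```

With a dictionary holding "the", "there", "real", "all", "i" and "library",
this first tries "there", gets stuck at "brary", backs up, and settles on
"the real library", much like the trace in the quoted example.
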
--0016e6d63f8a43ba33047052bd22--