lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Searching doubt
Date Tue, 04 Aug 2009 19:55:25 GMT
I suggested that in my first response, but I think Harig's problem is
that those words are not known in advance. Converting the query
"about us" to "aboutus" is simple, but what about queries
like "united states", or "united states of america"? Should they always be
'grouped' together? And if the query contains more than 2 words,
should every pair be grouped? That leads to on the order of n^2 terms to
process for each query, which is very expensive.
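
To make the cost concrete, here is a minimal sketch in plain Python (not
Lucene code; the function name joined_variants is hypothetical). It assumes
that "grouping" means joining every contiguous run of two or more query
words. For n words that gives n(n-1)/2 extra terms, which is where the
quadratic growth comes from.

```python
def joined_variants(query):
    """Return every contiguous run of 2+ query words, concatenated."""
    words = query.split()
    variants = []
    for start in range(len(words)):
        # end is exclusive, so start + 2 is the shortest run (two words)
        for end in range(start + 2, len(words) + 1):
            variants.append("".join(words[start:end]))
    return variants


if __name__ == "__main__":
    for q in ("about us", "united states of america"):
        print(q, "->", joined_variants(q))
```

A two-word query yields one extra term, a four-word query already yields six.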

The dictionary approach can work in both directions. I.e., query("about us")
--> query("aboutus", "about us") can work as well if someone creates a
mapping from "about us" to "aboutus". But this dictionary needs to be built
in advance, or at least extended as you encounter more cases.
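
A rough sketch of that query-side mapping, again in plain Python rather than
Lucene. The names JOINED_FORMS and expand_query are made up for illustration,
and the two entries are just the examples mentioned in this thread.

```python
# Hypothetical hand-built mapping; it would grow as new cases are found.
JOINED_FORMS = {
    "about us": "aboutus",
    "credit card": "creditcard",
}


def expand_query(query, mapping=JOINED_FORMS):
    """Return the normalized query plus its joined form, if one is known."""
    # Lowercase and collapse whitespace so "About  Us" matches "about us".
    normalized = " ".join(query.lower().split())
    joined = mapping.get(normalized)
    return [normalized] if joined is None else [normalized, joined]


if __name__ == "__main__":
    print(expand_query("about us"))
    print(expand_query("united states"))
```

Queries with no entry pass through unchanged, which is the limitation
described above: the mapping only helps for cases someone has already added.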

Harig - I don't think there's a magic solution here. If you don't know what
to expect, you can only do a "best effort" solution. There were a couple of
suggestions proposed, and I think you should evaluate them and decide what
can work for you. If you prefer any particular solution over the others,
then let us know and we'll try to help further.

Of all the solutions proposed, my personal preference is to use a
dictionary and, during indexing, try to break those words according to
dictionary terms. If that's not possible (e.g. there isn't a dictionary you
can use), you can try to break the words into substrings. But I would
definitely go for a solution on the indexing side, since usually we
prefer to pay more during indexing than during search.
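
As an illustration of the indexing-side idea, here is a small sketch of
dictionary-based word breaking in plain Python. It is not a Lucene
TokenFilter; break_word and the sample dictionary are hypothetical, and a
real implementation would plug into the analysis chain and probably index
the original token alongside the parts.

```python
def break_word(token, dictionary):
    """Split token into dictionary words; return [token] if no full split exists."""
    # best[i] holds one split of token[:i] into dictionary words, or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in dictionary:
                best[end] = best[start] + [piece]
                break
    return best[-1] if best[-1] else [token]


if __name__ == "__main__":
    words = {"about", "us", "credit", "card"}
    print(break_word("aboutus", words))
    print(break_word("lucene", words))
```

Tokens that cannot be fully covered by dictionary words are left as they
are, so the quality of the result depends entirely on the dictionary.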

BTW, another solution that comes to mind is to use an NGram Analyzer, which
extracts 'grams' of words. So the word "aboutus" will be converted
(using N=3) to: "abo", "bou", "out", "utu", "tus". Then the query "about us"
will be converted (using N=3 again) to: "abo", "bou", "out", "us" (assuming
you split words on whitespace) and this document will match. The downside of
this is that other documents will match too (e.g. documents with the word
"out"), but you can improve the scoring by running the following query:
"about us \"about\" \"us\"". Notice that the last two words are each quoted as
a phrase, so docs that contain the word "about" will be scored higher than
docs that contain just the word "out". Also, notice that with this approach,
OR is the recommended default search operator, as requiring AND on
NGrams may hurt recall severely.
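
The gram lists above can be reproduced with a few lines of plain Python.
This only illustrates the idea and is not how a Lucene analyzer is wired up.
The function names are hypothetical, and keeping words shorter than N whole
(so that "us" survives) is an assumption that follows the example above; an
actual analyzer may treat short words differently.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of word; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def query_grams(query, n=3):
    """Split the query on whitespace and collect the n-grams of each word."""
    grams = []
    for word in query.split():
        grams.extend(char_ngrams(word, n))
    return grams


if __name__ == "__main__":
    doc = set(char_ngrams("aboutus"))
    query = query_grams("about us")
    print(sorted(doc))
    print(query)
    # Grams shared by the document and the query (OR semantics).
    print([g for g in query if g in doc])
```

Three of the four query grams occur in the document, so it matches under OR.
Under AND the "us" gram would have to be present too, and it is not, which
is the recall problem mentioned above.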

Shai

On Tue, Aug 4, 2009 at 7:34 PM, N Hira <nhira@cognocys.com> wrote:

>
> Good summary, Shai.
>
> I've missed some of this thread as well, but does anyone know what happened
> to the suggestion about query manipulation?
>
> e.g., query (about us) => query("about us", "aboutus")
>    query(credit card) => query("credit card", "creditcard")
>
> Regards,
>
> -h
>
>
>
> ----- Original Message ----
> From: Shai Erera <serera@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, August 4, 2009 10:31:46 AM
> Subject: Re: Searching doubt
>
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can
> return that document as a result to the query "about us" (note the space).
> So we're mostly discussing how to detect and then break the word "aboutus"
> to two words.
>
> What you wrote though seems interesting as well, only I think not related
> to Harig's original question. Maybe he'll be interested in that too though.
>
> Shai
>
