Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <359a92830702191101h15357498h20aea6458118c0bc@mail.gmail.com>
Date: Mon, 19 Feb 2007 14:01:31 -0500
From: "Erick Erickson" <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Search in all fields
In-Reply-To: 
 <BC1CFC544422CB4A934908E126A4742705DCD3F3@SGBLOW2101.wsatkins.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_41622_4177063.1171911691741"
References: <BC1CFC544422CB4A934908E126A4742705DCD3F3@SGBLOW2101.wsatkins.com>

------=_Part_41622_4177063.1171911691741
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Sure. Convert your simple queries into span queries (which are also
relatively simple). Then, when you index everything in the "all" field,
subclass your analyzer to return a large PositionIncrementGap. Explaining
how this works with words is awkward, so....

doc.add("all", "one two three");
doc.add("all", "four five six");
doc.add("all", "seven eight nine");
index the document.

Assume you've implemented an analyzer that returns 1000 for
getPositionIncrementGap.

Now, the term positions in the single document will be
one - 0
two - 1
three - 2
four - 1003
five - 1004
six - 1005
seven - 2006
eight - 2007
nine - 2008
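
If it helps to see the arithmetic spelled out, here is a rough sketch in
plain Java. It does not call Lucene at all; the class and method names
(PositionGapSketch, positions) are made up for illustration, it assumes
whitespace tokenization, and it assumes every token is distinct so that a
map can hold the result.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of how positions are assigned when the same field is
// added several times. This is NOT Lucene code, only the arithmetic.
final class PositionGapSketch {

    // Returns token -> position for all values added to one field,
    // inserting `gap` extra positions between consecutive values.
    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int next = 0;            // position the next token will receive
        boolean first = true;
        for (String value : values) {
            if (!first) {
                next += gap;     // plays the role of the position increment gap
            }
            first = false;
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    result.put(token, next++);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values =
            Arrays.asList("one two three", "four five six", "seven eight nine");
        positions(values, 1000).forEach((t, p) -> System.out.println(t + " - " + p));
    }
}
```

With the three values from the example and a gap of 1000, this gives the
same numbers as the listing above.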

Now, if you use SpanNearQuery with a slop of 900 (i.e. "one nine"~900) you
won't get a match because the "distance" between one and nine is more than
900. But "one three"~900 will match.
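
As a rough illustration of why the gap keeps matches inside one field
value, here is a simplified check in plain Java. It is only a model for two
single-term clauses (the names SlopSketch and withinSlop are hypothetical);
real span matching in Lucene also handles ordering and multi-term clauses,
which this ignores.

```java
// Hypothetical, simplified model of a two-term proximity check.
// NOT Lucene's implementation: it only counts the positions that lie
// strictly between the two terms and compares that count with the slop.
final class SlopSketch {

    static boolean withinSlop(int posA, int posB, int slop) {
        int between = Math.abs(posA - posB) - 1;  // positions separating the terms
        return between <= slop;
    }

    public static void main(String[] args) {
        // Positions taken from the example: one = 0, three = 2, nine = 2008.
        System.out.println("one/three: " + withinSlop(0, 2, 900));
        System.out.println("one/nine:  " + withinSlop(0, 2008, 900));
    }
}
```

Under this model "three" (2) and "four" (1003) are also too far apart for a
slop of 900, which is the point of choosing a gap larger than the slop.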

It's possible to transform any query into a set of span queries. See the
thread "Multiword Highlighting" that Mark Miller and I were exchanging ideas
on recently. Be aware that the code we were talking about needs a
modification before it can be used on a "regular" index: it has to pay
attention to the document that each sub-clause comes from. The code, as
written, assumes you're using a MemoryIndex for one and only one document,
so unless you need complex queries, I'd just think about rewriting simple
queries with ANDs as a SpanNearQuery.

Best
Erick

On 2/19/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
>
> Hi All,
>
> I want to be able to do a search for a term in all fields in a document.
>
>
> One way this can be done is to put every element of a document in the
> default field (or I guess any other single named field) as well as
> separate fields in which those elements belong.  So for example if for
> my documents I had the following fields:
>
> A, B, C, D and E
>
> I would then set up a field called
>
> All
>
> and, for every document I processed, as well as putting its elements
> in A, B, C, D and E, I would also put them as a concatenation into
> All.
>
> One problem with this is that if for a particular document I had these
> values for my five fields:
>
> A -> Hello
> B -> How
> C -> Are
> D -> You
> E -> Mate
> (All -> Hello How Are You Mate)
>
> Then a search for "How Are You" in All would match even though no single
> field contains this string, which is not ideal.
>
> Another problem with this is that it would double the size of the index
> (unless Lucene does something clever here).
>
> A way to solve the original issue is to convert the search for "How Are
> You" into this:
>
> A:"How Are You" OR B:"How Are You" OR C:"How Are You" OR
> D:"How Are You" OR E:"How Are You"
>
> This solves both the problems of the solution where we set up the All
> field (viz. increasing the size of the index and  bringing back more
> results than we should).
>
> However, this solution also has its drawback, and that is that now we
> have gone from a simple query to a complex ORing of all fields in the
> document.
>
> My question is this: is there a third way?
>
> Cheers
>
> Sachin
>

------=_Part_41622_4177063.1171911691741--