Message-ID: <359a92830702191101h15357498h20aea6458118c0bc@mail.gmail.com>
Date: Mon, 19 Feb 2007 14:01:31 -0500
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: Search in all fields

Sure. Convert your simple queries into span queries (which are also
relatively simple). Then, when you index everything in the "all" field,
subclass your analyzer so that getPositionIncrementGap returns a large
value.

Explaining how this works with words is awkward, so...

doc.add("all", "one two three");
doc.add("all", "four five six");
doc.add("all", "seven eight nine");

Index the document.

Assume you've implemented an analyzer that returns 1000 for
getPositionIncrementGap. Now the term positions in the single document
will be:

one   - 0
two   - 1
three - 2
four  - 1003
five  - 1004
six   - 1005
seven - 2006
eight - 2007
nine  - 2008

Now, if you use a SpanNearQuery with a slop of 900 (i.e. "one nine"~900)
you won't get a match, because the "distance" between one and nine is
more than 900. But "one three"~900 will match.

It's possible to transform any query into a set of span queries. See the
thread "Multiword Highlighting" in which Mark Miller and I were
exchanging ideas recently. Be aware that the code we were discussing
needs a modification before it can be used on a "regular" index: it has
to pay attention to which document each sub-clause comes from. The code,
as written, assumes you're using a MemoryIndex for one and only one
document, so unless you need complex queries, I'd just think about
rewriting simple queries with ANDs as a SpanNearQuery.

Best
Erick

On 2/19/07, Kainth, Sachin wrote:
>
> Hi All,
>
> I want to be able to do a search for a term in all fields in a document.
>
> One way this can be done is to put every element of a document in the
> default field (or, I guess, any other single named field) as well as
> in the separate fields to which those elements belong. So, for example,
> if my documents had the following fields:
>
> A, B, C, D and E
>
> I would then set up a field called
>
> All
>
> Then, for every document I processed, as well as putting its elements
> in A, B, C, D and E, I would also put their concatenation into All.
>
> One problem with this is that if for a particular document I had these
> values for my five fields:
>
> A -> Hello
> B -> How
> C -> Are
> D -> You
> E -> Mate
> (All -> Hello How Are You Mate)
>
> then a search for "How Are You" in All would match even though no
> single field contains this string, which is not ideal.
>
> Another problem with this is that it would double the size of the index
> (unless Lucene does something clever here).
>
> A way to solve the original issue is to convert the search for "How Are
> You" into this:
>
> A:"How Are You" OR B:"How Are You" OR C:"How Are You" OR
> D:"How Are You" OR E:"How Are You"
>
> This solves both problems of the approach that sets up the All field
> (viz. increasing the size of the index and bringing back more results
> than we should).
>
> However, this solution also has its drawback: we have now gone from a
> simple query to a complex ORing across all fields in the document.
>
> My question is this: is there a third way?
>
> Cheers
>
> Sachin
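
The position arithmetic described in the reply above can be sketched in
plain Java. This is only a model under stated assumptions, not Lucene
code: it makes no Lucene calls, the names PositionGapSketch, positions
and near are hypothetical, terms are split on whitespace, each term is
assumed to occur only once, and the slop check (the number of positions
strictly between two terms) is a simplification of how SpanNearQuery
measures distance.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the position arithmetic; it does not call Lucene.
// All names here are hypothetical and exist only for illustration.
class PositionGapSketch {

    // Assign a position to every whitespace-separated term of each value
    // added under the same field name. `gap` plays the role of the value
    // returned by getPositionIncrementGap and is added between values.
    // Assumption: each term occurs only once, so a map is sufficient.
    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int next = 0;
        boolean first = true;
        for (String value : values) {
            if (!first) {
                next += gap; // jump ahead before the next value's terms
            }
            first = false;
            for (String term : value.trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    result.put(term, next++);
                }
            }
        }
        return result;
    }

    // Simplified slop test for two single terms: the number of positions
    // strictly between them must not exceed `slop`.
    static boolean near(Map<String, Integer> pos, String a, String b, int slop) {
        Integer pa = pos.get(a);
        Integer pb = pos.get(b);
        if (pa == null || pb == null) {
            return false; // a term that was never indexed cannot match
        }
        return Math.abs(pa - pb) - 1 <= slop;
    }

    public static void main(String[] args) {
        Map<String, Integer> pos = positions(
                Arrays.asList("one two three", "four five six", "seven eight nine"),
                1000);
        System.out.println(pos);
        System.out.println(near(pos, "one", "three", 900)); // same value
        System.out.println(near(pos, "one", "nine", 900));  // different values
    }
}
```

With a gap of 1000 the model reproduces the position table from the
reply, and any two terms that came from different values are more than
900 positions apart, so a slop of 900 can only match within one value.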