lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kainth, Sachin" <Sachin.Kai...@atkinsglobal.com>
Subject RE: Search in all fields
Date Tue, 20 Feb 2007 09:50:25 GMT
Hi Erick,

I'm not sure I fully understand this.  Would I be right in saying that
with this solution I would still need to create an "all" field which
contains duplicates of data already indexed.  Because if this is so then
we still have the problem of doubling the index size. 

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 19 February 2007 19:02
To: java-user@lucene.apache.org
Subject: Re: Search in all fields

Sure. Convert your simple queries into span queries (which are also
relatively simple). Then, when you index everything in the "all" field,
subclass your analyzer to return a large PositionIncrementGap.
Explaining how this works with words is awkward, so....

doc.add("all", "one two three");
doc.add("all", "four five six");
doc.add("all", "seven eight nine");
index the document.

Assume you've implemented an analyzer that returns 1000 for
getPositionIncrementGap.

Now, the term offsets in the single document will be one - 0 two - 1
three - 2 four 1003 five 1004 six 1005 seven 2006 eight 2007 nine 2008

Now, if you use SpanNearQuery with a slop of 900 (i.e. "one nine"~900)
you won't get a match because the "distance" between one and nine is
more than 900. But "one three"~900 will match.

It's possible to transform any query into a set of span queries, See the
thread "Multiword Highlighting" that Mark Miller and I were exchanging
ideas on recently. Be aware that the code we were talking about has to
have a modification when used on a "regular" index where it pays
attention to the document that each sub-clause comes. The code, as
written, assumes you're using a MemoryIndex for one and only one
document, so unless you need complex queries, I'd just think about
rewriting simple queries with ANDs as a SpanNearQuery.

Best
Erick

On 2/19/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
>
> Hi All,
>
> I want to be able to do a search for a term in all fields in a
document.
>
>
> One way this can be done is to put every element of a document in the 
> default field (or I guess any other single named field) as well as 
> separate fields in which those elements belong.  So for example if for

> my documents I had the following fields:
>
> A, B, C, D and E
>
> If I then set up a field called
>
> All
>
> And for all documents I processed as well as putting the elements of 
> that document in A, B, C, D and E I would also put them as a 
> concatenation into All as well.
>
> One problem with this is that if for a particular document I had these

> values for my five fields:
>
> A -> Hello
> B -> How
> C -> Are
> D -> You
> E -> Mate
> (All -> Hello How Are You Mate)
>
> Then a search for "How Are You" in All would return true when no 
> single field contains this string which is not ideal.
>
> Another problem with this is that it would double the size of the 
> index (unless Lucene does something clever here).
>
> A way to solve the original issue is to convert the search for "How 
> Are You" into this:
>
> A:How Are You OR B:How Are You OR C:How Are You OR D:How Are You OR 
> E:How Are You
>
> This solves both the problems of the solution where we set up the All 
> field (viz. increasing the size of the index and  bringing back more 
> results than we should).
>
> However, this solution also has it's drawback and that is that now we 
> have gone from a simple query to a complex ANDing of all fields in the

> document.
>
> My question is this: is there a third way?
>
> Cheers
>
> Sachin
>
>
> This email and any attached files are confidential and copyright 
> protected. If you are not the addressee, any dissemination of this 
> communication is strictly prohibited. Unless otherwise expressly 
> agreed in writing, nothing stated in this communication shall be
legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins plc.  
> Registered in England No. 1885586.  Registered Office Woodcote Grove, 
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you 
> really need to.
>


This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message