lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Search in all fields
Date Mon, 19 Feb 2007 19:01:31 GMT
Sure. Convert your simple queries into span queries (which are also
relatively simple). Then, when you index everything in the "all" field,
subclass your analyzer to return a large PositionIncrementGap. Explaining
how this works with words is awkward, so....

doc.add("all", "one two three");
doc.add("all", "four five six");
doc.add("all", "seven eight nine");
index the document.

Assume you've implemented an analyzer that returns 1000 for
getPositionIncrementGap.

Now, the term offsets in the single document will be
one - 0
two - 1
three - 2
four 1003
five 1004
six 1005
seven 2006
eight 2007
nine 2008

Now, if you use SpanNearQuery with a slop of 900 (i.e. "one nine"~900) you
won't get a match because the "distance" between one and nine is more than
900. But "one three"~900 will match.

It's possible to transform any query into a set of span queries, See the
thread "Multiword Highlighting" that Mark Miller and I were exchanging ideas
on recently. Be aware that the code we were talking about has to have a
modification when used on a "regular" index where it pays attention to the
document that each sub-clause comes. The code, as written, assumes you're
using a MemoryIndex for one and only one document, so unless you need
complex queries, I'd just think about rewriting simple queries with ANDs as
a SpanNearQuery.

Best
Erick

On 2/19/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
>
> Hi All,
>
> I want to be able to do a search for a term in all fields in a document.
>
>
> One way this can be done is to put every element of a document in the
> default field (or I guess any other single named field) as well as
> separate fields in which those elements belong.  So for example if for
> my documents I had the following fields:
>
> A, B, C, D and E
>
> If I then set up a field called
>
> All
>
> And for all documents I processed as well as putting the elements of
> that document in A, B, C, D and E I would also put them as a
> concatenation into All as well.
>
> One problem with this is that if for a particular document I had these
> values for my five fields:
>
> A -> Hello
> B -> How
> C -> Are
> D -> You
> E -> Mate
> (All -> Hello How Are You Mate)
>
> Then a search for "How Are You" in All would return true when no single
> field contains this string which is not ideal.
>
> Another problem with this is that it would double the size of the index
> (unless Lucene does something clever here).
>
> A way to solve the original issue is to convert the search for "How Are
> You" into this:
>
> A:How Are You OR B:How Are You OR C:How Are You OR D:How Are You OR
> E:How Are You
>
> This solves both the problems of the solution where we set up the All
> field (viz. increasing the size of the index and  bringing back more
> results than we should).
>
> However, this solution also has it's drawback and that is that now we
> have gone from a simple query to a complex ANDing of all fields in the
> document.
>
> My question is this: is there a third way?
>
> Cheers
>
> Sachin
>
>
> This email and any attached files are confidential and copyright
> protected. If you are not the addressee, any dissemination of this
> communication is strictly prohibited. Unless otherwise expressly agreed in
> writing, nothing stated in this communication shall be legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins
> plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove,
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you really
> need to.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message