lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Search in all fields
Date Tue, 20 Feb 2007 13:21:22 GMT
Yes, you've got it right. But this is the classic trade-off, space for speed,
and what you choose to do reflects your particular situation. Doubling the
index size isn't to be done lightly, but how big is your index? If it's 500M,
doubling it to 1G isn't a problem; Lucene handles this quite nicely. I'm
working with an 8G index now. If your index is 10G, well, that's something
else again.

Really, about all you can do is try and measure in your particular
situation, because you don't know ahead of time whether complex queries over
a smaller index perform better or worse than simple queries over a larger
index. It depends on too many variables for anyone to provide an
abstract answer.

And the performance question is not "Is it as fast as it can be?" but "Is
it good enough for our users?", a point that often gets lost in efficiency
discussions.

Best
Erick

On 2/20/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
>
> Hi Erick,
>
> I'm not sure I fully understand this.  Would I be right in saying that
> with this solution I would still need to create an "all" field which
> contains duplicates of data already indexed?  If so, then
> we still have the problem of doubling the index size.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: 19 February 2007 19:02
> To: java-user@lucene.apache.org
> Subject: Re: Search in all fields
>
> Sure. Convert your simple queries into span queries (which are also
> relatively simple). Then, when you index everything in the "all" field,
> subclass your analyzer so that getPositionIncrementGap returns a large
> value. Explaining how this works in words is awkward, so....
>
> doc.add("all", "one two three");
> doc.add("all", "four five six");
> doc.add("all", "seven eight nine");
> index the document.
>
> Assume you've implemented an analyzer that returns 1000 for
> getPositionIncrementGap.
>
> Now, the term positions in the single document will be:
> one 0, two 1, three 2, four 1003, five 1004, six 1005,
> seven 2006, eight 2007, nine 2008
>
> Now, if you use SpanNearQuery with a slop of 900 (i.e. "one nine"~900)
> you won't get a match because the "distance" between one and nine is
> more than 900. But "one three"~900 will match.
>
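
The position arithmetic above can be reproduced without Lucene. What follows is a minimal sketch, not Lucene code: the class `PositionGapSketch` and its methods `positions` and `near` are hypothetical names invented for illustration, and `near` is only a simplified stand-in for what SpanNearQuery does with its slop (it assumes each term occurs once and ignores term order).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this does not use or mirror the Lucene API.
final class PositionGapSketch {

    /**
     * Assigns a position to every whitespace-separated term. Each value after
     * the first starts `gap` positions later than it otherwise would, which is
     * the effect of a position increment gap between values of one field.
     * Assumes every term is unique (a Map keeps one position per term).
     */
    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int next = 0;          // position the next term will receive
        boolean first = true;
        for (String value : values) {
            if (!first) {
                next += gap;   // jump ahead between field values
            }
            first = false;
            for (String term : value.trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    result.put(term, next++);
                }
            }
        }
        return result;
    }

    /**
     * Simplified proximity test: true when both terms exist and the number of
     * positions lying between them does not exceed `slop`.
     */
    static boolean near(Map<String, Integer> pos, String a, String b, int slop) {
        Integer pa = pos.get(a);
        Integer pb = pos.get(b);
        if (pa == null || pb == null) {
            return false;
        }
        return Math.abs(pa - pb) - 1 <= slop;
    }

    public static void main(String[] args) {
        Map<String, Integer> pos = positions(
                Arrays.asList("one two three", "four five six", "seven eight nine"), 1000);
        // prints {one=0, two=1, three=2, four=1003, five=1004, six=1005, seven=2006, eight=2007, nine=2008}
        System.out.println(pos);
        System.out.println(near(pos, "one", "three", 900)); // prints true
        System.out.println(near(pos, "one", "nine", 900));  // prints false
    }
}
```

Any slop smaller than the gap keeps a match inside a single added value, which is why a gap of 1000 pairs with a slop of 900 in the example.
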
> It's possible to transform any query into a set of span queries. See the
> thread "Multiword Highlighting" in which Mark Miller and I were exchanging
> ideas recently. Be aware that the code we were talking about needs a
> modification when used on a "regular" index, so that it pays
> attention to the document that each sub-clause comes from. The code, as
> written, assumes you're using a MemoryIndex for one and only one
> document, so unless you need complex queries, I'd just think about
> rewriting simple queries with ANDs as a SpanNearQuery.
>
> Best
> Erick
>
> On 2/19/07, Kainth, Sachin <Sachin.Kainth@atkinsglobal.com> wrote:
> >
> > Hi All,
> >
> > I want to be able to do a search for a term in all fields in a
> > document.
> >
> >
> > One way this can be done is to put every element of a document in the
> > default field (or I guess any other single named field) as well as
> > separate fields in which those elements belong.  So for example if for
> > my documents I had the following fields:
> >
> > A, B, C, D and E
> >
> > If I then set up a field called
> >
> > All
> >
> > And for every document I processed, as well as putting its elements
> > in A, B, C, D and E, I would also put them as a
> > concatenation into All.
> >
> > One problem with this is that if for a particular document I had these
> > values for my five fields:
> >
> > A -> Hello
> > B -> How
> > C -> Are
> > D -> You
> > E -> Mate
> > (All -> Hello How Are You Mate)
> >
> > Then a search for "How Are You" in All would match even though no
> > single field contains this string, which is not ideal.
> >
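
The false positive described above can be shown with plain strings. This is a rough sketch only: the class `AllFieldFalsePositive` and its methods are hypothetical names, and a case-insensitive substring test stands in for a real phrase query, which works on analyzed tokens rather than raw text.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; a substring test approximates a phrase match.
final class AllFieldFalsePositive {

    /** Naive phrase test: case-insensitive substring match on the raw text. */
    static boolean containsPhrase(String text, String phrase) {
        return text.toLowerCase(Locale.ROOT).contains(phrase.toLowerCase(Locale.ROOT));
    }

    /** True when at least one individual field value contains the phrase. */
    static boolean anyFieldContains(Map<String, String> fields, String phrase) {
        return fields.values().stream().anyMatch(v -> containsPhrase(v, phrase));
    }

    /** Builds the catch-all value by joining the field values with spaces. */
    static String concatenate(Map<String, String> fields) {
        return String.join(" ", fields.values());
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("A", "Hello");
        fields.put("B", "How");
        fields.put("C", "Are");
        fields.put("D", "You");
        fields.put("E", "Mate");

        String all = concatenate(fields);
        System.out.println(all);                                   // prints Hello How Are You Mate
        System.out.println(containsPhrase(all, "How Are You"));    // prints true
        System.out.println(anyFieldContains(fields, "How Are You")); // prints false
    }
}
```
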
> > Another problem with this is that it would double the size of the
> > index (unless Lucene does something clever here).
> >
> > A way to solve the original issue is to convert the search for "How
> > Are You" into this:
> >
> > A:"How Are You" OR B:"How Are You" OR C:"How Are You" OR
> > D:"How Are You" OR E:"How Are You"
> >
> > This solves both the problems of the solution where we set up the All
> > field (viz. increasing the size of the index and  bringing back more
> > results than we should).
> >
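
The per-field expansion can be generated mechanically. Below is a small sketch under stated assumptions: `MultiFieldPhraseQuery` and `expand` are hypothetical names, the output is only a query string in Lucene's query syntax, and the phrase is assumed to contain no quotes or other characters that would need escaping. The phrase is wrapped in double quotes because, in that syntax, a field prefix applies only to the term or quoted phrase that immediately follows it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; produces a query string, not a Lucene Query object.
final class MultiFieldPhraseQuery {

    /**
     * Repeats the quoted phrase once per field and joins the clauses with OR.
     * Assumes the phrase needs no escaping.
     */
    static String expand(List<String> fields, String phrase) {
        String quoted = "\"" + phrase + "\"";
        return fields.stream()
                .map(field -> field + ":" + quoted)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // prints A:"How Are You" OR B:"How Are You" OR C:"How Are You" OR D:"How Are You" OR E:"How Are You"
        System.out.println(expand(Arrays.asList("A", "B", "C", "D", "E"), "How Are You"));
    }
}
```

Lucene also ships a MultiFieldQueryParser that spreads a query across a list of fields, which may make a hand-rolled expansion like this unnecessary.
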
> > However, this solution also has its drawback, which is that now we
> > have gone from a simple query to a complex ORing of all fields in the
> > document.
> >
> > My question is this: is there a third way?
> >
> > Cheers
> >
> > Sachin
