lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes" <steven_par...@esseff.org>
Subject RE: Soliciting Design Thoughts on Date Searching
Date Thu, 01 Mar 2007 22:38:01 GMT
If all you want to do is find docs containing dates within a range, it
probably doesn't make much difference whether you give dates their own
field or put them into your content field. It'll probably be easier to
just add them into the token stream since that's the way the analyzer
architecture wants to work (analyzers generally don't know anything
about fields.) You can make the position increment work if you want, and
it'll make phrase/span queries work better, if you need those to work.

What is going to matter in either case is how you format dates.
Everything in Lucene is text, so if you want to do date ranges (which
you mentioned in your first e-mail), you need to be careful how you
format the dates and what kinds of queries you use. See, for example,
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/DateTo
ols.html
(tinyurl: http://tinyurl.com/ejlvx)
and
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
(tinyurl: http://tinyurl.com/2pubaq)
There are also date filters (as opposed to date queries) that have
different tradeoffs.

Dates are kinda tricky in Lucene.

-----Original Message-----
From: Walt Stoneburner [mailto:walt.stoneburner@gmail.com] 
Sent: Thursday, March 01, 2007 7:54 AM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

Thank you all for the suggestions steering me down the right path.

As an aside, the easy part, at least for me, is extracting the dates
-- Peter was dead on about how doing that: heuristics, multiple
regular expressions, and data structures.  As Steve pointed out, this
isn't as trivial as it sounds - there are a lot of formats, some
ambiguous.

I love writing parsers (guess I'm sick in the head, eh?), so getting
the data isn't the problem, it's knowing what format to convert it
into and how to hand it to Lucene in a way that it'll find meaningful
for searching.

I had pondered making a single field with a value like:
document.add( Field.Text( "dates", "27-Feb-1968,04-Jul-1776,01-Mar-2007"
));
...but I wasn't convinced that the Lucene date Range was going to work
on anything other than a Date type, rather than a string of text that
just coincidently happened to contain dates.

Drawing back on my title example, I was under the incorrect impression
that if I had a field and provided another value that it replaced the
prior value.  Hoss is indicating this is not so, and that I'm safe
adding additional values.
document.add( Field.Text( "title", "Thanks Thomas" ));
document.add( Field.Text( "title", "Thanks Hoss" ) );  // Does not
stomp on Thomas. Yay!

If I can use this technique to pile in a ton of dates, then I'm
totally happy, you guys have pointed me in the right direction;
celebrations all around.

The brain scratcher, for me, was Peter's treating the dates like a
synonym -- a clever way of looking at the problem.  Unfortunately,
that'd be giving me too much credit, as I haven't played with that
feature set of Lucene.  So, without trying to, Peter's sent me
scrambling back to the API for something I wasn't aware was there.

Steve adds to the mystery by suggesting a delimited field list, much
like the example at the top of this message, and likewise doing some
trickery with the token stream and a position increment of zero --
again, a clever solution, and likewise beyond my limited Lucene
experience.

While I know, intellectually, that Lucene is digesting positioned
tokens, it is so well designed that fools like me can legitimately use
Lucene for long periods of time without actually being exposed to
what's happening under the hood.

The ponderance I now contemplate as a newbie (I've downgraded my self
assessment after this discussion) is knowing whether the token-stream
solution or the multiple-add solution is the pedantic one.  Are there
performance advantages to one way over the other?  I'll be totally
stunned if someone offers up that they're logically the same thing.

I swear, conversing with you guys is giving me a very deep sense of
appreciation for your skills and Lucene's capabilities.

-wls

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message