lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: Soliciting Design Thoughts on Date Searching
Date Sun, 04 Mar 2007 23:36:58 GMT
The only thing I'd add to Steve's mail is that it's probably easier
on a conceptual basis to add the dates in as a separate field.
The muliple add method should be simpler. Consider if you just insert
dates as YYYYMMDD, into your token stream and have, say
19990715, 20070122, and 2go. a range from July 15, 1999 to
January 22, 2007 would hit on 2go. I think you'd be far better off to
have a field that contained only dates (multply added) and use
that for your searches. Transform the dates using the DateTools
class so they are lexically ordered. In fact, DateField is
deprecated in Lucene 2.1 (don't know about earlier). All
DateTools really does is transform dates into lexically
orderable strings for indexing, not a separate data type....

If you've figured out the parsing rules (and you're right, you're
certifiable if you like writing parsers <G>) then stuffing them into
a Date object and using DateTools is not going to persent a

And I completely agree with you that Lucene is written in a way
that allows me to ignore 99% of what's going on under the covers
and just get on with solving my problems. I'd add that so far, each
time I need some other tweak, what I usually have to do is stare
at the documentation for a while longer and good things happen, and
if that fails, the guys have been enormously helpful.


On 3/1/07, Steven Parkes <> wrote:
> If all you want to do is find docs containing dates within a range, it
> probably doesn't make much difference whether you give dates their own
> field or put them into your content field. It'll probably be easier to
> just add them into the token stream since that's the way the analyzer
> architecture wants to work (analyzers generally don't know anything
> about fields.) You can make the position increment work if you want, and
> it'll make phrase/span queries work better, if you need those to work.
> What is going to matter in either case is how you format dates.
> Everything in Lucene is text, so if you want to do date ranges (which
> you mentioned in your first e-mail), you need to be careful how you
> format the dates and what kinds of queries you use. See, for example,
> ols.html
> (tinyurl:
> and
> (tinyurl:
> There are also date filters (as opposed to date queries) that have
> different tradeoffs.
> Dates are kinda tricky in Lucene.
> -----Original Message-----
> From: Walt Stoneburner []
> Sent: Thursday, March 01, 2007 7:54 AM
> To:
> Subject: Re: Soliciting Design Thoughts on Date Searching
> Thank you all for the suggestions steering me down the right path.
> As an aside, the easy part, at least for me, is extracting the dates
> -- Peter was dead on about how doing that: heuristics, multiple
> regular expressions, and data structures.  As Steve pointed out, this
> isn't as trivial as it sounds - there are a lot of formats, some
> ambiguous.
> I love writing parsers (guess I'm sick in the head, eh?), so getting
> the data isn't the problem, it's knowing what format to convert it
> into and how to hand it to Lucene in a way that it'll find meaningful
> for searching.
> I had pondered making a single field with a value like:
> document.add( Field.Text( "dates", "27-Feb-1968,04-Jul-1776,01-Mar-2007"
> ));
> ...but I wasn't convinced that the Lucene date Range was going to work
> on anything other than a Date type, rather than a string of text that
> just coincidently happened to contain dates.
> Drawing back on my title example, I was under the incorrect impression
> that if I had a field and provided another value that it replaced the
> prior value.  Hoss is indicating this is not so, and that I'm safe
> adding additional values.
> document.add( Field.Text( "title", "Thanks Thomas" ));
> document.add( Field.Text( "title", "Thanks Hoss" ) );  // Does not
> stomp on Thomas. Yay!
> If I can use this technique to pile in a ton of dates, then I'm
> totally happy, you guys have pointed me in the right direction;
> celebrations all around.
> The brain scratcher, for me, was Peter's treating the dates like a
> synonym -- a clever way of looking at the problem.  Unfortunately,
> that'd be giving me too much credit, as I haven't played with that
> feature set of Lucene.  So, without trying to, Peter's sent me
> scrambling back to the API for something I wasn't aware was there.
> Steve adds to the mystery by suggesting a delimited field list, much
> like the example at the top of this message, and likewise doing some
> trickery with the token stream and a position increment of zero --
> again, a clever solution, and likewise beyond my limited Lucene
> experience.
> While I know, intellectually, that Lucene is digesting positioned
> tokens, it is so well designed that fools like me can legitimately use
> Lucene for long periods of time without actually being exposed to
> what's happening under the hood.
> The ponderance I now contemplate as a newbie (I've downgraded my self
> assessment after this discussion) is knowing whether the token-stream
> solution or the multiple-add solution is the pedantic one.  Are there
> performance advantages to one way over the other?  I'll be totally
> stunned if someone offers up that they're logically the same thing.
> I swear, conversing with you guys is giving me a very deep sense of
> appreciation for your skills and Lucene's capabilities.
> -wls
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message