lucene-java-user mailing list archives

From "Walt Stoneburner" <walt.stonebur...@gmail.com>
Subject Re: Soliciting Design Thoughts on Date Searching
Date Mon, 05 Mar 2007 14:59:12 GMT
Erick / Steve,

  Thank you both (as well as everyone else who weighed in) for helping me get
to a far better solution well before any code was ever slung.

  Since we all know that someone else is going to find this in the archives
some day, I'd like to unveil the rest of my ignorance and misconceptions, on
the off chance that others following in these footsteps may be under the
same delusions about how Lucene operates under the hood.  Please forgive any
moronic questions, no matter how gross in error.

  First, the very existence of the DateField in past versions of Lucene
subtly gave me the impression that Lucene was actually data-type aware.  So
its sudden deprecation was a shocking loss.  What I quickly gathered was
that DateField was nothing more than a helper class, and that its only
function in the universe was to inject properly formatted text dates into
the token stream.  If I'm mistaken or have a nuance wrong, please correct
me, as big errors compound from little misunderstandings.

  Second, and along the same lines because of the data-type goof, I was
operating under the impression that a document had a single token stream
and that the metadata (the other fields) added to the document
were name/value pairs (e.g. creation date, version number, author name).
What I'm picking up is that each field is its own token stream, and this is
why I can have a list of dates.  Am I still on track so far?

  Third, I read from the documentation that the .add() method is simply
appending.  At this point, I'm wondering whether add() tokenizes on each
call, or only when the document is indexed.  This is a matter of slight
importance, especially if one's understanding of the token stream is a
little shaky.

  Document's add() method is described like this:  "Several fields may be
added with the same name. In this case, if the fields are indexed, their
text is treated as though appended for the purposes of search."  The catch
here is in understanding the behavior of "append".  Lucene's documentation
would benefit from an example.

document.add( Field.Text("somefield", "hot" ) );
document.add( Field.Text("somefield", "dog" ) );

  Did I just add "hotdog" to my document, or two tokens, "hot" and "dog"?

  This is important to understand when adding multiple dates.  It had been
suggested that I add the dates as a single value with separators.

document.add( Field.Text("interestingdates", "17760704,20010911,20070305" ) );

  If add() is tokenizing up front, then three calls would be the logical
equivalent, and I wouldn't need to artificially add separators while
looping.

document.add( Field.Text("interestingdates", "17760704" ) );
document.add( Field.Text("interestingdates", "20010911" ) );
document.add( Field.Text("interestingdates", "20070305" ) );

  Are the three adds shown here the same, to Lucene, as the one large add
above with commas?
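
  One way to picture the "tokenize each value, then append" reading of the
documentation is the toy model below.  It is plain Java, not Lucene code:
AppendModel, tokenize() and fieldTokens() are hypothetical names, and the
comma/whitespace splitter is only an assumed stand-in for whatever analyzer
is actually in use.  Under those assumptions the single comma-separated value
and the three separate values produce the same token list; whether Lucene
itself behaves this way is exactly the question being asked.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene code. It assumes each value is
// tokenized on its own and the resulting tokens are appended in order.
final class AppendModel {

    // Hypothetical stand-in for an analyzer: split on commas and whitespace.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.split("[,\\s]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Models several add() calls on the same field name: tokenize each
    // value separately, then append the tokens to one list.
    static List<String> fieldTokens(String... values) {
        List<String> all = new ArrayList<>();
        for (String v : values) {
            all.addAll(tokenize(v));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(fieldTokens("hot", "dog"));
        System.out.println(fieldTokens("17760704,20010911,20070305"));
        System.out.println(fieldTokens("17760704", "20010911", "20070305"));
    }
}
```

  In this model "hot" and "dog" stay two tokens rather than merging into
"hotdog", because nothing ever joins the raw strings together.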


  And finally, the question that pulls all three of these diverse issues
together:  If Lucene is really just looking at a stream of tokenized text,
and dates/text are tokens no different from one another, and I can inject
properly formatted dates into the stream without requiring an honest to
goodness data type or artificial list -- what would prevent one from
detecting a date-looking "thing" in a stream and injecting a date-looking
token at that place in the document?

  For example, if "December 25th, 2007" was in the text stream, replace that
block of text with "20071225" and let the tokenizer do its thing as normal,
none the wiser.  In theory, wouldn't this let the user search the
regular text body for words and dates at the same time?
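
  The replacement step described above could be sketched as a plain text
preprocessing pass that runs before the text is handed to any tokenizer.
This is only a minimal illustration using the standard java.util.regex and
java.time classes: DateTokenNormalizer and normalize() are hypothetical
names, nothing here is a Lucene API, and the pattern assumes English month
names in the "Month day, year" form only.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Month;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: rewrites dates such as
// "December 25th, 2007" to yyyyMMdd before the text reaches an analyzer.
final class DateTokenNormalizer {

    // Month name, day with optional ordinal suffix, optional comma, 4-digit year.
    private static final Pattern LONG_DATE = Pattern.compile(
        "\\b(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)\\s+(\\d{1,2})(?:st|nd|rd|th)?,?\\s+(\\d{4})\\b");

    static String normalize(String text) {
        Matcher m = LONG_DATE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                Month month = Month.valueOf(m.group(1).toUpperCase(Locale.ROOT));
                LocalDate date = LocalDate.of(
                    Integer.parseInt(m.group(3)), month, Integer.parseInt(m.group(2)));
                replacement = date.format(DateTimeFormatter.BASIC_ISO_DATE);
            } catch (DateTimeException e) {
                // Not a real calendar date (e.g. "February 31st"): leave the text alone.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Shipped on December 25th, 2007 as planned."));
    }
}
```

  Note that the original wording is lost once it is replaced, so a search for
the word "December" would no longer match that spot in the body text.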

  What I'm hoping you'll say is, yes, you could do that, ...but you wouldn't
want to for performance reasons.

Thanks again,
-wls
