lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Date Range Query Feature Implementation
Date Mon, 30 Apr 2012 09:28:26 GMT
Hi,

> Hello Uwe, you bring up some very valid points. We did not utilize those
> classes as we were not aware of how to use these classes.

That's not complicated:
- on indexing use NumericField instead of plain text Fields
- on query, instantiate NumericRangeQuery or NumericRangeFilter. The stock
QueryParser cannot create these (maybe that's your problem), but you can
use QueryParser to parse the user-entered term query and then combine it
with the numeric range by wrapping both in a BooleanQuery (as MUST clauses)
or a FilteredQuery. Alternatively, pass a NumericRangeFilter as the second
parameter of the IndexSearcher.search method.

> Also while, it is indeed infinitely faster to use Conjunction scorer, bear
> in mind our algorithm only does an intersection on the set of results
> returned in IndexSearcher, this means that we are perhaps doing
> intersection on a smaller subset.

That is true, but all those "post result-collection" filtering approaches
have several drawbacks:

- You have to increase the size of your result window (the number of top-n
results), depending on the number of documents that *might* be removed. This
raises memory requirements and may also slow down the query (especially if
the filter removes lots of documents, because the result window must be
very large in that case).

- It's not usable for all types of data or queries, because in lots of
cases you don't know whether the top-n results contain any documents that
would survive filtering. One example: imagine the user searches for a very
frequent term that hits lots of results, and some documents are scored very
high (because they seem very important to the scorer based on tf/idf).
If those high-scoring documents don't fall into the date range, the user
will see "no results found", although there are results.
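
The second drawback can be sketched in a few lines of plain Java. This is only
an illustration under stated assumptions, not Lucene code: the Doc class, the
method names, and the sample scores and dates are all hypothetical. It ranks
four hits, keeps the top 2, and applies the date range either after or before
the cut.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PostFilterDemo {
    /** Hypothetical stand-in for a search hit: a relevance score plus a numeric date. */
    static final class Doc {
        final int id;
        final double score;
        final long date;
        Doc(int id, double score, long date) {
            this.id = id;
            this.score = score;
            this.date = date;
        }
    }

    private static final Comparator<Doc> BY_SCORE_DESC =
            Comparator.comparingDouble((Doc d) -> d.score).reversed();

    /** Keeps the top n hits by score first, then drops those outside [min, max]. */
    static List<Integer> filterAfterTopN(List<Doc> hits, int n, long min, long max) {
        List<Doc> ranked = new ArrayList<>(hits);
        ranked.sort(BY_SCORE_DESC);
        List<Integer> ids = new ArrayList<>();
        for (Doc d : ranked.subList(0, Math.min(n, ranked.size()))) {
            if (d.date >= min && d.date <= max) {
                ids.add(d.id);
            }
        }
        return ids;
    }

    /** Drops hits outside [min, max] first, then keeps the top n by score. */
    static List<Integer> filterBeforeTopN(List<Doc> hits, int n, long min, long max) {
        List<Doc> inRange = new ArrayList<>();
        for (Doc d : hits) {
            if (d.date >= min && d.date <= max) {
                inRange.add(d);
            }
        }
        inRange.sort(BY_SCORE_DESC);
        List<Integer> ids = new ArrayList<>();
        for (Doc d : inRange.subList(0, Math.min(n, inRange.size()))) {
            ids.add(d.id);
        }
        return ids;
    }

    /** Made-up data: the two best-scoring hits are old, the two weaker ones are recent. */
    static List<Doc> sampleHits() {
        List<Doc> hits = new ArrayList<>();
        hits.add(new Doc(1, 9.0, 2001));
        hits.add(new Doc(2, 8.5, 2003));
        hits.add(new Doc(3, 2.0, 2011));
        hits.add(new Doc(4, 1.5, 2012));
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> hits = sampleHits();
        System.out.println("filter after top-2:  " + filterAfterTopN(hits, 2, 2010, 2012));
        System.out.println("filter before top-2: " + filterBeforeTopN(hits, 2, 2010, 2012));
    }
}
```

With this data, filtering after the cut returns nothing, because both top-2
hits are outside 2010..2012, while filtering before the cut returns documents
3 and 4. Making the result window larger only hides the problem until the
out-of-range documents outnumber the window again.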

> The methodology described does a query for a date range on an entire
> index, and then does a query for a term on an entire index and then
> intersects those results which may be slower.

It uses ConjunctionScorer, so the query scorers are "advanced", meaning
they do not scan the whole posting lists. The numeric query/filter and the
main query hand over to each other, depending on which scorer provides the
next larger document ID. There is one drawback: NumericRangeQuery still has
to scan the postings if it is used in "Filter" mode, where it builds a
BitSet of matching documents. That is generally fast, but it is still the
biggest cost for numeric queries.
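
The hand-over between the two scorers can be sketched roughly as follows. This
is a simplified, standalone illustration, not Lucene's ConjunctionScorer: the
DocIdIterator class and the intersect method are hypothetical, the doc IDs are
made up, and advance() uses a binary search over an int array where a real
index would use skip data in the postings.

```java
import java.util.ArrayList;
import java.util.List;

class LeapfrogDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical iterator over a sorted array of doc IDs (all below NO_MORE_DOCS). */
    static final class DocIdIterator {
        private final int[] docs; // sorted ascending, no duplicates
        private int pos = -1;     // index of the current doc, -1 before the first advance

        DocIdIterator(int... docs) {
            this.docs = docs;
        }

        /** Moves forward to the first doc ID >= target; entries in between are skipped. */
        int advance(int target) {
            int lo = pos + 1;
            int hi = docs.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (docs[mid] < target) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            pos = lo;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Intersects two iterators by always advancing the one that is behind. */
    static List<Integer> intersect(DocIdIterator a, DocIdIterator b) {
        List<Integer> matches = new ArrayList<>();
        int docA = a.advance(0);
        int docB = b.advance(0);
        while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
            if (docA == docB) {
                matches.add(docA);          // both sides agree on this document
                docA = a.advance(docA + 1);
                docB = b.advance(docB + 1);
            } else if (docA < docB) {
                docA = a.advance(docB);     // a is behind: jump straight to b's doc
            } else {
                docB = b.advance(docA);     // b is behind: jump straight to a's doc
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DocIdIterator termDocs = new DocIdIterator(1, 5, 9, 40, 41, 90);
        DocIdIterator dateDocs = new DocIdIterator(40, 90, 120);
        System.out.println(intersect(termDocs, dateDocs));
    }
}
```

In this sketch the term iterator jumps from doc 1 directly to doc 40 without
visiting 5 and 9, which is the point: neither side has to walk its whole list
when the other side is sparse.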

> I imagine most users don't look beyond the top twenty documents anyway, so
> there is no reason to query the entire index for the subset of documents
> that fit that date range. A "lazy" (term used loosely) loading type of
> solution may be best, because if you really break it down, a date range is
> more like a filter for a set of results, and less of something that you
> have to query against the entire database.

See above.

> Given the aforementioned concepts, perhaps a combination of the two ideas
> may be the best solution for an implementation, I will continue to think
> about it, thank you very much for your input.
>
> Again, these are just some ideas I am throwing around here, I obviously
> can't speak in absolute terms because I do not know Lucene very well, but
> these are some thoughts I am having. Any and all feedback is appreciated,
> and once again,
> 
> Thank you for your input,

I am glad to help!

Uwe

> On Apr 30, 2012, at 3:03 AM, Uwe Schindler wrote:
> 
> > Hi,
> >
> > Thanks for your input. One citation from your report:
> >
> > "These types of searches are uncommon, and thus programmers don't
> > optimize for this case. Lucene, for example, has the ability to filter
> > search results using date-ranges, but it is a slow, naive algorithm
> > implemented through lexographic range searching on a custom field.
> > Which is a user level hack that works ineffectively. There are no
> > known other ways of performing a date range search."
> >
> > Since Lucene 2.9 / Solr 1.4, Lucene can handle numerical ranges
> > without "a slow, naive algorithm", see NumericRangeQuery and
> > NumericField. As every date can be represented as a number (e.g. year
> > as integer, or milliseconds since 1970 as long,...), date searches can
> > be done easily with Lucene (and very fast, because the intersection
> > between the NumericRangeQuery and the TermQuery is done using
> > ConjunctionScorer which does *not* naively iterate the postings).
> >
> > Did you consider this in your implementation?
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> >> -----Original Message-----
> >> From: John Mercouris [mailto:jmercouris@gmail.com]
> >> Sent: Monday, April 30, 2012 9:24 AM
> >> To: dev@lucene.apache.org
> >> Subject: Date Range Query Feature Implementation
> >>
> >> Hello we (John Mercouris & Nick Zivkovic) have implemented date range
> >> search functionality into Lucene as part of a class project. The
> >> implementation is detailed in the PDF attached. The source is available
> >> for download from github at the URL:
> >> git://github.com/cs429-ir/date-range-search.git
> >>
> >> We hope that you find this useful,
> >>
> >> -John & Nick
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For
> > additional commands, e-mail: dev-help@lucene.apache.org
> >