lucene-dev mailing list archives

From Chris Lamprecht <clampre...@gmail.com>
Subject Proposed Lucene modification - FieldCollector
Date Thu, 10 Mar 2005 00:57:24 GMT
I've been reading the recent discussion in lucene-user about
selective/lazy field retrieval and I'd like to propose an idea for a
modification to Lucene, and get some input.

First let me illustrate some common use cases to justify this
modification.  It seems like a common requirement is to collect a
count of how many times a certain value appears in a certain field (or
set of fields), and then display the resulting "Top 10" most popular
field values.  As a concrete example, a user searches for
"vertebrate" and the top 4 values for some field might be "Canine (10),
Feline (7), Primate (4), Horse (2)".  Another example is in Google
Desktop search - when I do a search, the header bar shows me how many
results of each type were a match: "3,819 emails - 1,450 files - 2,206
web history - 0 chats".  One could imagine doing this in Lucene using
a field "doctype" with tag values such as "email", "file", "web",
"chat", etc, and keeping counts.

The problem is performance.  To get complete statistics like
above, you currently have to iterate through the result set and pull
each Document from the Hits.  If you don't need exact stats or if you
only need the rankings (and not the counts), then you may be able to
just iterate through the first few hundred results, but even this has
a significant performance hit.

So my idea is to add something similar to the HitCollector.  The
abstract class might be something like:

public abstract class FieldCollector {
  protected String fieldName;

  public FieldCollector(String name) { this.fieldName = name; }

  /** Returns the field name, so the Searcher knows which fields to pull */
  public String name() { return fieldName; }

  /**
   * Called once for every non-zero scoring document,
   * for the set of field names requested in the call to IndexSearcher.search().
   * Note: fieldValue can be null if this document does not contain this field
   */
  public abstract void collect(int doc, float score, String fieldName,
                               String fieldValue);
}


Then the call to searcher.search() might look something like this:

searcher.search(query, new FieldCollector("doctype") {
  public void collect(int doc, float score, String name, String value) {
    // example usage: keep a tally of how many times each field value occurred
    if (value != null)
      incrementFieldCount(name, value); // defined elsewhere
  }
});
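
For illustration, the incrementFieldCount() helper that the example
leaves "defined elsewhere" could be sketched roughly as below.  This is
only an assumption about how client code might keep the tally: the
FieldTally class and its top() method are hypothetical names, not part
of Lucene, and the sketch uses nothing beyond java.util.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper (not a Lucene class): keeps per-field
// counts of collected values and reports the most frequent ones.
class FieldTally {
    // field name -> (field value -> number of times seen)
    private final Map<String, Map<String, Integer>> countsByField = new HashMap<>();

    /** Record one occurrence of value in the named field; null values are ignored. */
    void incrementFieldCount(String name, String value) {
        if (value == null) {
            return; // document did not contain this field
        }
        countsByField.computeIfAbsent(name, k -> new HashMap<>())
                     .merge(value, 1, Integer::sum);
    }

    /** Return up to n (value, count) entries, highest count first; ties sorted by value. */
    List<Map.Entry<String, Integer>> top(String name, int n) {
        Map<String, Integer> counts = countsByField.getOrDefault(name, new HashMap<>());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int end = Math.max(0, Math.min(n, entries.size()));
        return entries.subList(0, end);
    }
}
```

With a tally like this, the "Top 10" display is just
tally.top("doctype", 10) once the search has finished.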


This would require modifying Lucene to pull only these selected
Document fields and call this collect() method during search.  I know
this will slow the search down -- the question is, how much?  Since
IndexReader reads fields sequentially, one could index Documents so
that the small, commonly "collected" fields come first, and the big
fields (that aren't ever "collected") come last.  Another option is to
allow the client code to put a limit on how many times the searcher
calls collect(), e.g., only call it for the first 1000 results, or
with probability 1/100.  Or let the collect() method return a boolean
to signal whether to continue collecting, etc.
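
To make the last option concrete, here is a rough sketch of a collect()
that returns a boolean, combined with a cap on the number of calls.
Everything in it is hypothetical: StoppableFieldCollector,
LimitedCollector and CollectLoop are made-up names, and CollectLoop.run()
only stands in for whatever loop the searcher would run internally (it
walks a plain array instead of an index).

```java
// Hypothetical variant of the proposed collector: collect() returns
// false to tell the searcher to stop calling it.
interface StoppableFieldCollector {
    boolean collect(int doc, float score, String fieldName, String fieldValue);
}

// Stops collecting after a fixed number of calls (limit is assumed to be >= 1).
class LimitedCollector implements StoppableFieldCollector {
    private final int limit;
    private int seen = 0;

    LimitedCollector(int limit) {
        this.limit = limit;
    }

    @Override
    public boolean collect(int doc, float score, String fieldName, String fieldValue) {
        seen++;
        return seen < limit; // false once the limit-th call has been handled
    }

    int seen() {
        return seen;
    }
}

// Stand-in for the searcher's inner loop: values[doc] plays the role of
// the stored field value for document number doc.
class CollectLoop {
    /** Returns how many times collect() was called before stopping. */
    static int run(String[] values, StoppableFieldCollector collector) {
        int calls = 0;
        for (int doc = 0; doc < values.length; doc++) {
            calls++;
            if (!collector.collect(doc, 1.0f, "doctype", values[doc])) {
                break; // collector asked to stop; skip remaining documents
            }
        }
        return calls;
    }
}
```

The appeal of the boolean return is that the cutoff policy (first 1000
results, sampling, a time budget) lives entirely in client code, and the
searcher only needs to check one return value per document.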

Comments?  I actually need something like this for a project I'm
working on, and I'd be happy to implement it.

-Chris

PS - Let me know if I should post this to the lucene-user list instead.


