lucene-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: search quality - assessment & improvements
Date Wed, 18 Jul 2007 23:54:38 GMT

: The Similarity portion of the payload functionality could be used for
: scoring binary fields.

that can be used as a hook to decide how to evaluate an arbitrary byte[]
payload as a float for the purposes of scoring -- but it doesn't address
the problem of how to write/read a payload which is not term specific.
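
For reference, the "byte[] to float" hook could look roughly like the sketch below. It assumes a method shaped like Similarity's scorePayload(byte[], int, int); the class name FloatPayloadScorer, the 4-byte big-endian float layout and the neutral 1.0 fallback are illustrative assumptions, not anything Lucene defines.

```java
import java.nio.ByteBuffer;

// Hypothetical standalone class; it does not extend Lucene's Similarity.
class FloatPayloadScorer {
    // Assumption: the payload region holds a 4-byte big-endian IEEE 754 float.
    // Falls back to a neutral factor of 1.0 when the region is too short.
    static float scorePayload(byte[] payload, int offset, int length) {
        if (payload == null || length < 4) {
            return 1.0f;
        }
        // wrap() positions the buffer at offset; getFloat() reads 4 bytes from there.
        return ByteBuffer.wrap(payload, offset, length).getFloat();
    }
}
```

This only covers interpreting bytes that are already attached to a term position; it says nothing about where segment-level data would be stored.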

Doron is looking for a way to encode in the index arbitrary statistics
which are not specific to a single term instance (or even to a specific
document) ... mainly the average length of a field per doc.  what we were
speculating on is the notion of a generic API for writing arbitrary
"payloads" with each segment, and registering a PayloadMerger hook that
would give the IndexWriter a method to call when it came time to merge
segments (so it would know how to merge the generic segment payload data).

then Doron could do something like...

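   AverageLengthPayloadMerger p = new AverageLengthPayloadMerger();
   IndexWriter w = ...
   w.setPayloadMerger(p);   // hypothetical hook, not an existing IndexWriter method
   foreach (input) {
      Document d = ...
      p.incrStats(computeLength(d));   // record this doc's field length
      w.addDocument(d);
   }
   w.flush();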

...if a merge happens, IndexWriter would call a method on the
PayloadMerger giving it the payloads of the segments being merged, and it
would already know about the stats it was recording from the current
segment, so it could then compute the new stats for the new segment and
return them to the IndexWriter to be written to disk.  when the flush
happens, the IndexWriter would also call a method on the PayloadMerger
which would do roughly the same thing (except there is no merging since
we're just finishing off a segment).

the same PayloadMerger would be used in the event of an optimize.
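
A minimal sketch of what such a merger might look like, assuming a hypothetical PayloadMerger interface with one method for flush and one for merge. The method names flushPayload and mergePayloads and the 16-byte layout (total length, then doc count) are assumptions made for illustration; none of this exists in IndexWriter. For simplicity the merge here combines only the payloads it is handed.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical hook: the interface and its method names are assumptions.
interface PayloadMerger {
    // Called at flush: returns the payload for the segment being finished.
    byte[] flushPayload();

    // Called at merge or optimize: combines the payloads of the merged segments.
    byte[] mergePayloads(List<byte[]> segmentPayloads);
}

// Tracks total field length and doc count so an average can be derived later.
class AverageLengthPayloadMerger implements PayloadMerger {
    private long totalLength = 0;
    private long docCount = 0;

    // Record the field length of one document in the current segment.
    void incrStats(int length) {
        totalLength += length;
        docCount++;
    }

    // Assumed layout: 8 bytes of total length followed by 8 bytes of doc count.
    static byte[] encode(long totalLength, long docCount) {
        return ByteBuffer.allocate(16).putLong(totalLength).putLong(docCount).array();
    }

    static long[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new long[] { buf.getLong(), buf.getLong() };
    }

    @Override
    public byte[] flushPayload() {
        byte[] payload = encode(totalLength, docCount);
        // Start counting from zero for the next segment.
        totalLength = 0;
        docCount = 0;
        return payload;
    }

    @Override
    public byte[] mergePayloads(List<byte[]> segmentPayloads) {
        long length = 0;
        long docs = 0;
        for (byte[] p : segmentPayloads) {
            long[] stats = decode(p);
            length += stats[0];
            docs += stats[1];
        }
        return encode(length, docs);
    }
}
```

Storing the sum and the count, rather than the average itself, is what makes the merge a plain addition; two averages cannot be combined without knowing how many docs each one covers. Docs deleted before a merge would still be counted in this sketch.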

when opening an IndexReader, some new reader.getPayload() method would
recursively return all the generic payloads of all the existing segments,
and Doron could quickly calculate the average length for all docs to use
in his Similarity.
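
On the read side, the calculation could be as small as the sketch below. It assumes each segment payload is 16 bytes holding a total field length followed by a doc count (both longs), and that the payloads have already been collected into a list; reader.getPayload() itself is still hypothetical, so the sketch takes the list as a plain argument. The class name AverageLengthReader is made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper; the payload layout is an assumption, not a Lucene format.
class AverageLengthReader {
    // Sums per-segment totals and counts, then divides once at the end.
    static float averageLength(List<byte[]> segmentPayloads) {
        long totalLength = 0;
        long docCount = 0;
        for (byte[] payload : segmentPayloads) {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            totalLength += buf.getLong();
            docCount += buf.getLong();
        }
        // An empty index has no meaningful average; return 0 rather than NaN.
        return docCount == 0 ? 0f : (float) totalLength / docCount;
    }
}
```

The result would be computed once when the reader is opened and then handed to the Similarity, rather than recomputed per query.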


(NOTE: i'm really not very familiar with all the merge policy stuff, i'm
sure i'm glossing over a lot of details that would make this a lot more
complicated than the pseudo-code i'm imagining)


-Hoss



