Subject Re: Proposal: extracting term-level stats from query process
Date Thu, 11 Mar 2004 21:23:05 GMT
Thanks for the response, Doug

My working assumption was that whatever analysis was done in evaluating the
query would be costly to repeat, but from your breakdown of what is actually
required it looks like all of my requirements can be met with calls to
IndexReader#docFreq(term), which I would expect to be very quick.

As for your suggestion on selecting "best fragments" using RamDirectories: for
the purposes of highlighting, the RAM indexing code and the highlighting code
(which marks up the original text) would need to find a way to share the
results of the same tokenization pass if the approach is to perform well.
Before considering what is involved in coding this, I did some benchmarking to
compare processing times for different operations on the same set of 16 KB
docs using the same (stemming) analyzer:
- Tokenization: 86 ms (avg time taken simply to tokenize the doc)
- Highlighting: 90 ms (avg time taken to parse query terms, tokenize,
  highlight query terms and select best fragments using the current impl)
- RAM indexing: 118 ms (avg time taken to tokenize and index the doc only)

As you can see, the RAM indexing approach to highlighting incurs a noticeable
overhead in its first step alone (118 ms against 90 ms for the whole of the
current highlighting pass), before I consider adding the steps to fragment
docs, query and highlight, so I'm not sure this approach is worth pursuing.
I am tempted to just add some idf weighting into the current highlighter's
fragment selection logic.
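
To make the idf idea concrete, here is a minimal sketch of what such weighting
might look like. It is not the current highlighter code: the class and method
names are hypothetical, the docFreqs map stands in for values that would come
from IndexReader#docFreq(term), and the idf form ln(numDocs / (docFreq + 1)) + 1
is an assumption chosen for illustration. Each fragment is scored by the summed
idf of the distinct query terms it contains, so a rare term outweighs a common
one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene or the existing highlighter.
class IdfFragmentScorer {

    // Assumed idf form: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Sum the idf of each distinct query term found in the fragment.
    // docFreqs stands in for lookups via IndexReader#docFreq(term);
    // a term missing from the map is treated as having docFreq 0.
    static double scoreFragment(List<String> tokens, Set<String> queryTerms,
                                Map<String, Integer> docFreqs, int numDocs) {
        Set<String> matched = new HashSet<String>();
        double score = 0.0;
        for (String token : tokens) {
            if (queryTerms.contains(token) && matched.add(token)) {
                Integer df = docFreqs.get(token);
                score += idf(df == null ? 0 : df, numDocs);
            }
        }
        return score;
    }

    // Return the index of the highest-scoring fragment,
    // or -1 if no fragment contains any query term.
    static int bestFragment(List<List<String>> fragments, Set<String> queryTerms,
                            Map<String, Integer> docFreqs, int numDocs) {
        int best = -1;
        double bestScore = 0.0;
        for (int i = 0; i < fragments.size(); i++) {
            double s = scoreFragment(fragments.get(i), queryTerms, docFreqs, numDocs);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("the", 900);   // very common term
        docFreqs.put("lucene", 5);  // rare term
        Set<String> query = new HashSet<String>(Arrays.asList("the", "lucene"));
        List<List<String>> fragments = Arrays.asList(
            Arrays.asList("the", "the", "cat"),
            Arrays.asList("lucene", "index"));
        System.out.println("best fragment index: "
            + bestFragment(fragments, query, docFreqs, 1000));
    }
}
```

Counting each query term once per fragment is a design choice in this sketch;
it keeps a fragment that repeats a common word from beating one that contains
a single rare term.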


