lucene-dev mailing list archives

From: Doug Cutting <cutt...@apache.org>
Subject: Re: Proposal: extracting term-level stats from query process
Date: Thu, 11 Mar 2004 21:52:04 GMT

markharw00d@yahoo.co.uk wrote:
> As for your suggestion on selecting "best fragments" using RAMDirectories:
> for the purposes of highlighting, the RAM indexing code and the
> highlighting code (marking up the original text) would need to find a way
> to share the results of the same tokenization pass if it was to be
> performant.

An easy way to do this would be to simply tokenize once, saving the 
results in an array, then define a tokenizer that just returns 
successive elements of the array.
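
As a rough illustration of the idea above, here is a minimal sketch in plain Java. It deliberately avoids the real Lucene classes: SimpleToken, SimpleTokenStream, CachedTokenStream and TokenCache are hypothetical stand-ins, and the letter/digit tokenizer is only a placeholder for whatever analyzer is actually in use. The one convention assumed is that next() returns null once the stream is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer token: term text plus character offsets.
final class SimpleToken {
    final String text;
    final int start;
    final int end;

    SimpleToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical minimal stream contract: next() returns null when exhausted.
interface SimpleTokenStream {
    SimpleToken next();
}

// Replays an already-computed token array, so no text is analyzed a second time.
final class CachedTokenStream implements SimpleTokenStream {
    private final SimpleToken[] tokens;
    private int pos = 0;

    CachedTokenStream(SimpleToken[] tokens) {
        this.tokens = tokens;
    }

    @Override
    public SimpleToken next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }
}

final class TokenCache {
    // The single expensive pass. This placeholder splits on non-alphanumeric
    // characters and lowercases; a real analyzer would go here instead.
    static SimpleToken[] tokenizeOnce(String text) {
        List<SimpleToken> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) i++;
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) i++;
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                out.add(new SimpleToken(term, start, i));
            }
        }
        return out.toArray(new SimpleToken[0]);
    }
}
```

The indexing code and the highlighting code would each wrap the same array in their own CachedTokenStream, so each gets an independent cursor while the analysis cost is paid once.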

> Before considering what is involved in coding this I did some benchmarking
> to compare processing times for different operations on the same set of
> 16 KB docs using the same (stemming) analyzer:
> - Tokenization: 86 ms (avg time taken to simply tokenize the doc)
> - Highlighting: 90 ms (avg time taken to parse query terms, tokenize,
>   highlight query terms and select best fragments using current impl)
> - RAM indexing: 118 ms (avg time taken to tokenize and index docs only)

That's slower than I'd expect.  Any idea how much of this time is the 
tokenizer proper and how much is the stemmer, etc?  The easiest way I 
can think to check this would be to time the tokenizer alone, then 
tokenizer + stemmer, then tokenizer + stemmer + stop, etc.
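
A sketch of that incremental timing, again in plain Java rather than against the real analyzer classes. Everything here is an assumption made for illustration: StageTimer and its methods are hypothetical, the "stemmer" is a toy that only strips a trailing "s", and the stop list is made up. The point is the measurement pattern: time each cumulative chain and attribute the difference between consecutive chains to the stage that was added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

final class StageTimer {
    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "of", "the"));

    // Written after each measurement so the JIT cannot discard the work.
    static volatile long sink;

    // Stand-in tokenizer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    // Toy stemmer (NOT a real stemming algorithm): strips one trailing "s"
    // from tokens longer than three characters.
    static List<String> stem(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            boolean strip = t.length() > 3 && t.endsWith("s");
            out.add(strip ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }

    // Stand-in stop filter.
    static List<String> removeStops(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // Average wall-clock milliseconds per run of the given chain,
    // measured after an equal number of warm-up runs.
    static double avgMillis(Function<String, List<String>> chain, String doc, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) total += chain.apply(doc).size(); // warm-up
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) total += chain.apply(doc).size();
        long elapsed = System.nanoTime() - t0;
        sink = total;
        return elapsed / 1e6 / runs;
    }
}
```

One would call avgMillis three times on the same document: with StageTimer::tokenize, with d -> StageTimer.stem(StageTimer.tokenize(d)), and with d -> StageTimer.removeStops(StageTimer.stem(StageTimer.tokenize(d))). Subtracting consecutive results gives a rough per-stage cost; the numbers are only indicative, since a hand-rolled timer like this is sensitive to JIT and GC noise.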

> As you can see, the RAM indexing approach to highlighting incurs some
> noticeable overheads in its first step before I consider adding the steps
> to fragment docs, query and highlight, so I'm not sure if this approach is
> worth pursuing. I am tempted to just add some idf weighting into the
> current highlighter's fragment selection logic.

Yeah, indexing is a little heavyweight for this.  It would be 
interesting to write an in-memory version of IndexReader and IndexWriter 
that don't serialize anything to bytes.  This would ignore the 
Directory, and just represent the index as, e.g. a TreeMap of Terms, 
with ArrayLists for TermDocs, etc.  Someday, when I have some free time 
maybe I'll give this a try...  Such a thing would be much faster, and 
might be suitable for using sentence or fragment indexing to implement 
summaries.
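
To make the data structure concrete, here is a very rough sketch of what such an in-memory index might look like. It is not an IndexReader or IndexWriter implementation: InMemoryIndex and Posting are hypothetical names, terms are plain strings, and each "document" is just a list of already-analyzed tokens (which could equally be a sentence or fragment). Nothing is serialized to bytes; postings live in ArrayLists hanging off a TreeMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical posting: one (document, term frequency) pair.
final class Posting {
    final int doc;
    int freq;

    Posting(int doc) {
        this.doc = doc;
        this.freq = 1;
    }
}

// Hypothetical in-memory index: sorted terms mapped to posting lists.
final class InMemoryIndex {
    private final TreeMap<String, ArrayList<Posting>> postings = new TreeMap<>();
    private int docCount = 0;

    // Adds one document (or sentence/fragment) given as analyzed tokens
    // and returns the document number assigned to it.
    int addDocument(List<String> tokens) {
        int doc = docCount++;
        for (String term : tokens) {
            ArrayList<Posting> list =
                    postings.computeIfAbsent(term, k -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            if (last != null && last.doc == doc) {
                last.freq++;                 // term seen again in the same doc
            } else {
                list.add(new Posting(doc));  // first occurrence in this doc
            }
        }
        return doc;
    }

    // Postings for a term in increasing document order; empty if unknown.
    List<Posting> termDocs(String term) {
        ArrayList<Posting> list = postings.get(term);
        if (list == null) return Collections.emptyList();
        return Collections.unmodifiableList(list);
    }

    // Number of documents containing the term.
    int docFreq(String term) {
        return termDocs(term).size();
    }

    int numDocs() {
        return docCount;
    }
}
```

Because documents are added in increasing order, each posting list stays sorted by document number without extra work, and docFreq and numDocs give the counts an idf-style weight would need when scoring fragments.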

Doug


