lucene-dev mailing list archives

From Bruce Ritchie <>
Subject Re: Dmitry's Term Vector stuff, plus some
Date Wed, 25 Feb 2004 22:25:45 GMT

wrote:
> The conclusion?
> * Reading text from the index is very quick (6 meg of document text in 10ms!)
> * Tokenizing the text is much slower, but this is only noticeable if you're
>   processing a LOT of text. (My docs were an average of 1k in size so only took 5ms each)
> * The time taken for the highlighting code is almost all down to tokenizing
>
> Bruce,
> Could a short-term (and possibly compromised) solution to your performance
> problem be to offer only the first 3k of these large 200k docs to the
> highlighter, in order to minimize the amount of tokenization required?
> Arguably the most relevant bit of a document is typically in the first 1k anyway?
> Also, for the purposes of the search index, are you doing all you can to strip
> out a lot of the duplicated text (>> your comments etc) from the reply posts
> typically found in your forums?

The text in question isn't forum text - it's knowledge base documents. I do strip out just
about everything I can (all HTML tags are stripped, for example), but plain and simple, some
of the docs are actual manuals, which tend to be on the larger side. For these types of docs
the first 1k tends to be rather useless with regards to the search (introductory text, etc.),
so I can't just toss out the bulk of the document where the actual relevant information is.

> My timings seem in line with your estimates - a 1k doc takes 5ms, so a 200k doc is close
> to a second!

Yes, it's painful. One way or another I'm going to find a way to store the offset information
I need so that retokenization isn't required - generating really good search summaries is just
too great a benefit to our application to pass up.
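For what it's worth, the idea is sketchable in plain Java: tokenize once at index time, persist the (term, start, end) character offsets alongside the stored text, and then build highlights at query time by slicing the stored text at those offsets instead of re-running the analyzer. The class name and the regex tokenizer below are hypothetical stand-ins, not Lucene code - they just show the shape of the approach:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetHighlighter {
    // A token's character offsets, recorded once at index time.
    record Offset(String term, int start, int end) {}

    // Tokenize once and keep (term, start, end) so later highlighting
    // can slice the stored text instead of re-running the analyzer.
    static List<Offset> tokenizeWithOffsets(String text) {
        List<Offset> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            offsets.add(new Offset(m.group().toLowerCase(), m.start(), m.end()));
        }
        return offsets;
    }

    // Wrap every stored occurrence of the query term in <b>..</b>
    // using only the saved offsets -- no tokenization at query time.
    static String highlight(String text, List<Offset> offsets, String term) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Offset o : offsets) {
            if (o.term().equals(term)) {
                out.append(text, pos, o.start())
                   .append("<b>").append(text, o.start(), o.end()).append("</b>");
                pos = o.end();
            }
        }
        return out.append(text.substring(pos)).toString();
    }

    public static void main(String[] args) {
        String doc = "Lucene makes search fast; Lucene also stores fields.";
        List<Offset> offsets = tokenizeWithOffsets(doc);
        // -> <b>Lucene</b> makes search fast; <b>Lucene</b> also stores fields.
        System.out.println(highlight(doc, offsets, "lucene"));
    }
}
```

The win is exactly the one measured above: the per-query cost becomes a linear scan over a small offset list plus substring copies, rather than re-tokenizing 200k of text.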


Bruce Ritchie
