lucene-dev mailing list archives

From Bruce Ritchie <>
Subject Re: Dmitry's Term Vector stuff, plus some
Date Fri, 27 Feb 2004 16:59:55 GMT
Doug Cutting wrote:
>> Doug, do you believe the storing (as an option of course) of token 
>> offset information would be something that you'd accept as a 
>> contribution to the core of Lucene? Does anyone else think that this 
>> would be beneficial information to have?
> I have mixed feelings about this.  Aesthetically I don't like it a lot, 
> as it is asymmetric: indexes store sequential positions, while vectors 
> would store character offsets.  On the other hand, it could be useful 
> for summarizing long documents.

I'm sorry, I wasn't clear in my description. I was thinking of storing the token offset information
*in addition* to the sequential positions that were (temporarily?) removed from the term vector
just prior to it being committed, not instead of them.

> Another approach that someone mentioned for solving this problem is to 
> create a fragment index for long documents.  For example, if a document 
> is over, say, 32k, then you could create a separate index for it that 
> chops its text into 1000 character overlapping chunks.  The first chunk 
> would be characters 0-1000, the next 500-1500, and so on.  Then, to 
> summarize, you open this index and search it to figure out which chunks 
> have the best hits.  Then you can, based on the chunk document id, seek 
> into the full text and retokenize only selected chunks.  Such indexes 
> should be fast to open, since they'd be small.  I'd recommend calling 
> IndexWriter#setUseCompoundFile(true) on these, and optimizing them. That 
> way there'd only be a couple of files to open.
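The overlapping-chunk scheme described above can be sketched roughly as follows. This is plain Java with no Lucene calls; the class name and the CHUNK_SIZE/STEP constants are my own, chosen to match the 0-1000, 500-1500 example:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    // Each chunk is CHUNK_SIZE characters, and a new chunk starts
    // every STEP characters, so consecutive chunks overlap by half.
    static final int CHUNK_SIZE = 1000;
    static final int STEP = 500;

    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start += STEP) {
            int end = Math.min(start + CHUNK_SIZE, text.length());
            out.add(text.substring(start, end));
            if (end == text.length()) break;  // last chunk reached
        }
        return out;
    }
}
```

Each chunk would then be indexed as its own small document whose id (or a stored field) records the chunk's start offset, so a search over the chunk index tells you exactly which character ranges of the full text to seek to and retokenize.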

In some respects that still doesn't solve my core issue - it just mitigates it for large documents.

Retokenization seems to me to be a task that can be done away with given the right design. Reducing
the time it takes to display search results by a minimum of 75ms (5ms per document x the default
number of documents for my application), and more likely 100-150ms (7-10ms per document), seems a
worthwhile endeavour. Of course, on a multiprocessor machine I could multithread the highlight
code, which would reduce that time somewhat.

I was also thinking about another approach where I store the token offset information in separate
unindexed fields - one new field of token offset information for each original field. I could
generate this information in a separate analyzer run when the document is added to the index.
This should satisfy my goal of having the offset information easily accessible at search time.
Coming up with a decent encoding mechanism to store all the term offsets in a single field
shouldn't be too difficult. Do you believe this would be a worthwhile approach?
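For what it's worth, one possible encoding for such a field could look like the sketch below. The "start-end" text format and the class name are purely illustrative (nothing here is Lucene API); the point is only that a list of (start, end) offset pairs round-trips through a single stored string:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCodec {
    // Encode (start, end) offset pairs as "start-end start-end ..." so
    // the whole list for a field fits in one stored, unindexed field.
    static String encode(List<int[]> offsets) {
        StringBuilder sb = new StringBuilder();
        for (int[] o : offsets) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(o[0]).append('-').append(o[1]);
        }
        return sb.toString();
    }

    // Decode the stored string back into (start, end) pairs at search
    // time, without retokenizing the original text.
    static List<int[]> decode(String field) {
        List<int[]> out = new ArrayList<>();
        if (field.isEmpty()) return out;
        for (String pair : field.split(" ")) {
            int dash = pair.indexOf('-');
            out.add(new int[] {
                Integer.parseInt(pair.substring(0, dash)),
                Integer.parseInt(pair.substring(dash + 1))
            });
        }
        return out;
    }
}
```

A delta encoding of the start offsets would shrink the field further, but even this naive form keeps the highlighter from having to re-run the analyzer at display time.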


Bruce Ritchie
