lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Dmitry's Term Vector stuff, plus some
Date Thu, 26 Feb 2004 18:51:03 GMT
Bruce Ritchie wrote:
> Doug, do you believe the storing (as an option of course) of token 
> offset information would be something that you'd accept as a 
> contribution to the core of Lucene? Does anyone else think that this 
> would be beneficial information to have?

I have mixed feelings about this.  Aesthetically I don't like it a lot, 
as it is asymmetric: indexes store sequential positions, while vectors 
would store character offsets.  On the other hand, it could be useful 
for summarizing long documents.

Another approach that someone mentioned for solving this problem is to 
create a fragment index for long documents.  For example, if a document 
is over, say, 32k, then you could create a separate index for it that 
chops its text into 1000 character overlapping chunks.  The first chunk 
would be characters 0-1000, the next 500-1500, and so on.  To 
summarize, you open this index and search it to figure out which chunks 
have the best hits.  Then, based on the chunk document ids, you can seek 
into the full text and retokenize only the selected chunks.  Such indexes 
should be fast to open, since they'd be small.  I'd recommend calling 
IndexWriter#setUseCompoundFile(true) on these, and optimizing them. 
That way there'd only be a couple of files to open.
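To make the chunking arithmetic concrete, here is a minimal Java sketch of the overlapping-window scheme described above (1000-character chunks advancing by 500, so every position falls in two chunks).  The class and method names are my own, and the step of indexing each chunk as a small document in the per-document fragment index is only noted in a comment, since the exact IndexWriter calls depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the overlapping-chunk scheme: 1000-character windows
// advancing by 500 characters.  Each (start, end) pair would become one
// small document in the per-document "fragment index", storing its
// offsets so matching chunks can be seeked to and retokenized later.
public class FragmentChunker {
    static final int CHUNK_SIZE = 1000;
    static final int STRIDE = 500;

    // Returns the [start, end) character offsets of each chunk covering
    // a document of the given length.
    static List<int[]> chunkOffsets(int textLength) {
        List<int[]> offsets = new ArrayList<>();
        int start = 0;
        while (true) {
            int end = Math.min(start + CHUNK_SIZE, textLength);
            offsets.add(new int[] { start, end });
            if (end == textLength) {
                break;  // last chunk reaches the end of the text
            }
            start += STRIDE;
        }
        return offsets;
    }
}
```

For a 2200-character document this yields chunks 0-1000, 500-1500, 1000-2000, and 1500-2200; after searching the fragment index, only the chunks with the best hits need to be read back from the stored text and retokenized.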

