lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 18:05:46 GMT
>
>
>That's something new. Unindexed fields such as keyword
>fields won't have term ids? I hope you can clarify
>further...
>
I believe keywords are indexed, just not tokenized, so the entire field
value is treated as a single term.
This is typically used for fields like "price" or "id", the typical
database-style situation of one field holding one value.
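
A minimal sketch of that distinction, in plain Java rather than Lucene's own classes. The class name, the method, and the whitespace/lowercase tokenization are assumptions made for illustration, not the engine's actual analysis chain:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Lucene's classes.
class KeywordFieldSketch {
    /**
     * Returns the terms that would be indexed for a field value.
     * An un-tokenized (keyword) field yields exactly one term: the whole value.
     */
    static List<String> terms(String value, boolean tokenized) {
        List<String> out = new ArrayList<>();
        if (!tokenized) {
            out.add(value); // whole value kept verbatim as a single term
            return out;
        }
        // Assumed analysis for the sketch: lowercase, split on whitespace.
        for (String token : value.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(terms("SKU 1234-A", false)); // keyword-style field
        System.out.println(terms("SKU 1234-A", true));  // tokenized field
    }
}
```

Either way the field gets at least one term, so it is still searchable; the keyword form just requires an exact match on the whole value.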

>
>Hmm.. will there be a way we can convert/add
>vectorization to the old segments? The users may want
>some kind of migration path to the new format other
>than reindexing the entire index. 
>
Yes and no. If you add a document to the old index, you could declare 
new fields that are vectorized and then, as the segments are merged, the 
new segments will carry the vectorization data. However, (a) changing 
previously un-vectorized fields to vectorized would cause a problem 
during the merge, and (b) old documents still will not have any vectors, 
but of course they won't have vectorized fields either.

So it is self-consistent and old indexes can be used with the new 
engine. Also new indexes can be used with an old engine, but 
vectorization data will be lost after a merge, and vectorization files 
will be left behind when their segment is deleted.

As we've been saying, this is the first pass implementation, so none of 
this is set in stone. Most of it can be improved to some degree.

- We could probably support switching vectorization of a field on and 
off between documents. In this case, once the field is vectorized at 
least in one document, all documents with this field will return vectors 
for it, but the vectors will be blank for those where the field was not 
vectorized. Is this preferable?

- We can improve cleanup of files by deleting "<seg>.*" instead of a 
list of explicitly enumerated files as is done now. This won't help with 
previous engines, but it will help going forward if we need to add other 
files in the future. Does anyone know why the cleanup is done the way it 
is currently?
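
The two proposals above can be sketched in plain Java. Everything here is hypothetical: the class names, the in-memory map standing in for vector storage, and the file names are assumptions for illustration, not the current implementation. The first class returns a blank vector for a document whose field was not vectorized once that field has been vectorized anywhere; the second deletes every file named "<seg>.*" instead of working from an enumerated list.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for per-document term vectors. */
class TermVectorStoreSketch {
    // Fields that have been vectorized in at least one document.
    private final Set<String> everVectorized = new HashSet<>();
    // Term frequencies keyed by "docId/field"; an empty map means the
    // field exists in that document but was not vectorized there.
    private final Map<String, Map<String, Integer>> vectors = new HashMap<>();

    private static String key(int docId, String field) {
        return docId + "/" + field;
    }

    void addField(int docId, String field, String[] terms, boolean vectorized) {
        Map<String, Integer> freqs = new HashMap<>();
        if (vectorized) {
            everVectorized.add(field);
            for (String term : terms) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        vectors.put(key(docId, field), freqs);
    }

    /** Null means no vector at all; an empty map is a blank vector. */
    Map<String, Integer> getVector(int docId, String field) {
        Map<String, Integer> freqs = vectors.get(key(docId, field));
        if (freqs == null || !everVectorized.contains(field)) {
            return null;
        }
        return Collections.unmodifiableMap(freqs);
    }
}

/** Hypothetical cleanup helper: removes every file named "<seg>.*". */
class SegmentCleanupSketch {
    static List<String> deleteSegmentFiles(Path indexDir, String segment) throws IOException {
        String prefix = segment + "."; // the dot keeps "_1" from matching "_10.xyz"
        List<Path> matches = new ArrayList<>();
        // Collect first, then delete, so the listing is not modified mid-iteration.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && name.startsWith(prefix)) {
                    matches.add(entry);
                }
            }
        }
        List<String> deleted = new ArrayList<>();
        for (Path match : matches) {
            Files.delete(match);
            deleted.add(match.getFileName().toString());
        }
        Collections.sort(deleted);
        return deleted;
    }
}
```

Matching on the segment name plus the dot, rather than the bare name, matters once segment names share prefixes. Files without a segment prefix are left alone.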

>The only caveat is that I would prefer that the unique
>term id generation be computationally fast and low on
>storage requirements. (yes, I know this part only
>affects indexing.. I'm just trying to stick to
>lucene's goal of fast searching and fast indexing) 
>
Yes, I know. Me too. Interestingly enough, indexing seems to be 
completely IO-bound. I was watching the CPU monitor last night as I was 
running some simple indexing and the CPU never went higher than 5% 
utilization. I didn't have a chance to compare this to a previous 
version yet. Does anyone know if this is expected behavior or is it 
because I managed to break something?


