lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 02:59:12 GMT

Dave Kor wrote:

>>*) I'm planning to add another bit: "storeTermVector" (better name,
>>anyone?), which will indicate that the field's term vector will need to
>>be stored.
>My counter question is this: What kind of fields
>should be vectorized? Considering the uses of
>TermFreqVectors, my first impression is that all
>indexed fields should be vectorized, the remaining
>fields that are not indexed should not be touched. If
>my assertion is true (it may not be), then we won't
>need the "storeTermFreqVector" bit. Otherwise, I agree
>that the bit be stored in *.fnm files. 
This is a good argument. On the other hand, there are plenty of users 
who use Lucene now without the vectors, so it stands to reason that 
vectorization is optional. Two files are added to the index and some 
time during the indexing is spent on it, so vectorization is not free 
even if you don't use it on the query side.

The way things are working out right now, only indexed fields can be 
vectorized. Does it make sense to vectorize keyword fields? If access to 
the keyword value will end up being faster via its vector than via the 
document fields, then yes.

So there it is. Still no decision, but these are the arguments.

>>*) The term vector, as I understand it, is a list of unique terms that
>>occur in a given field. They will be stored by term id (in ascending
>>order of IDs, not terms).
>Since we might want TermFreqVectors to operate over
>several indexes, I thought it would be useful for the
>Term ID to be equal to (field+term).hashCode(). This
>way, Term IDs are universal across indexes
>(contents:dog in index A == contents:dog in index B).
Is hashCode unique? I thought it was only unique for objects that did 
not define it, and in that case it is equal to the object's memory address. 
The String object defines it to be a hash of the character values, I think, 
so it's not unique. Great idea though! I've been trying to resolve the 
same problem for a while. I think I have an answer that I can make work 
in the timeframe that I have, but it is memory-expensive and somewhat 
computationally expensive too.
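
To make the collision concern concrete, here is a minimal sketch. The class and the termId helper are hypothetical names used only for illustration; the only thing taken from the thread is the proposed (field+term).hashCode() scheme. "Aa" and "BB" are a well-known pair of Java strings with the same hash code, and because String.hashCode() is computed character by character, a shared field prefix does not separate them.

```java
// Hypothetical illustration of the ID scheme proposed in the thread;
// not part of Lucene.
final class TermHashDemo {

    // Term ID as proposed: the hash of the field name concatenated with the term.
    static int termId(String field, String term) {
        return (field + term).hashCode();
    }

    public static void main(String[] args) {
        int a = termId("contents", "Aa");
        int b = termId("contents", "BB");
        // Two different terms in the same field end up with the same ID.
        System.out.println(a == b); // prints "true"
    }
}
```

So hash-based IDs would be consistent across indexes, but two distinct terms could share an ID, and something would have to detect or resolve that.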

>>In addition to the terms, I'm planning to store the frequency of the
>>term (the number of times it occurs in the field). This, together with
>>the total number of terms in the field, should be enough to compute the
>>term's weight,
>As far as I can see, it would only be useful to
>pre-compute and store such data. Another useful bit of
>information that we might want to store is the
>determinant of each Document's TermFreqVector. The
>determinant is often used by many text processing
>algorithms and since it will probably be a long, it
>only adds 4 bytes to each indexed document. 
Ok. Maybe I'll leave the space for it, or maybe we'll add it later. I 
have no idea how to compute the determinant. :)
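
For reference, a rough sketch of the per-field data discussed above: unique term IDs in ascending order, each with its frequency, plus the total number of terms in the field. All names here are hypothetical and this is not the planned file format. The relativeFrequency method is just one simple weight (frequency divided by total terms), assumed for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of a term frequency vector for one field.
final class TermFreqVectorSketch {
    final int[] termIds;   // unique term IDs, ascending
    final int[] freqs;     // freqs[i] = occurrences of termIds[i] in the field
    final int totalTerms;  // total number of term occurrences in the field

    // fieldTermIds: the ID of every term occurrence in the field, in any order.
    TermFreqVectorSketch(int[] fieldTermIds) {
        // TreeMap keeps the keys sorted, which gives ascending term IDs.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int id : fieldTermIds) {
            counts.merge(id, 1, Integer::sum);
        }
        termIds = new int[counts.size()];
        freqs = new int[counts.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            termIds[i] = e.getKey();
            freqs[i] = e.getValue();
            i++;
        }
        totalTerms = fieldTermIds.length;
    }

    // One simple weight: the fraction of the field's terms that are this term.
    double relativeFrequency(int index) {
        return (double) freqs[index] / totalTerms;
    }
}
```

For example, a field whose occurrences have IDs 7, 3, 7, 5, 7 would give term IDs [3, 5, 7] with frequencies [1, 1, 3] and a total of 5.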

>>*) Preliminary file structures. These are the files I'm planning to add
>>to each segment:
>I'd need more time to analyze this. My current
>project should end this Wednesday and I'll be able to
>have a look at the file format then. Will reply on
>Thursday or Friday.
