lucene-dev mailing list archives

From Dmitry Serebrennikov
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 05:45:44 GMT
>Come to think of it, there may be another way to look
>at this issue. As far as a TermFreqVector is
>concerned, each unindexed field in a document is
>equivalent to one single Term ID with a frequency of
>1. So it becomes relatively trivial to handle
>unindexed fields from both a computational and storage
Well, I think unindexed fields can't really be used for tvs because there
are no term ids for them. I mean, we could invent them, but then there is
no facility to compare them and see that two documents use the same
term. And if there were, well, that's just like a keyword field. So I
think tvs only apply to indexed fields, period.

>So now, the question becomes an issue of whether we
>allow developers to simply toggle vectorization
>on/off? and if they switch on vectorization, should we
>allow them to specify which fields to vectorize?
Also tokenized vs. keyword. But ok.

>My answer for question 1 is yes, the file system
>should be organized in such a way that enabling
>vectorization should only result in the creation of
>new files containing the term vectors without (or with
>very little) changes to the original index files. 
Right. That's how it works. Well... Actually, right now the files will 
be created no matter what for new segments. However, old segments that 
do not have these files work also. I agree with your point in general 

>My answer question 2 is no for now since this is only
>the initial version, we should not make it overly
>complex. However if there is demand, such a feature
>should be implementable in the future. 
Ok, so you are voting for vectorization flag per index. And if set, it 
applies to all indexed fields (tokenized and keyword). This could work. 
Right now, I have it on a per-field basis (trying to change your mind 
after a field is first used in any document causes an exception, just as 
it currently does with the isIndexed flag). Like you said, this is only 
the initial version. Let's see what other ideas come up.
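
A minimal sketch of the per-field behavior described above: the flag is fixed the first time a field is seen, and a later attempt to flip it throws. The FieldFlags class, its method names, and the exception type are hypothetical and are not Lucene's actual code; modern Java collections are assumed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of per-field term vector flags (not a Lucene class).
class FieldFlags {
    private final Map<String, Boolean> vectorized = new HashMap<>();

    // Records the flag on first use; rejects a conflicting value afterwards.
    void register(String field, boolean storeTermVector) {
        Boolean previous = vectorized.putIfAbsent(field, storeTermVector);
        if (previous != null && previous != storeTermVector) {
            throw new IllegalStateException(
                "term vector flag for field '" + field + "' cannot change after first use");
        }
    }

    // Fields that were never registered count as not vectorized.
    boolean isVectorized(String field) {
        return Boolean.TRUE.equals(vectorized.get(field));
    }

    public static void main(String[] args) {
        FieldFlags flags = new FieldFlags();
        flags.register("body", true);   // first use fixes the flag
        flags.register("body", true);   // same value again is fine
        try {
            flags.register("body", false);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A per-index flag would amount to passing the same value for every indexed field.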

>You are grossly mistaken here. As with all hashing
>algorithms, hashCode() has never guaranteed that it is
>unique. Even Object's hashCode method does not
>guarantee that different objects won't have the same
>hashcode. I quote from JDK1.3.1's javadoc: "As much as
>is reasonably practical, the hashCode method defined
>by class Object does return distinct integers for
>distinct objects. " 
>ie, it is most likely different objects will have
>different hashcodes although on very rare occasions
>they might be the same. The same goes for Strings too.
Right, except for Strings. Here's a quote from the String javadoc:
Returns a hashcode for this string. The hashcode for a String object is 
computed as 
 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
  using int arithmetic, where s[i] is the ith character of the string, n 
is the length of the string, and ^ indicates exponentiation. (The hash 
value of the empty string is zero.)

Regardless, it is even less unique than I thought, therefore I don't see
how we could use it for term ids. Am I missing something?
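
To make the non-uniqueness concrete, here is a small sketch that re-implements the formula quoted from the String javadoc and shows two different strings with the same hash. The class and method names are made up for illustration; "Aa" and "BB" are a well-known colliding pair.

```java
// Illustrative only: StringHashDemo and javadocHash are hypothetical names.
class StringHashDemo {
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated with
    // Horner's rule in int arithmetic (overflow wraps around).
    static int javadocHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // The hand-written formula agrees with String.hashCode().
        System.out.println(javadocHash("Aa") == "Aa".hashCode()); // prints true
        // Two distinct strings, one hash value: 65*31 + 97 == 66*31 + 66.
        System.out.println("Aa".hashCode() + " " + "BB".hashCode()); // prints 2112 2112
    }
}
```

So a hash value is deterministic for a given string, but it cannot serve as an exact term id without some way to resolve collisions.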

>The issue here is do we need 100% accuracy in
>TermFreqVectors? Will 99.999% accuracy be acceptable?
>(Note: Please don't quote me on that 99.999% figure, I
>only plucked it out from thin air as an example)
Well, my app needs them to be exact. It never occurred to me that term
ids could be non-unique and still be useful. Lucky that I'm the one
building it! :)

>Many text processing algorithms that uses term vectors
>don't require 100% accuracy. Sometimes 100% accuracy
>isn't even desired! For instance, some text clustering
>algorithms even intentionally map several words to the
>same Term ID as a way to reduce term vector sizes and
>also to improve accuracy of clustering results. 
Interesting! Live and learn! :)
Meaning that this goes beyond what Lucene does with stemming? For
example, two absolutely unrelated words (like "cat" and "semaphore") 
might get mapped to the same id? I suppose statistically this might 
still work out to a pretty good clustering.

If one really wanted to do this, I think it would be possible to just use
the upper N bits of the term id that Lucene will carry and achieve the same
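
A rough sketch of the "upper N bits" idea, assuming term ids are 32-bit ints spread over the whole int range (with small sequential ids the upper bits would all be zero and everything would land in one bucket). TermIdBuckets and bucket are hypothetical names, not part of Lucene.

```java
// Hypothetical helper: collapses exact term ids into coarser bucket ids.
class TermIdBuckets {
    // Keeps only the upper `bits` bits of a 32-bit term id, so ids that
    // differ only in the discarded lower bits map to the same bucket.
    static int bucket(int termId, int bits) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be between 1 and 32");
        }
        return termId >>> (32 - bits); // unsigned shift, no sign extension
    }

    public static void main(String[] args) {
        // With 8 bits kept, these two ids share bucket 0x12.
        System.out.println(bucket(0x12345678, 8) == bucket(0x12FFFFFF, 8)); // prints true
        // A different top byte lands in a different bucket.
        System.out.println(bucket(0x13000000, 8) == bucket(0x12FFFFFF, 8)); // prints false
    }
}
```

An app that needs exact vectors would keep all 32 bits, in which case bucket returns the id unchanged.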
