lucene-dev mailing list archives

From Dave Kor <dave...@yahoo.com>
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 04:37:57 GMT

--- Dmitry Serebrennikov <dmitrys@earthlink.net>
wrote:
> 
> 
> Dave Kor wrote:
> 
> >>*) I'm planning to add another bit:
> >>"storeTermVector" (better name, anyone?), which
> >>will indicate that the field's term vector will
> >>need to be stored.
> >>
> >
> >My counter question is this: What kind of fields
> >should be vectorized? Considering the uses of
> >TermFreqVectors, my first impression is that all
> >indexed fields should be vectorized, the remaining
> >fields that are not indexed should not be touched.
> >If my assertion is true (it may not be), then we
> >won't need the "storeTermFreqVector" bit. Otherwise,
> >I agree that the bit be stored in *.fnm files.
> >
> This is a good argument. On the other hand, there
> are plenty of users who use Lucene now without the
> vectors, so it stands to reason that vectorization
> is optional. Two files are added to the index and
> some time during the indexing is spent on it, so
> vectorization is not free even if you don't use it
> on the query side.
> 
> The way things are working out right now, only
> indexed fields can be vectorized. Does it make sense
> to vectorize keyword fields? If access to the keyword
> value will end up being faster via its vector than
> via the document fields, then yes.
> 
> So there it is. Still no decision, but these are the
> arguments.

Come to think of it, there may be another way to look
at this issue. As far as a TermFreqVector is
concerned, each unindexed field in a document is
equivalent to one single Term ID with a frequency of
1. So it becomes relatively trivial to handle
unindexed fields from both a computational and storage
perspective. 
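
To make the point above concrete, here is a minimal sketch of how both kinds of field could feed one term-frequency structure. It is only an illustration under assumptions: the class FieldVectorSketch and the method termFreqs are hypothetical names, not Lucene code, and whitespace splitting stands in for a real analyzer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code: all names here are made up for illustration.
class FieldVectorSketch {

    // Builds a term -> frequency map for one field value.
    // An indexed field is split on whitespace and each lowercased token is counted.
    // An unindexed field contributes its whole value as one term with frequency 1.
    static Map<String, Integer> termFreqs(String value, boolean indexed) {
        Map<String, Integer> freqs = new TreeMap<>();
        if (!indexed) {
            freqs.put(value, 1);
            return freqs;
        }
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Indexed field: several terms, some repeated.
        System.out.println(termFreqs("the dog saw the cat", true));
        // Unindexed field: the whole value is a single term.
        System.out.println(termFreqs("/docs/report 2001.txt", false));
    }
}
```

The unindexed case needs no tokenizing and stores exactly one entry, which is why its cost is small on both counts.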

So now there are two questions. First, do we allow
developers to simply toggle vectorization on and off?
Second, if they switch vectorization on, should we
allow them to specify which fields to vectorize?

My answer to question 1 is yes. The file layout
should be organized in such a way that enabling
vectorization only results in the creation of new
files containing the term vectors, with no (or very
few) changes to the original index files.

My answer to question 2 is no for now. Since this is
only the initial version, we should not make it overly
complex. However, if there is demand, such a feature
should be implementable in the future.


> >>*) The term vector, as I understand it, is a list
> >>of unique terms that occur in a given field. They
> >>will be stored by term id (in ascending order of
> >>IDs, not terms).
> >>
> >
> >Since we might want TermFreqVectors to operate over
> >several indexes, I thought it would be useful for
> >the Term ID to be equal to (field+term).hashCode().
> >This way, Term IDs are universal across indexes
> >(contents:dog in index A == contents:dog in index B).
> >
> Is hashCode unique? I thought it was only unique for
> objects that did not define it and in that case it is
> equal to object's memory address. String object
> defines it to be a hashing of character values, I
> think, so it's not unique. Great idea though! I've
> been trying to resolve the same problem for a while.
> I think I have an answer that I can make work in the
> timeframe that I have, but it is memory-expensive and
> somewhat computationally expensive too.

You are grossly mistaken here. As with all hashing
algorithms, hashCode() has never been guaranteed to be
unique. Even Object's hashCode method does not
guarantee that different objects won't have the same
hashcode. I quote from JDK 1.3.1's javadoc: "As much as
is reasonably practical, the hashCode method defined
by class Object does return distinct integers for
distinct objects."

I.e., different objects will most likely have
different hashcodes, although on very rare occasions
they might be the same. The same goes for Strings too.
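
As a small illustration of both halves of this argument, the sketch below computes IDs the way the quoted proposal describes, (field+term).hashCode(). The class TermIdSketch and the method termId are hypothetical names used only for this example. It shows that the ID depends only on the strings, so it is stable across indexes, and that two distinct terms can still share an ID.

```java
// Hypothetical sketch of the proposed ID scheme; termId is not a Lucene method.
class TermIdSketch {

    // Term ID as proposed in the thread: hash of the field name concatenated with the term text.
    static int termId(String field, String term) {
        return (field + term).hashCode();
    }

    public static void main(String[] args) {
        // The ID is a pure function of the strings, so it is the same in any index.
        System.out.println(termId("contents", "dog") == termId("contents", "dog"));

        // Distinct terms can collide: "Aa" and "BB" have the same String hash code,
        // and the shared "contents" prefix does not separate them.
        System.out.println(termId("contents", "Aa") == termId("contents", "BB"));

        // Plain concatenation also blurs the field/term boundary:
        // ("ab", "c") and ("a", "bc") both hash the string "abc".
        System.out.println(termId("ab", "c") == termId("a", "bc"));
    }
}
```

All three comparisons are true. The last one is a side effect of concatenating without a separator, which a delimiter between field and term would avoid.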


The issue here is whether we need 100% accuracy in
TermFreqVectors. Will 99.999% accuracy be acceptable?
(Note: Please don't quote me on that 99.999% figure, I
only plucked it out of thin air as an example.)

Many text processing algorithms that use term vectors
don't require 100% accuracy. Sometimes 100% accuracy
isn't even desired! For instance, some text clustering
algorithms intentionally map several words to the
same Term ID as a way to reduce term vector sizes and
also to improve the accuracy of clustering results.
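
One simple way to map several words to the same Term ID on purpose is to fold hash codes into a fixed number of buckets. The sketch below is an assumption-laden illustration of that idea, not the method of any particular clustering algorithm. The names BucketedVectorSketch, bucketId and vector are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: folds terms into a fixed number of IDs so vectors stay small.
class BucketedVectorSketch {

    // Maps a term to one of numBuckets IDs.
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int bucketId(String term, int numBuckets) {
        return Math.floorMod(term.hashCode(), numBuckets);
    }

    // Builds a bucket ID -> frequency vector; terms sharing a bucket add up.
    static Map<Integer, Integer> vector(String[] tokens, int numBuckets) {
        Map<Integer, Integer> freqs = new TreeMap<>();
        for (String token : tokens) {
            freqs.merge(bucketId(token, numBuckets), 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        String[] tokens = {"dog", "cat", "dog", "bird", "fish", "horse"};
        // With 4 buckets the vector can never have more than 4 entries,
        // however many distinct terms the field contains.
        System.out.println(vector(tokens, 4));
    }
}
```

The vector length is bounded by the bucket count, and the frequencies still sum to the number of tokens. The price is that colliding terms can no longer be told apart.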




