lucene-dev mailing list archives

From Dave Kor
Subject Re: TermVector retrieval implementation questions
Date Mon, 15 Oct 2001 02:19:36 GMT

> *) I'm planning to add another bit:
> "storeTermVector" (better name, 
> anyone?), which will indicate that the field's term
> vector will need to 
> be stored.

My counter-question is this: what kind of fields
should be vectorized? Considering the uses of
TermFreqVectors, my first impression is that all
indexed fields should be vectorized, and the remaining
fields that are not indexed should not be touched. If
my assertion is true (it may not be), then we won't
need the "storeTermFreqVector" bit. Otherwise, I agree
that the bit should be stored in the *.fnm files.

> *) The term vector, as I understand it, is a list of
> unique terms that 
> occur in a given field. They will be stored by term
> id  (in ascending 
> order of IDs, not terms). 

Since we might want TermFreqVectors to operate over
several indexes, I thought it would be useful for the
Term ID to be equal to (field+term).hashCode(). This
way, Term IDs are universal across indexes
(contents:dog in index A == contents:dog in index B).
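
A minimal sketch of the hashCode idea, in plain Java. The class and
method names (TermIds, termId, termIdWithSeparator) are hypothetical
and not part of Lucene. It relies only on the fact that
String.hashCode() is specified by the Java language, so the same
string yields the same int in every JVM and therefore in every index.

```java
/** Hypothetical helper, not part of Lucene: index-independent term IDs. */
final class TermIds {
    private TermIds() {}

    /** ID exactly as proposed in the message: (field + term).hashCode(). */
    static int termId(String field, String term) {
        return (field + term).hashCode();
    }

    /**
     * Assumed variant: put a separator between field and term so that
     * ("ab", "c") and ("a", "bc") do not concatenate to the same string.
     */
    static int termIdWithSeparator(String field, String term) {
        return (field + ':' + term).hashCode();
    }

    public static void main(String[] args) {
        // The same field and term always map to the same ID,
        // no matter which index the term was read from.
        int idFromIndexA = termId("contents", "dog");
        int idFromIndexB = termId("contents", "dog");
        System.out.println(idFromIndexA == idFromIndexB);
    }
}
```

Two caveats apply to this sketch: different terms can collide on the
same 32-bit hash, and hash codes can be negative, so "ascending order
of IDs" would mean signed integer order.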

> In addition to the terms,
> I'm planning to 
> store the frequency of the term (the number of times
> it occurs in the 
> field). This, together with the total number of
> terms in the field, 
> should be enough to compute the term's weight,
> right? 

As far as I can see, pre-computing and storing such
data could only be useful. Another useful bit of
information that we might want to store is the
determinant of each Document's TermFreqVector. The
determinant is used by many text processing
algorithms, and since it will probably be a long, it
adds only 8 bytes to each indexed document.
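
To make the quoted weighting question concrete, here is a minimal
sketch of one common weight that can be derived from the two stored
numbers: the term's frequency divided by the total number of terms in
the field. Whether this is the weighting the proposal has in mind is
an assumption, and the names (TermWeight, relativeFrequency) are
hypothetical.

```java
/** Hypothetical helper, not part of Lucene: weight of a term within one field. */
final class TermWeight {
    private TermWeight() {}

    /**
     * Relative frequency of a term in a field:
     * occurrences of the term divided by the total number of terms.
     */
    static double relativeFrequency(int termFreq, int totalTermsInField) {
        if (totalTermsInField <= 0) {
            throw new IllegalArgumentException("field must contain at least one term");
        }
        if (termFreq < 0 || termFreq > totalTermsInField) {
            throw new IllegalArgumentException("termFreq out of range");
        }
        return (double) termFreq / totalTermsInField;
    }

    public static void main(String[] args) {
        // A term that occurs 3 times in a field of 12 terms.
        System.out.println(relativeFrequency(3, 12));
    }
}
```

A weight that also reflects how rare the term is across the whole
index (as in tf-idf) would need the term's document frequency in
addition to these two numbers.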

> *) Speaking of the stored fields, someone suggested
> adding binary 
> storage to documents so that serialized objects can
> be stored. From what 
> I can see, it would be pretty easy to define a new
> field type that 
> stores binary data, add a flag into the bits stored
> in fdt file for this 
> field, and then write it out as an array of bytes
> instead of a String. 
> This could be useful for my application as well,
> although currently I 
> have a workaround so this is not required. Any votes
> for or against 
> adding this feature?

Although I could see some advantage in storing binary
data, I don't feel that it adds any value to Lucene's
role of being a search engine. Such binary information
should be stored in an application that is designed
for this, such as a database. If this is really
required, a field can always be added in Lucene to
store a reference to the binary data (e.g., the primary
key value).

> *) Preliminary file structures. These are the files
> I'm planning to add 
> to each segment:

I'd need more time to analyze this. My current
project should end this Wednesday, and I'll be able to
have a look at the file format then. Will reply on
Thursday or Friday.

> *) I don't see any place to apply the trick used in
> the "tii" and "tis" 
> files - namely loading every 128th element into
> memory and using that as 
> an index into a larger file. I don't think this can
> be applied because 
> we are really not "searching" for anything, we just
> do direct access by 
> document id. Am I missing anything?

This *could* be applied if the hashcode idea I
mentioned above is used. 
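
A minimal sketch of how the "every 128th element" trick could work
over a sorted list of term IDs. It assumes the IDs are unique and
stored in ascending order, uses an in-memory array as a stand-in for
the large on-disk file, and all names (SparseIdIndex, find) are
hypothetical rather than taken from Lucene.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Lucene code: sparse in-memory index over sorted IDs. */
final class SparseIdIndex {
    private final int[] allIds;  // stand-in for the full on-disk list, sorted ascending
    private final int[] sample;  // every interval-th ID, the only part kept in memory
    private final int interval;

    SparseIdIndex(int[] sortedIds, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.allIds = sortedIds.clone();
        this.interval = interval;
        int sampleCount = (sortedIds.length + interval - 1) / interval;
        this.sample = new int[sampleCount];
        for (int i = 0; i < sampleCount; i++) {
            sample[i] = sortedIds[i * interval];
        }
    }

    /** Returns the position of id in the full list, or -1 if it is absent. */
    int find(int id) {
        // Step 1: binary search the small sample for the last entry <= id.
        int s = Arrays.binarySearch(sample, id);
        if (s < 0) {
            s = -s - 2; // insertion point minus one
        }
        if (s < 0) {
            return -1; // id is smaller than every stored ID
        }
        // Step 2: scan at most `interval` entries of the full list.
        int start = s * interval;
        int end = Math.min(start + interval, allIds.length);
        for (int i = start; i < end; i++) {
            if (allIds[i] == id) {
                return i;
            }
            if (allIds[i] > id) {
                break; // passed the spot where id would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] ids = new int[1000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i * 3 - 500; // includes negatives, as hash codes may
        }
        SparseIdIndex index = new SparseIdIndex(ids, 128);
        System.out.println(index.find(ids[537]));
    }
}
```

With an interval of 128, memory use drops to roughly 1/128 of the
full list, at the cost of scanning up to 128 entries per lookup.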
