lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: TermVector support - first release
Date Sat, 20 Oct 2001 06:18:08 GMT


Doug Cutting wrote:

>Dmitry,
>
>Wow!  This looks great!
>
>I was preparing a response to your questions of last weekend, but it seems
>like you figured out a lot of it on your own.  I've attached that response
>anyway, in case you're still interested.
>
>Once we get 1.2 out the door I'd like to make you a committer (providing
>others approve) so that you can commit these changes yourself.  I'd also
>still like to review them a bit more, but some of that can  happen after
>they are committed.
>
Sounds good.

>
>
>My biggest question is about the field-orientation of this.  I had imagined
>this to be more document oriented, that there would be a single
>TermFreqVector per document, rather than one per field.  That would simplify
>things a bit, and make it a bit more efficient.  Of course, one could always
>construct the full-document freq vector by combining the field vectors, but
>the question is, do folks need the field-specific vectors?
>
Well, I figured that it is easier to combine them than to split them. 
Also, you can always just designate one field as vectorized and that 
would give one vector per document. I like field orientation because 
this can lead to other applications besides the traditional clustering. 
Like the idea of using this for quick field contents access and for use 
in sorting by field. Efficiency-wise, I had a method in there originally 
that would return a combined vector (rather than an array of them). I 
took it out because I didn't need it and there was planty of other work 
I could focus on :). But I think with that method, the file reading code 
can be just as efficient when it knows that you want to combine the data 
even as it is reading it. Besides, terms are inherently field-specific, 
and so two terms with the same text but from different fields would have 
a different number anyway.

>
>
>Overall, Bravo!
>
Thanks. :)  
I learned a lot!

>>From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>>
>
>>Is there any particular reason why the "tokenized" bit is
>>stored in the fdt file, while the "indexed" bit is stored
>>in the fnm?  Why not put both in fnm?
>>
>
>A given field is permitted to be tokenized in some documents
>in an index and not tokenized in others.  
>
Interesting. I missed that.

>However all
>instances of a field in an index must be either indexed or
>not.  [[ Explain why... ]]
>
>>I'm planning to add another bit: "storeTermVector" (better
>>name, anyone?), which will indicate that the field's term
>>vector will need to be stored.
>>
>
>I like the option of storing vectors for all indexed
>fields.  Perhaps IndexWriter's constructor could take
>another optional flag that determines whether it stores
>vectors.  It would be illegal to specify this except when
>the 'create' flag is true.  Does that sound reasonable?
>
Sounds a bit too restrictive, but I don't have a good argument against 
it. I like how this would limit the possibility of mistakes and 
inconsistencies, but I don't like how this applies to all indexed 
fields. Do you really need to store vectors for some keyword fields 
(like a Document ID of some kind for example, or dates. Also fields that 
are used to "hack" the scoring mechanism for document boosing)?

>
>
>>The term vector, as I understand it, is a list of unique
>>terms that occur in a given field. They will be stored by
>>term id (in ascending order of IDs, not terms).
>>
>
>I think term id order and term order should in fact be the
>same.  Terms can be assigned id's based on their position in
>the TermEnum.
>
Yes. That's how it ended up being.

>
>
>In Lucene there are only two index creating operations:
>create an index for a single document, and merge N indexes
>into a single index.  If these can both be done efficiently
>then indexing is efficient.
>
>Creating a vector index for a single document with term
>numbers that correspond to positions in the TermEnum is easy
>enough.  Efficient merging is also possible, using the
>technique I mentioned in a previous message.  So I don't see
>why these shouldn't be used for term ids.
>
There are two kinds of merging. One that is "static" merging, where a 
new segment is created. This one is done. The other is "dynamic" merging 
where SegmentsIndexReader spans multiple segments and dynamically merges 
data from them into a single address range (doc nums, term nums, etc.). 
This is the one that still needs to be addressed. There is also a third 
kind of merge, which is also dynamic but occurs in the MultiSearcher and 
merges across Searchers.

>
>
>>In addition to the terms, I'm planning to 
>>store the frequency of the term (the number of times it occurs in the 
>>field). This, together with the total number of terms in the field, 
>>should be enough to compute the term's weight, right? My application 
>>doesn't need these weights, so I'm not sure what people need in this 
>>regard. Please advise.
>>
>
>IndexReader.docFreq(Term) and IndexReader.maxDoc() are
>sufficient to compute IDF weights, the standard used by Lucene.
>
I was referring to the TermWeightVector that you proposed earlier. If 
that can also be computed as is, so much the better.

>
>
>>In addition to the terms and frequencies, I will also
>>store positions in which these terms occur in the
>>field. Actually, this is already stored (used by the
>>TermPositions functionality), so I will only store
>>pointers into the prx file. This may not be needed for
>>clustering, but I need this for my application. Some of
>>the text processing that we do is based on relative
>>positioning of terms in a document.
>>
>
>Since positions are the most high frequency data, this will
>have a major impact on index size and performance.  I
>suggest that this data be stored at least at the end of the
>vector data for a document, and perhaps even in a separate
>file.  It might probably also be good to have a flag to
>disable this, for folks who just want vectors.
>
It is stored in a separate file after all. I was able to create a single 
segment with pointers into the prx file (so there was no increase in 
index size), but merging (the "static" kind) of this proved difficult if 
not impossible. So they are in a separate file now. As far as the flag - 
I'm all for it. Right now, the data is not being read unless requested 
at query time. However it *is* being written during indexing which costs 
time and disk space.

>>These are the files I'm planning to add to each segment:
>>     "fvx" file - Field Vector Index. Modeled on the fdx file. Has a 
>>fixed length, 8-byte, record per document in a segment. The 8 bytes 
>>store a long pointer into the "fvt" file where the record for this 
>>document begins.
>>
>
>Sounds good.  You can seek based on docId*8.  Easy to write,
>easy to merge.
>
I learn from the best.  :)

>>field_record :
>>      [VInt] - field number, just like in the "fdt" file
>>      [byte] - flags, don't know if we are going to need any, 
>>but seems like we might?
>>
>
>Leave 'em out if you don't have a use.
>
I did.

>>[VInt] - numTerms, count of unique terms in the vector, 
>>number of term records that follow
>>      { term_record, ... } - term records, as many as 
>>specified above, represent unique terms in the field
>>
>>      term_record:
>>      [VInt] - term id increment, restarts from 0 for each field
>>      [VInt] - term frequency in this field, used for weight 
>>calculations and for the count of positions in "prx" file
>>      [VInt] - "prx" pointer increment, restarts from 0 for each field
>>
>
>As mentioned above, I think the "prx" data should at least
>be moved to the end of the document_record.  I am also not
>sure what the increment is relative to.  I assume it is a
>pointer into the "prx" file's data for a term.  The index
>already stores (in the "tis" file) a pointer to the start of
>the term's data in the "prx" file.  Is this relative to
>that?  Since a vector only has one entry per term, there's
>nothing else in this vector that it could be relative to.
>
It's relative to the previous prx pointer (for the preceeding term in 
this document). The pointers are to a different file, so it makes more 
sense because in that file proximities are organized by document first 
and by term second, rather then by term first document second as in the 
prx file.

The final file formats are slightly different but generally the same. 
The maxTerms is the length of the field measured in terms. I thought 
that a term weight for the purposes of the TermWeightVector would be 
something like term_freq / doc_length. This could also be used if one 
wanted to allocate an array for rebuliding the document content from 
positions (probably not too useful). It is stored as an increment to the 
numTerms, so it probably won't take up that much space.

>>A couple of questions on the file formats that I would
>>really like feedback on:
>>
>>Specifically, I'm trying to identify what makes access to
>>the document fields ("fdx" and "fdt" files) slow, and make
>>sure I avoid those problems. From what I can tell, the
>>only thing that makes that access slow is the size of the
>>document data, in which case we have nothing to worry
>>about. Is that right?
>>
>
>Why do you say they're slow?  Because access is discouraged
>from the inner search loop?  Currently the inner search loop
>only needs to read a vInt or two per document scored.  These
>are primarily sequential reads from a file.  If you add to
>that two random access seeks plus reading and constructing a
>document object, then search gets a *lot* slower.  So
>restricting calls to IndexReader.doc() to documents that are
>to be displayed, and not consulting document field values
>during the selection of documents makes search a lot faster.
>
Right. Also, one would have to parse the data obtained from document 
fields. In my experience so far, reading term vector data in the inner 
loop is about two times faster than accessing the same data from the 
document, without even parsing it or doing anything else. I think this 
is mostly because it is stored in a more compacted form and thus IO is 
minimized.

>>Finally, users are likely to access termvectors from a given field 
>>only. This may be a good reason to optimize access to each 
>>field_record in the proposed "fvt" file.
>>
>
>Perhaps.  I had imagined that folks would usually want the term freq
>vector for the whole document, rather than field by field.  If you did
>things this way it would remove the additional redirection.
>
Yes. This is a good point. Removing a file seek would probably help a 
lot. On the other hand, perhaps we can leverage the fact that two files 
can reduce contention over the file handle.


Doug, thanks very much for your comments.
Dmitry.


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message