lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: TermVectorsWriter and DocumentsWriter
Date Fri, 17 Aug 2007 09:51:35 GMT
Hi Grant,

> I am wondering if TermVectorsWriter is still used?  It doesn't seem
> to be, at least not any of its methods (some of the constants still
> are, either that or my IDE is not properly finding method calls or I
> am too bleary-eyed at the moment).  It seems to be replaced by the
> writeVectors() method in DocumentsWriter.  If it is the case that
> TermVectorsWriter is not used anymore, should we remove it?

You're right, TermVectorsWriter is no longer used (except for static
constants) and I agree we should remove it.  It's been replaced by
DocumentsWriter.writeVectors.  I'll open an issue.

I went this route because profiling revealed we were spending a lot of
time in there.  Even so, the time spent in writeVectors is still
surprisingly high.

One thing I have been wondering is whether it really is necessary to
sort the term vectors before writing to the index.  Today it is needed
for backwards compatibility.  But with the new TermVectorMapper, if
an application is going to sort by frequency or just retrieve a
specific subset of terms, then sorting by term text is perhaps not
necessary, and skipping it would speed up indexing.
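
To illustrate the point about sort order, here is a rough standalone
sketch in plain Java.  It does not use TermVectorMapper or any other
Lucene class; TermOrderSketch and its methods are hypothetical names
made up for this example.  It only shows that a consumer that wants
terms ordered by frequency has to re-sort anyway, so it gets the same
result whether or not the terms were first sorted by text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names come from Lucene.
class TermOrderSketch {

    /** Counts term frequencies, keeping terms in first-seen order. */
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** The order produced by sorting on term text. */
    static List<String> sortedByText(Map<String, Integer> counts) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort(String::compareTo);
        return terms;
    }

    /** What a frequency-oriented consumer wants: highest frequency first,
     *  ties broken by term text so the result is deterministic. */
    static List<String> sortedByFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byFreq = Integer.compare(y.getValue(), x.getValue());
            return byFreq != 0 ? byFreq : x.getKey().compareTo(y.getKey());
        });
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            terms.add(e.getKey());
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms(Arrays.asList("b", "a", "b", "c", "a", "b"));
        System.out.println("by text:      " + sortedByText(counts));
        System.out.println("by frequency: " + sortedByFrequency(counts));
    }
}
```

In this toy case the by-text sort is pure overhead for the
frequency-oriented consumer.  Whether the same holds inside Lucene
depends on what else relies on the on-disk order.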

> Also, is there a standalone test for DocumentsWriter or is it just
> through IndexWriter that it is tested?  Is there any way
> DocumentsWriter could be split up so we could test some of these
> individual components better?

Right now it's tested only via IndexWriter; I think it would be
somewhat tricky to directly test only certain methods of

> The reason I am asking is the java-user post "Re: getting term
> offset information for fields with multiple value entiries" got me
> interested.

Hmmm.  Logically, Lucene should concatenate multiple values for a
single field, in the order they were added.  But, it depends on the
analyzer setting start/endOffset for each token.  After each field
value is processed, we set the "base" offset to 1+offsetEnd of the
last token (around line 1292 of DocumentsWriter), and then the next
time we see that field on the same doc we add in that base offset to
each token's start/endOffset (in addPosition).
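
The base-offset bookkeeping described above can be sketched roughly as
follows.  This is a standalone illustration, not the actual
DocumentsWriter code: OffsetBaseSketch, Token, tokenize and indexField
are hypothetical names, the whitespace tokenizer merely stands in for
an analyzer, and the handling of a value that produces no tokens is an
assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these names come from Lucene.
class OffsetBaseSketch {

    /** A token plus its character offsets. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + "[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Hypothetical stand-in for an analyzer: splits on whitespace and
     *  reports offsets relative to the start of this one value. */
    static List<Token> tokenize(String value) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < value.length() && !Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(value.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    /** Applies a running base offset across all values of one field. */
    static List<Token> indexField(List<String> values) {
        List<Token> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = -1;
            for (Token t : tokenize(value)) {
                // Shift the analyzer's per-value offsets by the current base.
                Token shifted = new Token(t.text, base + t.startOffset, base + t.endOffset);
                out.add(shifted);
                lastEnd = shifted.endOffset;
            }
            // As described in the message: base = 1 + end offset of the last token.
            // Assumption: a value that yields no tokens leaves the base unchanged.
            if (lastEnd >= 0) {
                base = lastEnd + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(indexField(Arrays.asList("foo bar", "baz")));
    }
}
```

With the values "foo bar" and "baz", the third token ends up with
startOffset 8 and endOffset 11.  Note that in this sketch the base is
derived from the last token's end offset rather than from the length
of the value, so any trailing text the analyzer does not tokenize is
not counted.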

