lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1120) Use bulk-byte-copy when merging term vectors
Date Mon, 07 Jan 2008 15:23:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556586#action_12556586 ]

Michael Busch commented on LUCENE-1120:
---------------------------------------

{quote}
I think we should commit this for 2.3? It's a sizable gain in merging
{quote}

Hi Mike,

I'm planning to branch the trunk today. Considering the file format changes, it might be a
bit risky to apply this patch at the last minute. I think we should commit this for 2.4.
What do you think?

-Michael

> Use bulk-byte-copy when merging term vectors
> --------------------------------------------
>
>                 Key: LUCENE-1120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1120.patch
>
>
> Indexing all of Wikipedia, with term vectors on, under the YourKit
> profiler, shows that 26% of the time (!!) was spent merging the
> vectors.  This was without offsets & positions, which would make
> matters even worse.
> Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
> fact keep up with the flushing of new segments in this test, and this
> is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
> cores).
> So, just like Robert's idea to merge stored fields with bulk copying
> whenever the field name->number mapping is "congruent" (LUCENE-1043),
> we can do the same with term vectors (see the sketch after this
> description).
> It's a little trickier because the term vectors format doesn't make
> bulk copying easy: it doesn't directly encode the offset into the tvf
> file.
> I worked out a patch that changes the tvx format slightly, by storing
> the absolute position in the tvf file for the start of each document
> into the tvx file, just like it does for tvd now.  This adds an extra
> 8 bytes (long) in the tvx file, per document.
> Then, I removed a vLong (the first "position" stored inside the tvd
> file), which makes tvd contents fully position independent (so you can
> just copy the bytes).
> The net comes to at most 7 bytes per document that has term vectors
> enabled (the new 8-byte long minus the removed vLong, which is at least
> one byte and grows with index size, so the overhead shrinks for larger
> indices), but I think this small increase in index size is acceptable
> for the gains in indexing performance?
> With this change, the time spent merging term vectors dropped from 26%
> to 3%.  Of course, this only applies if your documents are "regular".
> I think in the future we could have Lucene try hard to assign the same
> field number for a given field name, if it had been seen before in the
> index...
> Merging terms now dominates the merge cost (~20% of the overall time
> building the Wikipedia index).
> I also beefed up the TestBackwardsCompatibility unit test: it now tests
> a non-CFS and a CFS index for the 1.9, 2.0, 2.1, and 2.2 index formats,
> and I added some term vector fields to those indices.
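
For illustration, here is a minimal Java sketch of the kind of bulk-copy merge described
above. It is hypothetical, not the code in LUCENE-1120.patch: the class and method names
(TermVectorBulkCopySketch, copySegment, copyBytes) are made up, and it assumes the patched
tvx layout of two absolute longs per document (a tvd pointer and a tvf pointer), files with
no header, and a field name->number mapping that is congruent between the input segment and
the merged segment.

{code}
// Rough, hypothetical sketch only -- not the code from LUCENE-1120 itself.
// Assumes: tvx stores two absolute longs per document (a pointer into tvd
// and a pointer into tvf), the files carry no header, and the input
// segment's field numbering matches the merged segment, so its encoded
// bytes can be reused verbatim.
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class TermVectorBulkCopySketch {

  void copySegment(IndexInput tvx, IndexInput tvd, IndexInput tvf,
                   IndexOutput mergedTvx, IndexOutput mergedTvd,
                   IndexOutput mergedTvf, int maxDoc) throws IOException {

    // Bytes already written to the merged files; the copied segment's data
    // starts at these offsets, so every pointer is shifted by them.
    final long tvdBase = mergedTvd.getFilePointer();
    final long tvfBase = mergedTvf.getFilePointer();

    // Rewrite the per-document pointers.  Because both pointers are
    // absolute, no decoding of tvd/tvf is needed -- just add the bases.
    for (int doc = 0; doc < maxDoc; doc++) {
      mergedTvx.writeLong(tvdBase + tvx.readLong());   // pointer into tvd
      mergedTvx.writeLong(tvfBase + tvx.readLong());   // pointer into tvf
    }

    // tvd no longer embeds a position into tvf (the removed vLong), so both
    // files are position independent and can be copied as raw bytes.
    tvd.seek(0);
    copyBytes(tvd, mergedTvd, tvd.length());
    tvf.seek(0);
    copyBytes(tvf, mergedTvf, tvf.length());
  }

  // Simple buffered raw copy between Directory streams.
  private void copyBytes(IndexInput in, IndexOutput out, long numBytes)
      throws IOException {
    final byte[] buffer = new byte[4096];
    while (numBytes > 0) {
      final int chunk = (int) Math.min(buffer.length, numBytes);
      in.readBytes(buffer, 0, chunk);
      out.writeBytes(buffer, chunk);
      numBytes -= chunk;
    }
  }
}
{code}

The point of the format change shows up in the loop: because both per-document pointers are
absolute, they can be rewritten by simply adding the merged files' current lengths, and tvd
and tvf can then be copied as raw bytes instead of being decoded and re-encoded field by
field.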

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


