couchdb-dev mailing list archives

From "Damien Katz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-738) more efficient DB compaction (fewer seeks)
Date Thu, 13 May 2010 17:52:42 GMT

[ https://issues.apache.org/jira/browse/COUCHDB-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867197#action_12867197 ]

Damien Katz commented on COUCHDB-738:
-------------------------------------

I've been thinking about this issue, and I think storing the whole rev tree in the by_seq
index is a bad idea.

I'm also thinking of ways to make the compaction faster.

Store the full_doc_info outside the by_id btree; in the btree, store the doc_info instead (with its
pointers to the main doc and conflicts), plus a pointer to the full_doc_info. On reads,
this avoids the overhead of loading the full_doc_info just to get the main doc. Updates
and replication reads (those that request the rev_info) will have to load the full_doc_info
with an extra read, but it's unlikely to be an extra disk IO since it will be close to the most
recent doc revision.

The by_seq index will also have the doc_info and a pointer to the full_doc_info.
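A minimal sketch of the proposed entry layout, in Python for illustration (the record and field names are hypothetical stand-ins, not the actual Erlang records):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocInfo:
    # Lightweight entry stored in BOTH the by_id and by_seq btrees.
    id: str
    seq: int
    body_ptr: int                   # file offset of the main doc revision
    conflict_ptrs: List[int] = field(default_factory=list)
    full_doc_info_ptr: int = 0      # file offset of the full rev tree

# A plain document read needs only body_ptr. An update, or a
# replication read that wants the rev tree, follows
# full_doc_info_ptr with one extra pread() -- usually cheap,
# since the rev tree sits near the latest revision on disk.
```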

Then on compaction, scan the by_seq index, copying the full_doc_infos, the documents,
and the attachments into a new file. Each full_doc_info should be linked to the next with an offset
to the next one's file position.

Then scan the newly written full_doc_infos, converting them to doc_infos with pointers to the
full_doc_infos, and write them out consecutively at the end of the file.

Then sort just this portion of the file, on disk, by the id in the doc_info. This is the most
expensive part of the compaction, but sorting things on disk is a common problem with lots of
highly optimized open-source libraries available.

Then convert this id-sorted portion of the file to btree leaf nodes. Then rescan the leaf nodes
and build up the inner nodes, recursing until you have a single root node left. This is now your
by_id index.

Then rescan the full_doc_infos, and write out to the end of the file the doc_infos with pointers
back to the full_doc_infos. These are already sorted by seq.

Then convert this seq-sorted portion of the file to btree leaf nodes. Then rescan the leaf nodes
and build up the inner nodes, recursing until you have a single root node left. This is now your
by_seq index.

You now have a fully compacted database with no wasted btree nodes. I think this will be a lot
faster. With the exception of the by_id sorting phase, this eliminates random disk seeks.

> more efficient DB compaction (fewer seeks)
> ------------------------------------------
>
>                 Key: COUCHDB-738
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-738
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>    Affects Versions: 0.9.2, 0.10.1, 0.11
>            Reporter: Adam Kocoloski
>            Assignee: Adam Kocoloski
>             Fix For: 1.1
>
>         Attachments: 738-efficient-compaction-v1.patch, 738-efficient-compaction-v2.patch
>
>
> CouchDB's database compaction algorithm walks the by_seq btree, then does a lookup in
the by_id btree for every document in the database.  It does this because the #full_doc_info{}
record with the full revision tree is only stored in the by_id tree.  I'm proposing instead
to store duplicate copies of #full_doc_info{} in both trees, and to have the compactor use
the by_seq tree exclusively.  The net effect is significantly fewer calls to pread(), and
a compaction IO pattern where reads tend to be clustered close to each other in the file.
> If the by_id tree is fully cached, or if the id tree nodes are located near the seq tree
nodes, the performance improvement is small but noticeable (~10% in some simple tests).  On
the other hand, in the worst-case scenario of randomly-generated docids and a database much
larger than main memory the improvement is huge.  Joe Williams did some simple benchmarks
with a 50k document, 600 MB database on a 256MB VPS.  The compaction time for that DB dropped
from 15m to 2m20s, so more than 6x faster.
> Storing the #full_doc_info{} in the seq tree also allows for some similar optimizations
in the replicator.
> This patch might have downsides when documents have a large number of edits.  These include
an increase in the size of the database and slower view indexing.  I expect both to be small
effects.
> The patch can be applied directly to trunk@934272.  Existing DBs are still readable,
new updates will be written in the new format, and databases can be fully upgraded by compacting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

