incubator-couchdb-user mailing list archives

From: Filipe David Manana <fdman...@apache.org>
Subject: Re: Compaction
Date: Sat, 10 Mar 2012 18:05:51 GMT
On Thu, Mar 8, 2012 at 9:39 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>
> There's a second set of two patches in BigCouch that I wrote to
> address this specifically. The first patch changes the compactor to
> use a temporary file for the id btree. Then just before compaction
> finishes, this tree is streamed back into the .compact file (in sorted
> order so that internal garbage is minimized). This helps tremendously
> for databases with random document ids (sorted ids are already
> ~optimal for this scheme). The second patch in the set uses an
> external merge sort on the temporary file which helps speed up the
> compaction.
>
> Depending on the dataset, these improvements can bring massive gains
> in post-compaction data size as well as in the time required for
> compaction. I plan on pulling these back into CouchDB in the coming
> months as we work on merging BigCouch back into CouchDB, so hopefully
> by the end of summer they'll be in master for everyone to enjoy.
>
> As to views, they don't really require these improvements because
> their indexes are always streamed in sorted order, so it's both fast
> and close-ish to optimal. Although somewhere I had a patch that
> changed the index builds to be actually optimal, based on ideas from
> Filipe, but as I recall it wasn't a super huge win, so I didn't
> actually commit it.

Yes, about half a year ago I wrote some code to build btrees bottom-up
into a new file while folding them from the source file (see [1]).
This ensures the final btree has 0% fragmentation, besides speeding up
the compaction process (not always faster, but at least it's never
slower than the old approach).
It's been used in Couchbase for view compaction since then and has
been working perfectly fine.
I haven't adapted it to CouchDB's code yet.
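
In case it helps, here is roughly the shape of the bottom-up build (a
minimal sketch, not the actual code from [1]: the module name, the
WriteNode callback and the fan-out are made up, and a real
implementation streams items from the fold and flushes nodes as they
fill instead of materializing whole levels in memory):

-module(bottom_up).
-export([build/2]).

-define(FANOUT, 100).  %% illustrative node capacity

%% Build a btree bottom-up from key/value pairs that are already in
%% sorted order. WriteNode(Type, Items) must append a node (kv for
%% leaves, kp for inner nodes) to the output file and return its
%% pointer. Returns the pointer to the root node.
build([], _WriteNode) ->
    nil;  %% empty tree
build(SortedKVs, WriteNode) ->
    build_up(write_level(kv, SortedKVs, WriteNode), WriteNode).

%% Collapse levels until a single root remains.
build_up([{_LastKey, RootPtr}], _WriteNode) ->
    RootPtr;
build_up(Children, WriteNode) ->
    build_up(write_level(kp, Children, WriteNode), WriteNode).

%% Cut the sorted items into full chunks, write each chunk as one
%% node, and hand {LastKeyOfNode, NodePtr} pairs to the level above.
write_level(Type, Items, WriteNode) ->
    [{element(1, lists:last(Chunk)), WriteNode(Type, Chunk)}
     || Chunk <- chunks(Items, ?FANOUT)].

chunks([], _N) -> [];
chunks(Items, N) when length(Items) =< N -> [Items];
chunks(Items, N) ->
    {Chunk, Rest} = lists:split(N, Items),
    [Chunk | chunks(Rest, N)].

Because the input arrives in sorted order, every node is written
exactly once and completely full, which is where the 0% fragmentation
comes from.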

As for databases, we use it together with a temporary file and
external disk sorting (Erlang's file_sorter module) as well (see [2]).
It may be exactly the same approach you mentioned; however, our file
format is very different from CouchDB's. Besides guaranteeing 0%
fragmentation, it's also much faster for the random-IDs case.
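
To make the file_sorter part concrete, a minimal sketch of the idea
(the module, file names and {DocId, Ptr} pair layout are made up; it
assumes file_sorter's default binary_term on-disk format, where each
term_to_binary blob is preceded by its size in 4 big-endian bytes):

-module(ext_sort).
-export([demo/0]).

demo() ->
    Unsorted = "ids.unsorted",
    Sorted = "ids.sorted",
    %% Pretend these are id/pointer pairs gathered during compaction,
    %% in update order (i.e. effectively random by doc id).
    Pairs = [{<<"doc9">>, 1}, {<<"doc1">>, 2}, {<<"doc5">>, 3}],
    ok = write_terms(Unsorted, Pairs),
    %% keysort/3 sorts by the first tuple element (the doc id),
    %% merging temporary runs on disk instead of holding data in RAM.
    ok = file_sorter:keysort(1, [Unsorted], Sorted),
    read_terms(Sorted).

%% Write terms in the size-prefixed term_to_binary layout that
%% file_sorter's binary_term format expects (assumption: this matches
%% the stdlib's documented on-disk layout).
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):32>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).

read_terms(File) ->
    {ok, Bin} = file:read_file(File),
    parse(Bin).

parse(<<Size:32, Bin:Size/binary, Rest/binary>>) ->
    [binary_to_term(Bin) | parse(Rest)];
parse(<<>>) ->
    [].

Since the pairs come out sorted by doc id, they can be streamed
straight into a bottom-up build like the one above, which is what
makes the random-ID case so much faster.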

[1] - https://github.com/fdmanana/couchdb/commit/45a2956e0534c853d58169d7fd2cea23b3978c03

[2] - https://github.com/couchbase/couchdb/commit/f4f62ac6



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
