couchdb-dev mailing list archives

From "Filipe Manana (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-1132) Track used space of database and view index files
Date Wed, 20 Apr 2011 11:09:05 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022078#comment-13022078
] 

Filipe Manana commented on COUCHDB-1132:
----------------------------------------

I ran a few tests with much larger databases; the results follow.

* 75Mb database, 12 531 documents

Before compaction:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":12531,"doc_del_count":0,"update_seq":12531,"purge_seq":0,
"compact_running":false,"disk_size":77545585,"data_size":35483560,
"instance_start_time":"1303288992990482","disk_format_version":5,"committed_update_seq":12531}

After compaction:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":12531,"doc_del_count":0,"update_seq":12531,"purge_seq":0,
"compact_running":false,"disk_size":41271409,"data_size":35453857,"instance_start_time":"1303288992990482",
"disk_format_version":5,"committed_update_seq":12531}

data size is about 86% of the file size
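
The percentage above can be reproduced from the two fields in the database info response. The following is a minimal Python sketch, not part of the patch; the helper name data_ratio is an assumption made for illustration, and the JSON strings keep only the two relevant fields from the curl output above.

```python
import json


def data_ratio(info_json: str) -> float:
    """Return data_size / disk_size for a database info document.

    Hypothetical helper for illustration; it only reads the two
    fields shown in the curl output above.
    """
    info = json.loads(info_json)
    return info["data_size"] / info["disk_size"]


# Values copied from the first test database, before and after compaction.
before = '{"disk_size": 77545585, "data_size": 35483560}'
after = '{"disk_size": 41271409, "data_size": 35453857}'

print(f"before compaction: {data_ratio(before):.0%}")
print(f"after compaction:  {data_ratio(after):.0%}")
```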



* 1.8Gb database, 262 531 documents

Before compaction:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":262531,"doc_del_count":0,"update_seq":262531,"purge_seq":0,
"compact_running":false,"disk_size":1962610801,"data_size":744835248,"instance_start_time":"1303289719133306",
"disk_format_version":5,"committed_update_seq":262531}

After compaction:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":262531,"doc_del_count":0,"update_seq":262531,"purge_seq":0,
"compact_running":false,"disk_size":1139642481,"data_size":744292081,"instance_start_time":"1303289719133306",
"disk_format_version":5,"committed_update_seq":262531}

data size is about 65% of the file size

After changing compaction checkpoint frequency from 10 000 to 10 000 000 000
and compacting the database again:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":262531,"doc_del_count":0,"update_seq":262531,"purge_seq":0,
"compact_running":false,"disk_size":1139601521,"data_size":744292168,"instance_start_time":"1303296830183399",
"disk_format_version":5,"committed_update_seq":262531}

data size is still about 65% of the file size

After changing the compaction batch size from 1 000 to 100 000 and compacting again:

$ curl http://localhost:5985/testdb1
{"db_name":"testdb1","doc_count":262531,"doc_del_count":0,"update_seq":262531,"purge_seq":0,
"compact_running":false,"disk_size":776962161,"data_size":744307523,"instance_start_time":"1303297206958149",
"disk_format_version":5,"committed_update_seq":262531}

data size is now about 96% of the file size


* 16Gb database, 3 341 491 documents

(No data size before compaction, since the database was created with trunk CouchDB)

After compaction:

$ curl http://localhost:5985/large_1_20
{"db_name":"large_1_20","doc_count":3341491,"doc_del_count":0,"update_seq":3341491,"purge_seq":0,
"compact_running":false,"disk_size":16318431354,"data_size":15069943338,"instance_start_time":"1303296570043058",
"disk_format_version":5,"committed_update_seq":3341491}

data size is about 92% of the file size


This makes me think we should make the compaction checkpoint frequency and batch size configurable
in the .ini (especially the batch size), since this can significantly reduce the final file
size as well as make the compaction a bit faster.
Is anyone -1 on doing this?

For view indexes, the batch size is controlled by the size of the work queues, but I believe
Adam and/or Paul were thinking about making this configurable.

> Track used space of database and view index files
> -------------------------------------------------
>
>                 Key: COUCHDB-1132
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1132
>             Project: CouchDB
>          Issue Type: New Feature
>          Components: Database Core
>            Reporter: Filipe Manana
>             Fix For: 1.2
>
>
> Currently users have no reliable way to know if a database or view index compaction is
needed.
> Adam, Robert Dionne and I have been working on a feature to compute and expose
the current data size (in bytes) of databases and view indexes. These computations are exposed
as a single field in the database info and view index info URIs.
> Comparing this new value with the disk_size value (the total space in bytes used by the
database or view index file) would allow users to decide whether or not it's worth triggering
a compaction.
> Adam and Robert's work can be found at:
> https://github.com/cloudant/bigcouch/compare/7d1adfa...a9410e6
> Mine can be found at:
> https://github.com/fdmanana/couchdb/compare/file_space
> After chatting with Adam on IRC, the main difference seems to be that their work accounts
only for user data (document bodies + attachments), while mine also accounts for the btree
values (including all meta information, keys, rev trees, etc) and the data added by couch_file
(4 bytes length prefix, md5s, block boundary markers).
> An example:
> $ curl http://localhost:5984/btree_db/_design/test/_info
> {"name":"test","view_index":{"signature":"aba9f066ed7f042f63d245ce0c7d870e","language":"javascript","disk_size":274556,"data_size":270455,"updater_running":false,"compact_running":false,"waiting_commit":false,"waiting_clients":0,"update_seq":1004,"purge_seq":0}}
> $ curl http://localhost:5984/btree_db
> {"db_name":"btree_db","doc_count":1004,"doc_del_count":0,"update_seq":1004,"purge_seq":0,"compact_running":false,"disk_size":6197361,"data_size":6186460,"instance_start_time":"1303231080936421","disk_format_version":5,"committed_update_seq":1004}
> This example was executed just after compacting the test database and view index. The
new field "data_size" has a value very close to the final file size.
> The only things that my branch doesn't include in the data_size computation, for databases,
are the size of the last header, the size of the _security object and the purged revs list. In
practice these are so small and insignificant that adding extra code to account for them doesn't
seem worth it.
> I'm sure we can merge the best from both branches.
> Adam, Robert, thoughts?
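
As a rough illustration of how a client might use the two fields to decide on compaction, here is a minimal Python sketch. It is not taken from either branch: the function names and the 30% threshold are assumptions, and only the disk_size and data_size fields come from the issue description above.

```python
def reclaimable_bytes(info: dict) -> int:
    """Bytes in the file that are not accounted for by data_size."""
    return max(info["disk_size"] - info["data_size"], 0)


def should_compact(info: dict, max_overhead: float = 0.3) -> bool:
    """Return True when the unused fraction of the file exceeds max_overhead.

    max_overhead is a hypothetical threshold chosen for this sketch;
    the issue does not prescribe any particular value.
    """
    disk_size = info["disk_size"]
    if disk_size == 0:
        return False
    return reclaimable_bytes(info) / disk_size > max_overhead


# The 1.8Gb test database before compaction, and the freshly compacted btree_db.
uncompacted = {"disk_size": 1962610801, "data_size": 744835248}
compacted = {"disk_size": 6197361, "data_size": 6186460}

print(should_compact(uncompacted), reclaimable_bytes(uncompacted))
print(should_compact(compacted), reclaimable_bytes(compacted))
```

A ratio-based check like this avoids hard-coding absolute sizes, so the same threshold could be applied to small and large databases alike.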

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
