couchdb-user mailing list archives

From CGS <cgsmcml...@gmail.com>
Subject Re: Database size seems off even after compaction runs.
Date Tue, 27 Dec 2011 09:55:24 GMT
I understand your confusion about the documentation, but this behavior 
is documented. If you read the official documentation carefully, you 
will see the bank example in which, if a transaction does not succeed, 
it is not erased but updated. The same applies here: deleting a document 
is just an update of the old document (i.e., it gets a new revision that 
is marked as deleted). In other words, "nothing is lost, everything is 
transformed." There is no reason to be concerned about security here, 
because if you use the HTTP GET method on the document, CouchDB will 
return JSON of the form {"error":"not_found","reason":"deleted"} 
(compared with a non-existent document, which is reported with its own 
reason, "document does not exist" or so; I don't remember the exact 
message now, but I know it's quite meaningful). Even if somebody breaks 
into your server and obtains the file or the admin password, after 
compaction the previous versions of the document are no longer there, 
so no data can be extracted from them.
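
To make the difference between the two GET responses concrete, here is a 
minimal Python sketch that classifies a response from its status code 
and body. The function name classify_lookup is hypothetical, and the 
"missing" reason for a document that never existed is an assumption on 
my part; only the "deleted" form is quoted in this message.

```python
import json


def classify_lookup(status_code, body_text):
    """Classify the result of GET /db/docid from its status and JSON body.

    Returns "found", "deleted", "missing", or "other".
    """
    if status_code == 200:
        return "found"
    try:
        body = json.loads(body_text)
    except ValueError:
        # Body was not valid JSON, so nothing more can be said about it.
        return "other"
    if status_code == 404 and isinstance(body, dict) and body.get("error") == "not_found":
        reason = body.get("reason")
        # "deleted" is the reason quoted in this thread for a tombstone.
        if reason == "deleted":
            return "deleted"
        # Assumption: a never-created document is reported as "missing".
        if reason == "missing":
            return "missing"
    return "other"
```

Either way the client gets a 404, so a deleted document is no more 
readable over HTTP than one that was never created.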

Nevertheless, as was said here, there are a few distinct cases:
1. Using the HTTP DELETE method. That ensures the minimum amount of data 
is written to the hard disk.
2. Using "_deleted":true in combination with HTTP PUT/POST. If no other 
data is added to the document when sending the request, it has the 
same effect as the first option.
3. Emptying the document. This will reduce the document size even more, 
but it will not allow you to reuse the document unless you provide the 
correct revision of the document (with the other options, no revision is 
required to reuse it).

My point in enumerating these options is related to their usage. If you 
can afford one HTTP request at a time, then using DELETE is probably 
the best option. But in many cases that is a luxury you cannot afford 
because of the hard disk write speed limitation. In most cases you will 
want to use bulk operations, which means buffering your data. At that 
point, option 1 is no longer available.
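
As a rough illustration of the buffered case, the Python sketch below 
builds the JSON body for a single POST to the database's _bulk_docs 
endpoint, using option 2 ("_deleted":true with no other data). The 
function name build_bulk_delete and the sample IDs and revisions are 
hypothetical; each pair is assumed to hold a document ID and its current 
revision.

```python
import json


def build_bulk_delete(buffered):
    """Build the body for POST /db/_bulk_docs from (doc_id, rev) pairs.

    Each entry carries only _id, _rev and _deleted, so the stored
    tombstone contains no leftover document data.
    """
    docs = [
        {"_id": doc_id, "_rev": rev, "_deleted": True}
        for doc_id, rev in buffered
    ]
    return json.dumps({"docs": docs})


# Hypothetical buffer of deletions collected before one bulk request.
body = build_bulk_delete([("invoice-1", "2-abc"), ("invoice-2", "1-def")])
```

The resulting string would be sent with Content-Type application/json, 
so many deletions cost one request instead of one DELETE each.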

As you can see, each option has its own advantages, but also its 
disadvantages and limitations. But that is another story.

Choosing this behavior has two major advantages:
1. History. If you delete a document that you need later on, the 
deletion can easily be undone by reverting the document to its previous 
revision (provided that no compaction was triggered between the two 
actions).
2. Hard disk write speed optimization. If you delete a document and 
want to reuse its name later, and the pointer to the document had simply 
been removed instead, you would have to trigger a compaction to avoid a 
document name conflict. That is a much slower process than just updating 
a document.

The only way to delete a document completely is to re-create the 
physical file containing the database. But if this is more annoying than 
a few extra bytes per document, then leave the "tombstone" there. If 
neither of these options is convenient for your project, 
then CouchDB may not be what you need (I am not discouraging people from 
using CouchDB, only stating the fact that there is no gain without 
pain, and using CouchDB is quite a gain in my opinion). Nevertheless, 
keep in mind that there is a way to reclaim the physical space taken up 
by the deleted documents.

And two more things I would like to clarify from my previous messages:
1. "making the document unavailable" meant that an HTTP GET will return 
an "error" when trying to access a deleted document;
2. when I was speaking about my design for the given case, I stated that 
the specified design has limitations (e.g., the race condition and 
how often you can trigger such a switch), so one can invent another 
design based on the information (as I said before) that a document can 
be deleted completely only by re-creating the database while 
filtering out the deleted documents (e.g., no "crazy storage blowouts" 
if you use a round-robin over all your databases, just the temporary 
inconvenience of adding some extra space to your server system, be it a 
PC or a cluster, while you perform the space reclaiming procedure).
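
One possible way to re-create a database while filtering out the deleted 
documents is a filtered replication into a fresh database. The Python 
sketch below only builds the two JSON documents involved and sends 
nothing. The design document name "cleanup", the filter name 
"no_deleted", the function name build_cleanup_requests and the database 
names are all hypothetical, and this is one assumed approach, not 
necessarily the design discussed earlier in the thread.

```python
def build_cleanup_requests(source_db, target_db):
    """Return (design_doc, replicate_body) for a tombstone-free copy.

    design_doc is meant to be stored in the source database;
    replicate_body is meant to be POSTed to the server's /_replicate.
    """
    design_doc = {
        "_id": "_design/cleanup",
        "filters": {
            # JavaScript filter: let through only documents not marked deleted.
            "no_deleted": "function(doc, req) { return !doc._deleted; }"
        },
    }
    replicate_body = {
        "source": source_db,
        "target": target_db,
        "create_target": True,           # create the new database if absent
        "filter": "cleanup/no_deleted",  # "<design doc name>/<filter name>"
    }
    return design_doc, replicate_body


# Hypothetical database names for illustration.
ddoc, body = build_cleanup_requests("staging", "staging_clean")
```

While the copy runs, both databases exist side by side, which is the 
temporary extra space mentioned above; the old database can be dropped 
once clients have been switched to the new one.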

CGS






On 12/25/2011 01:10 AM, Daniel Bryan wrote:
> I understand if this is necessary for eventual consistency, but shouldn't
> this be better-documented? I generally expected that if I delete sensitive
> or unwanted data, or that a user requests that their personal or private
> data be deleted, it'll be deleted in a way that's more solid than basically
> hiding it. Sure, CouchDB won't let you get at that document, but it's
> certainly still there on the disk, and presumably detectable if you
> inspected the data structure that holds individual documents. Not a very
> good situation vis a vis security. I know that normal unix "deletion"
> leaves files technically on disk, but there are ways to allow for that and
> prevent it from being an issue.
>
> Even setting data security aside, I've been using CouchDB as a kind of
> staging environment for large amounts of data which should ultimately be
> elsewhere (different flavours relational databases, databases belonging to
> different organisations, etc.) because it's really easy to implement as an
> interface and let people just throw whatever they want into it with a POST.
> It's really the perfect tool for that, but pretty soon there'll be tens of
> gigabytes a day of data flowing through the system, and most of it just
> needs to be indexed for a while before our scheduled scripts pull it all
> out, shove it elsewhere and delete it. In this use case, if I'm
> understanding this correctly, we'll get crazy storage blowouts unless we
> implement a bunch of hacks to switch to new databases after performing
> deletions (as well as scripts that make our HTTP reverse proxy
> transparently and intelligently route data to the new database - absolutely
> not a trivial task in any complex system with many moving parts).
>
> But you know, this all comes with the territory. If the devs say there's a
> good reason for documents to stick around after deletion, I believe them,
> but I think that's a pretty huge point and I don't know how I've missed it.
>
> What's the way to delete a document if I actually want to really delete the
> data? Changing it to a blank document before deleting, and then compacting?
>
> On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke<jens@couchbase.com>  wrote:
>
>> On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:
>>
>>> 1) How exactly could you make this switch without interrupting service?
>> Replicate database to new db, then atomically switch your proxy or
>> whatever to the new db from the old one.
>> Depending on how long the replication takes, there’s a race condition here
>> where changes made to the old db during the replication won’t be propagated
>> to the new one; you could either repeat the process incrementally until
>> this doesn’t happen, or else put the db into read-only mode while you’re
>> doing the copy.
>>
>> This might also be helpful: http://tinyurl.com/89lr3fl
>>
>>> 2) Wouldn't this procedure create the exact same eventual consistency
>>> problems that deleting documents in a db would?
>> No; what’s necessary is the revision tree, and the replication will
>> preserve that. You’re just losing the contents of the deleted revisions
>> that accidentally got left behind because of the weird way the documents
>> were deleted.
>>
>> —Jens
>>
>>

