From: CGS
Date: Tue, 27 Dec 2011 10:55:24 +0100
To: user@couchdb.apache.org
Subject: Re: Database size seems off even after compaction runs.

I understand your confusion about the documentation, but I can say that this behavior is documented. If you read the official documentation carefully, you will see the bank example, in which an unsuccessful transaction is not erased but updated. In the same way, deleting a document is just an update of the old document (i.e., it gets a new revision that is marked as deleted). In other words, "nothing is lost, everything is transformed."

There is no reason to be concerned about security here: if you issue an HTTP GET on the document, CouchDB returns JSON of the form {"error":"not_found","reason":"deleted"} (a non-existing document is reported with its own reason, "document does not exist" or so; I don't remember the exact message now, but I know it is quite meaningful). Even if somebody breaks into your server and obtains the file or the admin password, after compaction the previous versions of the document are no longer there, so no data can be extracted from them.

Nevertheless, as was said here, there are a few distinct cases:

1. Using the HTTP DELETE method. This ensures that the minimum amount of data is written to the hard disk.
2. Using "_deleted":true in combination with HTTP PUT/POST. If no other data are added to the document in the request, this has the same effect as the first option.
3. Emptying the document. This reduces the document size even more, but it does not let you reuse the document unless you provide its correct revision (with the other options, no revision is required).

My point in enumerating these options concerns their usage. If you can afford one HTTP request at a time, then DELETE is probably the best option. In many cases, though, that is a luxury you cannot afford because of the hard disk write speed limitation. In most cases you will want to use bulk operations, which means buffering your data, and at that point option 1 is no longer available. As you can see, each option has its own advantages, but also disadvantages and limitations. But that is another story.

This behavior was chosen for two major advantages:

1. History. If you delete a document that you need later on, you can easily undo the deletion by reverting the document to its previous revision (provided that no compaction was triggered between the two actions).
2. Hard disk write speed optimization. If deletion simply removed the pointer to the document, then reusing the name later would require triggering a compaction first to avoid a document name conflict, and that is a much slower process than just updating a document.

The only way to delete a document completely is to re-create the physical file containing the database. But if that is more annoying than a few extra bytes per document, then leave the "tombstone" there.
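To make the bulk variant of option 2 concrete, here is a minimal sketch of the request body one could POST to a database's _bulk_docs endpoint so that a buffer of documents is marked as deleted in a single request. The helper name build_bulk_delete and the sample IDs and revisions are hypothetical, chosen only for illustration; sending only "_id", "_rev" and "_deleted" keeps each tombstone as small as possible.

```python
import json


def build_bulk_delete(docs):
    """Build the body for POST /{db}/_bulk_docs that deletes each document.

    `docs` is an iterable of (doc_id, current_rev) pairs. Each entry carries
    only _id, _rev and _deleted, so no extra data is stored in the tombstone.
    (Hypothetical helper, not part of any CouchDB client library.)
    """
    return {
        "docs": [
            {"_id": doc_id, "_rev": rev, "_deleted": True}
            for doc_id, rev in docs
        ]
    }


if __name__ == "__main__":
    # Hypothetical buffered deletions collected by the application.
    buffered = [("invoice-001", "1-abc"), ("invoice-002", "3-def")]
    print(json.dumps(build_bulk_delete(buffered), indent=2))
```

The resulting JSON would be sent with Content-Type: application/json; the tombstones remain in the database file afterwards, as described above.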
If neither of these options is convenient for your project, then CouchDB may not be what you need (I am not discouraging anyone from using CouchDB, only stating that there is no gain without pain, and using CouchDB is quite a gain in my opinion). Nevertheless, keep in mind that there is a way to reclaim the physical space occupied by deleted documents.

Two more things I would like to clarify from my previous messages:

1. "Making the document unavailable" meant that an HTTP GET on a deleted document returns an "error" response.
2. When I described my design for the given case, I stated that it has limitations (e.g., the race condition, and how often you can trigger such a switch). One can invent another design based on the information (as I said before) that a document can be deleted completely only by re-creating the database while filtering out the deleted documents. For example, there are no "crazy storage blowouts" if you use a round-robin over all your databases, just the temporary inconvenience of adding some extra space to your server system (PC, cluster...) while you perform the space-reclaiming procedure.

CGS

On 12/25/2011 01:10 AM, Daniel Bryan wrote:
> I understand if this is necessary for eventual consistency, but shouldn't
> this be better-documented? I generally expected that if I delete sensitive
> or unwanted data, or that a user requests that their personal or private
> data be deleted, it'll be deleted in a way that's more solid than basically
> hiding it. Sure, CouchDB won't let you get at that document, but it's
> certainly still there on the disk, and presumably detectable if you
> inspected the data structure that holds individual documents. Not a very
> good situation vis a vis security. I know that normal unix "deletion"
> leaves files technically on disk, but there are ways to allow for that and
> prevent it from being an issue.
>
> Even setting data security aside, I've been using CouchDB as a kind of
> staging environment for large amounts of data which should ultimately be
> elsewhere (different flavours relational databases, databases belonging to
> different organisations, etc.) because it's really easy to implement as an
> interface and let people just throw whatever they want into it with a POST.
> It's really the perfect tool for that, but pretty soon there'll be tens of
> gigabytes a day of data flowing through the system, and most of it just
> needs to be indexed for a while before our scheduled scripts pull it all
> out, shove it elsewhere and delete it. In this use case, if I'm
> understanding this correctly, we'll get crazy storage blowouts unless we
> implement a bunch of hacks to switch to new databases after performing
> deletions (as well as scripts that make our HTTP reverse proxy
> transparently and intelligently route data to the new database - absolutely
> not a trivial task in any complex system with many moving parts).
>
> But you know, this all comes with the territory. If the devs say there's a
> good reason for documents to stick around after deletion, I believe them,
> but I think that's a pretty huge point and I don't know how I've missed it.
>
> What's the way to delete a document if I actually want to really delete the
> data? Changing it to a blank document before deleting, and then compacting?
>
> On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke wrote:
>
>> On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:
>>
>>> 1) How exactly could you make this switch without interrupting service?
>> Replicate database to new db, then atomically switch your proxy or
>> whatever to the new db from the old one.
>> Depending on how long the replication takes, there's a race condition here
>> where changes made to the old db during the replication won't be propagated
>> to the new one; you could either repeat the process incrementally until
>> this doesn't happen, or else put the db into read-only mode while you're
>> doing the copy.
>>
>> This might also be helpful: http://tinyurl.com/89lr3fl
>>
>>> 2) Wouldn't this procedure create the exact same eventual consistency
>>> problems that deleting documents in a db would?
>> No; what's necessary is the revision tree, and the replication will
>> preserve that. You're just losing the contents of the deleted revisions
>> that accidentally got left behind because of the weird way the documents
>> were deleted.
>>
>> --Jens
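
The "replicate to a new db, then switch" procedure discussed in this thread can be sketched as two JSON bodies: a design document holding a replication filter that skips deleted documents, and the request one would POST to /_replicate. This is only an illustration under stated assumptions: the names "cleanup", "live", "staging" and "staging_new" are hypothetical, and the filter variant follows the "re-creating the database filtering out the deleted documents" idea rather than the plain replication Jens describes (which keeps the revision tree, tombstones included).

```python
import json

# JavaScript filter run by CouchDB for each document during replication;
# returning false for tombstones keeps them out of the target database.
FILTER_JS = "function(doc, req) { return !doc._deleted; }"


def build_filter_design_doc(ddoc_name="cleanup", filter_name="live"):
    """Design document to store in the source db (hypothetical names)."""
    return {"_id": "_design/" + ddoc_name, "filters": {filter_name: FILTER_JS}}


def build_replication_request(source, target, filter_ref="cleanup/live"):
    """Body for POST /_replicate; filter_ref has the form 'ddoc/filtername'."""
    return {
        "source": source,
        "target": target,
        "create_target": True,  # ask CouchDB to create the new db if missing
        "filter": filter_ref,
    }


if __name__ == "__main__":
    print(json.dumps(build_filter_design_doc(), indent=2))
    print(json.dumps(build_replication_request("staging", "staging_new"), indent=2))
```

Dropping tombstones this way means the deletions will not propagate to any other replica of the old database, and the race condition mentioned above still applies, so the replication would have to be repeated (or writes paused) before switching the proxy to the new database.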