Date: Thu, 12 Feb 2009 11:19:21 -0500
Subject: Re: Couch as a mail store?
From: Paul Davis
To: user@couchdb.apache.org

On Thu, Feb 12, 2009 at 10:51 AM, Kenneth Kalmer wrote:
> On Thu, Feb 12, 2009 at 4:26 PM, Jan Lehnardt wrote:
>
>> On 12 Feb 2009, at 15:13, Kenneth Kalmer wrote:
>>
>>> Hi everyone
>>>
>>> This is my first post, please be gentle as I risk ridicule. I've been
>>> lurking here for several months now, learning from others. Disclaimer: I
>>> have yet to do more with couch than updating and running the tests.
>>
>> No worries, we don't bite (...usually :).
>
> I've noticed :)
>
>>> How would couch fare as a backend for a mail delivery system (in concept)?
>>
>> Two words: Perfect match.
>
> My reason for pursuing the concept.
>
>>> Considering you need high availability and very fast IO. Documents (email
>>> messages) will be created and deleted very often, some almost
>>> instantaneously.
>>>
>>> Couch has some great attributes that make it sound worth exploring
>>> further:
>>>
>>> * Fast lookup of documents
>>> * Awesome replication for business continuity (especially in a low-latency
>>>   environment like GIG-E)
>>> * Scales horizontally
>>> * Ability to pull entire mailbox for user as one result, or at least bundle
>>>   X emails together in one response
>>>
>>> I can't recall seeing any thread on here in recent history discussing high
>>> document deletion rates, which is effectively the case when people pop
>>> their mail.
>>
>> A deletion is effectively a set-deleted-flag operation. Compaction then
>> takes care of getting rid of the file.
>
> So taken from other threads, you're effectively tasked with running
> compaction outside of your peak time. This is a no-brainer if the other
> benefits are within reach.
>
> While I'm here, can the docs still be recovered before compaction? Why I ask
> is that it would be a bonus to be able to access the mails and do some
> statistical reporting before compacting the database, if not, no issues.
> Mail server admins (and those footing the bills) love excessive reporting...

Old versions of docs can be recovered up until you compact, but not
after. Also, in terms of statistical reporting, check out Jan's stats
additions that will help reporting on the internals of CouchDB itself
(think # of requests, time per request, etc).

>>> Normal filesystem-based storage of mail has other issues:
>>>
>>> * Messages often smaller than ethernet jumbo frames, so limited throughput
>>>   (couch can overcome this by bundling messages in a single response)
>>> * Mostly limited by disk IO and clever tricks around solid state drive
>>>   usage or striping excessively fast disks
>>>
>>> Let's assume nothing about existing mail stores, except that filesystem ones
>>> don't scale well, and I don't even want to consider the possibility of
>>> abusing an RDBMS for this.
>>>
>>> Everything is exploratory, the thought just crept into my mind a couple of
>>> days ago and I'd like to bounce the idea around with everyone for fun.
>>>
>>> Thanks for all the hard work, and everyone's patience with newbies and
>>> attackers alike.
>>
>> Hey, thanks for the nice words :)
>>
>> Hmm, not too much information. Let's see, if you have any more specific
>> questions, just send a follow up :)
>
> Well, let's try and keep this as close to couch as we can and not wander off
> into the nasty world of email systems (except for effectively CRUD-ing
> messages).
>
> So mail arrives at our SMTP server.
> What would give us the best performance
> for ingesting mail, directly writing each doc as it arrives, or having small
> queues that empty out every X messages / Y seconds (whichever comes first)?
> Considering one of our mail clouds does about 15GB an hour during office
> hours. I know this size isn't anything when you consider larger providers,
> but we're growing constantly and some time in the future we're gonna have to
> become creative in how we store mail.

Using _bulk_docs gives you a direct RAM vs. Speed trade off. The bigger
you can make single inserts the more efficient the entire system will
be. Obviously you'll have to balance that with latency concerns, but
at 15 GiB an hour I'd imagine you'll be hitting RAM limits before
latency is a factor.

> Retrieving mail also becomes interesting, we can use one view to get the
> total number of messages for the mailbox, and then another (with parameters)
> to batch them from couchdb as we deliver them to the client. Would bulk
> updates here be the cheapest way of "mark all as read" or "delete", or would
> you again handle documents individually?
>
> Best
>
> --
> Kenneth Kalmer
> kenneth.kalmer@gmail.com
> http://opensourcery.co.za

Remember to report any numbers you find back to the list. We like to
have real world examples to point at to give new people a feeling for
what type of stuff they can throw at CouchDB. And it helps with
bragging too. :D

HTH,
Paul Davis
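
For illustration, the "every X messages / Y seconds, whichever comes
first" queue and the _bulk_docs trade-off discussed above could be
sketched roughly as follows. This is a minimal sketch and not code from
the thread: the class and function names, the thresholds, and the
database URL passed to http_sender are all hypothetical. It assumes only
that _bulk_docs accepts a JSON body of the form {"docs": [...]} and that
a document carrying "_deleted": true along with its current "_rev" is
flagged as deleted.

```python
import json
import time
import urllib.request


class BulkBatcher:
    """Buffer documents and hand them to send() as one _bulk_docs payload.

    A flush happens once max_docs documents are buffered, or once max_age
    seconds have passed since the first buffered document, whichever
    comes first. Both thresholds are hypothetical placeholders.
    """

    def __init__(self, send, max_docs=500, max_age=2.0, clock=time.monotonic):
        self.send = send          # callable that receives the payload dict
        self.max_docs = max_docs  # the "X messages" limit
        self.max_age = max_age    # the "Y seconds" limit
        self.clock = clock        # injectable so the timing can be tested
        self.docs = []
        self.first_added = None

    def add(self, doc):
        if not self.docs:
            self.first_added = self.clock()
        self.docs.append(doc)
        if self.due():
            self.flush()

    def due(self):
        """True when either the size or the age threshold is reached."""
        if not self.docs:
            return False
        too_many = len(self.docs) >= self.max_docs
        too_old = self.clock() - self.first_added >= self.max_age
        return too_many or too_old

    def flush(self):
        if self.docs:
            payload = {"docs": self.docs}
            self.docs = []
            self.first_added = None
            self.send(payload)


def deletion_payload(id_rev_pairs):
    """Build a _bulk_docs body that flags each (id, rev) pair as deleted."""
    return {"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}
                     for doc_id, rev in id_rev_pairs]}


def http_sender(db_url):
    """Return a send() that POSTs payloads to <db_url>/_bulk_docs."""
    def send(payload):
        req = urllib.request.Request(
            db_url.rstrip("/") + "/_bulk_docs",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return send
```

Note that this sketch only checks the age threshold when a new document
is added, so a real ingester would also need a timer that calls flush()
when due() is true during quiet periods. Larger max_docs values mean
fewer, bigger requests at the cost of more RAM held in the buffer, which
is the trade-off described above.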