Message-ID: <4DDB1F7C.1060306@u-mangate.com>
Date: Tue, 24 May 2011 05:01:16 +0200
From: Eric Charles
To: common-user@hadoop.apache.org
Subject: Re: some guidance needed

Hi,

Yes, we need to store immutable mails and their associated r/w metadata.

I was wondering how a solution like the one presented in [1] could help. Twitter seems to use Protocol Buffers to store tweets. Would a solution based on Avro be a better fit for our needs (mail storage)?

With the Avro option, would each "mail" be an Avro file, or should we consider making the "folder" an Avro file and running some map/reduce jobs over it?

Tks,
- Eric

[1] http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

On 19/05/2011 20:53, Robert Burrell Donkin wrote:
> On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan wrote:
>> I have forwarded this discussion to my mentors so they are informed
>
> (I've hopped onto this list so no need to remember to copy me into the
> thread ;-)
>
>> Eric, one of my mentors, suggested I use Gora for
>> this and after a quick look at Gora I saw that it is an ORM for HBase
>> and Cassandra which will allow me to switch between them. The downside
>> with this is that Gora is still incubating so a piece of advice about
>> using it or not is welcome. I will also ask on the Gora mailing list
>> to see how things are there.
>
> (I suspect there will be a measure of experimentation required in this
> project, so don't be afraid to try a spike or two)
>
>>>> I would encourage you to look at a system like HBase for your mail
>>>> backend. HDFS doesn't work well with lots of little files, and also
>>>> doesn't support random update, so existing formats like Maildir
>>>> wouldn't be a good fit.
>
> (Apache James is closer to the Microsoft Exchange space than traditional
> *nix mail user agents)
>
>> I don't think I understand correctly what you mean by random updates.
>> E-mails are immutable so once written they are not going to be
>> updated.
>> But if you are referring to the fact that lots of (small)
>> files will be written in a directory and that this can be a problem
>> then I get it. This will also mean that mailbox format (all emails in
>> one file) will be more inappropriate than Maildir. But since e-mails
>> are immutable and adding a mail to the mailbox means appending a small
>> piece of data to the file this should not be a problem if Hadoop has
>> append.
>
> Essentially, there are two classes of data that mail storage requires
>
> 1. read only MIME documents (mail messages) embedding meta-data (headers)
> 2. read-write meta-data sets about each document including flags for
>    each (virtual) mail directory containing the document
>
> The documents are searched rarely. The meta-data sets are read often
> but written rarely.
>
> I suspect that emails are relatively small in Hadoop terms, and are
> often numerous. Might be interesting to see how a tuned HDFS instance
> performs when storing large numbers of small MIME documents. Should be
> easy enough to set up an experiment to benchmark. (I wonder whether a
> RESTful distributed storage solution might end up working better.)
>
> I suspect that the read-write meta-data sets will need HBase (or
> Cassandra). Would need to think carefully about design, I think.
>
>> The presentation on Vimeo stated that HDFS 0.19 did not have append,
>> I don't know yet what the status is on that, but things are a little
>> brighter. You could have a mailbox file that could grow to a very
>> large size. This will lead to all the user's emails in one big file
>> that is easy to manage, the only thing that's missing is
>> fetching the emails. Since emails are appended to the file (inbox) as
>> they come, and you usually are interested in the latest emails
>> received you could just read the tail of the file and do some indexing
>> based on that.
>
> I'm not hopeful about adopting an append based approach.
> (Might be
> made to work but I suspect that the locking required for IMAP or POP3
> is likely to kill performance.)
>
> Robert
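
To make the two classes of data from the quoted message a bit more concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not a proposal for the actual James backend: the class and method names (`StoredMessage`, `MessageMetadata`, `MailStore`) are hypothetical, keying messages by a SHA-256 hash of their content is my own assumption, and the plain dictionaries merely stand in for whatever stores end up being used (for example a write-once document store for the MIME bodies and HBase or Cassandra for the metadata).

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredMessage:
    """Class 1: an immutable MIME document, written once and never updated."""
    message_id: str
    raw_mime: bytes


@dataclass
class MessageMetadata:
    """Class 2: read-write metadata for one message in one (virtual) mailbox."""
    flags: set = field(default_factory=set)


class MailStore:
    """Hypothetical store keeping the two classes of data strictly apart."""

    def __init__(self):
        # Assumption: dicts stand in for the real backends.
        self._messages = {}   # message_id -> StoredMessage (write-once)
        self._metadata = {}   # (mailbox, message_id) -> MessageMetadata

    def add_message(self, mailbox, raw_mime):
        # Assumption: a content hash serves as the message key, so the same
        # document filed in several mailboxes is stored only once.
        message_id = hashlib.sha256(raw_mime).hexdigest()
        self._messages.setdefault(message_id, StoredMessage(message_id, raw_mime))
        self._metadata.setdefault((mailbox, message_id), MessageMetadata())
        return message_id

    def set_flag(self, mailbox, message_id, flag):
        # Only the small metadata record is mutated; the document is untouched.
        self._metadata[(mailbox, message_id)].flags.add(flag)

    def flags(self, mailbox, message_id):
        return frozenset(self._metadata[(mailbox, message_id)].flags)

    def fetch(self, message_id):
        return self._messages[message_id].raw_mime
```

The point of the split is that flag changes (read often, written rarely) never touch the stored document, so the document side could live in an append-only or write-once store while only the per-mailbox metadata needs random updates.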