hadoop-common-user mailing list archives

From Eric Charles <eric.char...@u-mangate.com>
Subject Re: some guidance needed
Date Tue, 24 May 2011 03:01:16 GMT
Hi,

Yes, we need to store immutable mails and their associated r/w metadata.

I was wondering how a solution like the one presented in [1] could
help. Twitter seems to use Protocol Buffers to store tweets.

Would a solution based on Avro be a better fit for our needs (mail storage)?

In this Avro option, would each "mail" be an Avro file, or should we
consider making the "folder" an Avro file and running some map/reduce jobs?
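
To make the "folder as one file" option concrete, here is a minimal sketch in Python. It is not Avro: the length-prefixed record layout and the names append_mail, iter_mails and read_mail are hypothetical stand-ins for what a container format would provide. It only illustrates the idea of packing many small immutable mails into one large file and remembering each mail's offset.

```python
import io
import struct

# Hypothetical record layout (NOT the Avro container format):
# a 4-byte big-endian length followed by the raw message bytes.
_LEN = struct.Struct(">I")


def append_mail(folder, raw_message):
    """Append one raw message to the folder file and return its byte offset."""
    folder.seek(0, io.SEEK_END)
    offset = folder.tell()
    folder.write(_LEN.pack(len(raw_message)))
    folder.write(raw_message)
    return offset


def iter_mails(folder):
    """Yield (offset, raw_message) for every record, in the order written."""
    folder.seek(0)
    while True:
        offset = folder.tell()
        header = folder.read(_LEN.size)
        if not header:
            return  # clean end of file
        if len(header) < _LEN.size:
            raise ValueError("truncated record header")
        (length,) = _LEN.unpack(header)
        body = folder.read(length)
        if len(body) < length:
            raise ValueError("truncated record body")
        yield offset, body


def read_mail(folder, offset):
    """Read back the single message stored at a known offset."""
    folder.seek(offset)
    (length,) = _LEN.unpack(folder.read(_LEN.size))
    return folder.read(length)


if __name__ == "__main__":
    inbox = io.BytesIO()  # stands in for one large file per folder
    first = append_mail(inbox, b"Subject: one\r\n\r\nfirst body")
    second = append_mail(inbox, b"Subject: two\r\n\r\nsecond body")
    print(read_mail(inbox, second))
    print([offset for offset, _ in iter_mails(inbox)])
```

With a layout like this, a batch job would scan the whole folder sequentially, while a single mail is fetched by offset; the offsets themselves would have to live somewhere else, for instance with the r/w metadata.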

Thanks,

- Eric

[1] http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter


On 19/05/2011 20:53, Robert Burrell Donkin wrote:
> On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan <stan.ieugen@gmail.com> wrote:
>> I have forwarded this discussion to my mentors so they are informed
>
> (I've hopped onto this list so no need to remember to copy me into the
> thread ;-)
>
> <snip>
>
>> Eric, one of my mentors, suggested I use Gora for
>> this and after a quick look at Gora I saw that it is an ORM for HBase
>> and Cassandra which will allow me to switch between them. The downside
>> with this is that Gora is still incubating, so advice about whether to
>> use it is welcome. I will also ask on the Gora mailing list
>> to see how things are there.
>
> (I suspect there will be a measure of experimentation required in this
> project, so don't be afraid to try a spike or two)
>
>>>> I would encourage you to look at a system like HBase for your mail
>>>> backend. HDFS doesn't work well with lots of little files, and also
>>>> doesn't support random update, so existing formats like Maildir
>>>> wouldn't be a good fit.
>
> (Apache James is closer to the Microsoft Exchange space than traditional
> *nix mail user agents)
>
>> I don't think I understand correctly what you mean by random updates.
>> E-mails are immutable so once written they are not going to be
>> updated. But if you are referring to the fact that lots of (small)
>> files will be written in a directory and that this can be a problem
>> then I get it. This will also mean that mailbox format (all emails in
>> one file) will be even less appropriate than Maildir. But since e-mails
>> are immutable and adding a mail to the mailbox means appending a small
>> piece of data to the file this should not be a problem if Hadoop has
>> append.
>
> Essentially, there are two classes of data that mail storage requires:
>
> 1. read-only MIME documents (mail messages) embedding meta-data (headers)
> 2. read-write meta-data sets about each document including flags for
> each (virtual) mail directory containing the document
>
> The documents are searched rarely. The meta-data sets are read often
> but written rarely.
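
The split described above can be sketched as a small data model. This is only an illustration under assumptions: the class and field names (MailDocument, MailMetadata, flags_by_mailbox) are hypothetical and do not come from James, HBase or Cassandra. The document is frozen once written, while the per-mailbox flags stay mutable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MailDocument:
    """Class 1: a read-only MIME document, never modified after storage."""
    message_id: str
    raw_mime: bytes


@dataclass
class MailMetadata:
    """Class 2: read-write meta-data about one document.

    flags_by_mailbox maps each (virtual) mailbox containing the document
    to the set of flags the document carries in that mailbox.
    """
    message_id: str
    flags_by_mailbox: dict = field(default_factory=dict)

    def add_to_mailbox(self, mailbox):
        # A document may appear in several mailboxes, each with its own flags.
        self.flags_by_mailbox.setdefault(mailbox, set())

    def set_flag(self, mailbox, flag):
        self.flags_by_mailbox.setdefault(mailbox, set()).add(flag)

    def clear_flag(self, mailbox, flag):
        self.flags_by_mailbox.get(mailbox, set()).discard(flag)


if __name__ == "__main__":
    doc = MailDocument("msg-1", b"Subject: hi\r\n\r\nbody")
    meta = MailMetadata(doc.message_id)
    meta.add_to_mailbox("INBOX")
    meta.add_to_mailbox("Archive")
    meta.set_flag("INBOX", "\\Seen")  # IMAP system flag, set in INBOX only
    print(meta.flags_by_mailbox)
```

Keeping the two classes apart means the immutable documents can go to bulk storage, and only the small, frequently read meta-data needs a store that supports updates.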
>
> I suspect that emails are relatively small in Hadoop terms, and are
> often numerous. Might be interesting to see how a tuned HDFS instance
> performs when storing large numbers of small MIME documents. Should be
> easy enough to set up an experiment to benchmark. (I wonder whether a
> RESTful distributed storage solution might end up working better.)
>
> I suspect that the read-write meta-data sets will need HBase (or
> Cassandra). Would need to think carefully about design, I think.
>
>> The presentation on Vimeo stated that HDFS 0.19 did not have append.
>> I don't know yet what the status of that is, but things are a little
>> brighter. You could have a mailbox file that could grow to a very
>> large size. This would put all of a user's emails into one big file
>> that is easy to manage; the only thing missing is a way of
>> fetching the emails. Since emails are appended to the file (inbox) as
>> they come, and you are usually interested in the latest emails
>> received, you could just read the tail of the file and do some indexing
>> based on that.
>
> I'm not hopeful about adopting an append based approach. (Might be
> made to work but I suspect that the locking required for IMAP or POP3
> is likely to kill performance.)
>
> Robert

