hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: MapReduce MIME Input type?
Date Tue, 31 Dec 2013 06:08:26 GMT
Hey Devin,

Are you perhaps looking for Apache James Mime4j (http://james.apache.org/mime4j/)?
You may have to adapt it for MR, but I don't imagine that would be too
difficult to do.
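
To illustrate the "one email per record" idea, here is a minimal sketch of the
grouping logic a custom record reader would need. It is plain Java with no
Hadoop or Mime4j calls. It assumes the dump is mbox-style, meaning every
message starts with a line beginning with "From "; your dump may use a
different boundary, in which case isMessageStart would have to change. The
class and method names are hypothetical, made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Mime4j.
class EmailRecordSplitter {

    // ASSUMPTION: mbox-style dump where each message begins with a
    // separator line starting with "From " (note the trailing space).
    static boolean isMessageStart(String line) {
        return line.startsWith("From ");
    }

    // Groups raw lines into whole messages, one String per email.
    static List<String> splitMessages(List<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (isMessageStart(line)) {
                // A new message begins: emit the one collected so far.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
            }
            // Lines seen before the first separator are skipped. They
            // would be the tail of a message that started earlier.
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "From alice@example.com Mon Dec 30 2013",
            "Subject: widget 42",
            "",
            "The widget is broken.",
            "From bob@example.com Tue Dec 31 2013",
            "Subject: Re: widget 42");
        List<String> messages = splitMessages(dump);
        System.out.println(messages.size() + " messages");
    }
}
```

In a real input format the same rule would be applied per split: skip forward
to the first separator at or after the split start, and keep reading past the
split end until the next separator, so an email that straddles two splits is
read exactly once. That is the same trick the stock line reader uses for
lines that cross a split boundary.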

On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:

> Hi,
> I am trying to puzzle this out, and am hoping for some insight - I have an
> IMAP inbox dump that I am analyzing - I need to track how many times a
> given item is referred to in the inbox, i.e. how many emails came in about
> that thing and over what time. I can load it into MapReduce as
> TextInputFormat and parse it properly, and have managed to crudely
> concatenate lines that represent an email together as my final output, so,
> basically, it is working now, but my program is seeing each line as an
> InputSplit, and so it is only working reliably with one InputFileSplit.
> If I had a bigger file, with multiple InputFileSplits presenting
> line-by-line InputSplits, I have no way to be sure that the lines that make
> one email will not end up in two different splits - does that make sense?
> Someone I work with suggested that I attempt to read each email as a
> record, since they have their MIME encoding intact in the text dump, rather
> than each line as a record.
> Does anyone know of a MIME MapReduce input type? I can't be sure this will
> help anyway, since the file is already text-encoded - I may have to get the
> email from the original inbox as individual messages somehow to utilize the
> MIME header information.
> Googling this has been challenging, mainly because the words you have to
> use are somewhat overloaded - but I am finding some good clown schools in
> my research...so, any help is appreciated.
> Thanks!
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com

Harsh J