crunch-user mailing list archives

From: "Hansen, Chuck"
Subject: Re: Writing MapFile through Crunch, issue reading through Hadoop
Date: Mon, 09 Sep 2013 18:18:12 GMT
Thanks for the quick reply, Josh. Is there a way I could use a PathFilter when creating the
MapFile.Reader[] array?

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(new Path(MAPFILE_LOCATION), conf);

Chuck Hansen
Software Engineer, Record Dev | 816-201-9629
Cerner Corporation

From: Josh Wills
Date: Monday, September 9, 2013 12:44 PM
Subject: Re: Writing MapFile through Crunch, issue reading through Hadoop

Tough to assign blame here: writing a _SUCCESS flag is usually a good thing, and most Hadoop
file formats are smart about filtering out files that start with "_" or ".", or allow you
to specify an instance of PathFilter that can be used to ignore hidden files.
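
To make the hidden-file convention concrete, here is a minimal sketch of the naming rule in plain Java. It is only an illustration: the class HiddenFileFilter, its methods, and the directory entry names are hypothetical and are not part of Hadoop or Crunch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Hadoop or Crunch class.
final class HiddenFileFilter {

    private HiddenFileFilter() {}

    // A name counts as hidden when it starts with '_' or '.',
    // e.g. the _SUCCESS marker or a .crc checksum file.
    static boolean isHidden(String fileName) {
        return fileName.startsWith("_") || fileName.startsWith(".");
    }

    // Returns the names that are not hidden, preserving input order.
    static List<String> visible(List<String> fileNames) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (!isHidden(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative entry names for a job output directory (assumed, not from the thread).
        List<String> outputDir = Arrays.asList(
                "_SUCCESS", "part-r-00000", ".part-r-00000.crc", "part-r-00001");
        System.out.println(visible(outputDir));
    }
}
```

In Hadoop itself, the same check could presumably live in a PathFilter whose accept(Path) method tests path.getName(), with the filter passed to FileSystem.listStatus(Path, PathFilter) so that only the non-hidden entries are opened.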

One way around this would be to add an option to Targets that would disable writing the _SUCCESS
flag, which would be part of a more general change to allow per-Source and per-Target configuration
options. For example, you could specify that some outputs of an MR job were compressed using
gzip, and others were compressed using Snappy, instead of having a single compression strategy
for everything.

On Mon, Sep 9, 2013 at 10:28 AM, Hansen, Chuck wrote:
With Crunch versions prior to 0.7.x, there does not appear to be a _SUCCESS file written
upon completion; starting with 0.7.x, there is. This file (and any others not intended to
be read through [1]) appears to cause issues with [1]. This means writing a MapFile with Crunch
and reading it back with [1] works prior to 0.7.x, but starting with 0.7.x, [1] will throw an

Is this a bug with Crunch and/or Hadoop?

[1] org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.getReaders

Hadoop CDH versions used:




Director of Data Science
Twitter: @josh_wills
