crunch-user mailing list archives

From David Ortiz <>
Subject Re: Multiple file inputs to Maps & HDFS issues
Date Mon, 23 Nov 2015 14:04:42 GMT

From what I've seen, Crunch plans the job so that each load/filter
action reads each file once.  So if you have 4 different record types
being processed through that first stage, and you aren't doing a join
or cogroup, expect each file to be read four times, once per record type.
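
To put rough numbers on that, here is a back-of-the-envelope sketch using the figures from the original message (4 input files of about 50 GB each). The function name `total_read_gb` is made up for illustration, and the model simply assumes read volume scales linearly with the number of passes over each file; it says nothing about what the Crunch planner actually does for a given pipeline.

```python
def total_read_gb(num_files, file_size_gb, passes_per_file):
    """Approximate GB read from HDFS if every file is scanned passes_per_file times."""
    return num_files * file_size_gb * passes_per_file


# Figures from the thread: 4 input files of about 50 GB each.
single_pass = total_read_gb(4, 50, 1)  # every file read exactly once
four_passes = total_read_gb(4, 50, 4)  # every file read four times (4 record types)
# Hypothetical worst case (an assumption, not confirmed in the thread):
# one full pass per extractor W1-W9.
nine_passes = total_read_gb(4, 50, 9)

print(single_pass, four_passes, nine_passes)
```

Comparing the single-pass figure with the multi-pass ones gives a feel for how quickly repeated scans could multiply the load on the data nodes.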


On Sat, Nov 21, 2015 at 8:44 PM Everett Anderson <> wrote:

> Hi,
> I'm trying to understand a little about how Crunch and Hadoop handle
> multiple file inputs to a Map and if there are multiplicative I/O effects
> that might trouble HDFS.
> We have an extract/transform pipeline in Crunch that we run with the
> Hadoop MapReduce pipeline implementation on AWS EMR 4.1 (Hadoop 2.6.0).
> In our situation, a given file may contain many record types, one per
> line, and we have DoFns and FilterFns that detect and separate out a given
> record type.
> Recently, as we've gotten more data, we've started running into what seems
> like HDFS data node issues -- it seems like we're overloading them and then
> they fail to replicate blocks, leading to job failures.
> One failing case has 4 input files of about 50 GB each. Our Crunch dotfile
> looks like this:
> Not shown is the fact that in the middle there's a Crunch union of the 4
> PTables, and it's on the union that the record-specific extractors (W1-W9)
> run.
> In this situation, is the unit of work / shard going into the Map a single
> input split from any one of the 4 files?
> Would Crunch or Hadoop re-read any of the files multiple times?
> Do you see any situation in which more total I/O would be performed than
> just the sum of the input file sizes and the sum of the outputs of W1-W9?
> Thanks!
> - Everett
