crunch-user mailing list archives

From Josh Wills <>
Subject Re: Non Deterministic Record Drops
Date Tue, 28 Jul 2015 00:47:55 GMT
Hey Jeff,

Are the counts determined by Counters? Or is it the length of the output
files? Or both?


On Mon, Jul 27, 2015 at 5:29 PM, David Ortiz <> wrote:

> Out of curiosity, any reason you went with multiple reads as opposed to
> just performing multiple operations on the same PTable? parallelDo returns
> a new object rather than modifying the initial one, so a single collection
> can start multiple execution flows.
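
The pattern David describes can be sketched as follows (a hypothetical illustration, not code from this thread: the pipeline setup, file paths, and DoFns are assumptions):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class BranchingExample {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(BranchingExample.class, new Configuration());

    // One read of the input...
    PCollection<String> lines ="input.txt");

    // ...feeding two independent execution flows. parallelDo returns a
    // new PCollection rather than mutating `lines`, so both branches
    // derive from the single read above.
    PCollection<String> upper = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.toUpperCase());
      }
    }, Writables.strings());

    PCollection<Integer> lengths = lines.parallelDo(new DoFn<String, Integer>() {
      @Override
      public void process(String input, Emitter<Integer> emitter) {
        emitter.emit(input.length());
      }
    }, Writables.ints());

    pipeline.writeTextFile(upper, "out/upper");
    pipeline.writeTextFile(lengths, "out/lengths");
  }
}
```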
> On Mon, Jul 27, 2015, 8:11 PM Jeff Quinn <> wrote:
>> Hello,
>> We have observed and replicated strange behavior in our Crunch
>> application while running on MapReduce via the AWS ElasticMapReduce
>> service. Running a very simple, mostly map-only job, we see that an
>> undetermined subset of records is being dropped. Specifically, we
>> expect 30,136,686 output records, but on different trials (running
>> over the same data with the same binary) we have seen:
>> 22,177,119 records
>> 26,435,670 records
>> 22,362,986 records
>> 29,798,528 records
>> These are the aspects of our application that might be unusual and
>> relevant:
>> - We use a custom file input format, via From.formattedFile. It looks
>> like this (basically a carbon copy
>> of org.apache.hadoop.mapreduce.lib.input.TextInputFormat):
>> import;
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.RecordReader;
>> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>>
>> public class ByteOffsetInputFormat extends FileInputFormat<LongWritable, Text> {
>>   @Override
>>   public RecordReader<LongWritable, Text> createRecordReader(
>>       InputSplit split, TaskAttemptContext context) throws IOException,
>>       InterruptedException {
>>     return new LineRecordReader();
>>   }
>> }
>> - We call org.apache.crunch.Pipeline#read using this InputFormat many
>> times; for the job in question it is called ~160 times, as the input is
>> ~100 different files, each ranging in size from 100MB to 8GB. Our job
>> uses only this input format, for all input files.
>> - For some files, org.apache.crunch.Pipeline#read is called twice on the
>> same file, and the resulting PTables are processed in different ways.
>> - Only the files on which org.apache.crunch.Pipeline#read was called
>> more than once during a job have dropped records; all other files
>> consistently have no dropped records.
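
One aside on the "carbon copy" claim above: Hadoop's TextInputFormat additionally overrides isSplitable so that files with non-splittable compression codecs are never split, whereas the ByteOffsetInputFormat shown inherits FileInputFormat's default, which always allows splitting. A closer copy would include an override along these lines (a sketch modeled on Hadoop's TextInputFormat; whether it matters here depends on whether the inputs are compressed):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;

// Sketch of the isSplitable override that Hadoop's TextInputFormat carries
// and the ByteOffsetInputFormat above omits. Without it, FileInputFormat's
// default (always splittable) applies even to gzip-compressed inputs.
@Override
protected boolean isSplitable(JobContext context, Path file) {
  CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
  if (codec == null) {
    return true; // uncompressed files can be split freely
  }
  return codec instanceof SplittableCompressionCodec;
}
```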
>> Curious if any Crunch users have experienced similar behavior before,
>> or if any of these details about my job raise any red flags.
>> Thanks!
>> Jeff Quinn
>> Data Engineer
>> Nuna
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
