crunch-user mailing list archives

From Jeff Quinn <j...@nuna.com>
Subject Non Deterministic Record Drops
Date Tue, 28 Jul 2015 00:11:30 GMT
Hello,

We have observed and replicated strange behavior in our Crunch
application while running on MapReduce via the AWS ElasticMapReduce
service. Running a very simple job that is mostly map-only, we see that a
nondeterministic subset of records is dropped. Specifically, we expect
30,136,686 output records, but across different trials (running over the
same data with the same binary) we have seen:

22,177,119 records
26,435,670 records
22,362,986 records
29,798,528 records

These are all the aspects of our application that might be unusual and
relevant:

- We use a custom file input format, via From.formattedFile. It looks like
this (basically a carbon copy
of org.apache.hadoop.mapreduce.lib.input.TextInputFormat):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

import java.io.IOException;

public class ByteOffsetInputFormat extends FileInputFormat<LongWritable, Text> {

  // Delegate record reading to the stock LineRecordReader, which yields
  // (byte offset, line text) records, just as TextInputFormat does.
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException,
      InterruptedException {
    return new LineRecordReader();
  }
}

- We call org.apache.crunch.Pipeline#read using this InputFormat many
times; for the job in question it is called ~160 times, as the input is
~100 different files (some files are read more than once; see the next
point). Each file ranges in size from 100 MB to 8 GB. This is the only
input format our job uses, for all input files.

- For some files, org.apache.crunch.Pipeline#read is called twice on the
same file, and the resulting PTables are processed in different ways (a
sketch of this pattern follows this list).

- Only the files on which org.apache.crunch.Pipeline#read has been called
more than once during a job show dropped records; all other files
consistently have no dropped records.
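
In outline, the double-read pattern looks roughly like this. This is a
minimal sketch, not our actual pipeline: the class name, the path names,
and the placeholder writes at the end are hypothetical, standing in for
our real downstream processing.

import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class DoubleReadSketch {

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(DoubleReadSketch.class, new Configuration());

    // Two independent reads of the same file through the custom input
    // format (the path is hypothetical).
    PTable<LongWritable, Text> first = pipeline.read(From.formattedFile(
        "/data/part-000.txt", ByteOffsetInputFormat.class,
        LongWritable.class, Text.class));
    PTable<LongWritable, Text> second = pipeline.read(From.formattedFile(
        "/data/part-000.txt", ByteOffsetInputFormat.class,
        LongWritable.class, Text.class));

    // Each PTable is then processed in a different way; these writes are
    // placeholders for our real transformations.
    pipeline.writeTextFile(first.values(), "/out/values");
    pipeline.writeTextFile(second.keys(), "/out/offsets");
    pipeline.done();
  }
}

It is the output derived from reads like these two that shows the
nondeterministic drops.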

I'm curious whether any Crunch users have experienced similar behavior
before, or whether any of these details about my job raise red flags.

Thanks!

Jeff Quinn

Data Engineer

Nuna

