hadoop-mapreduce-user mailing list archives

From Elton Pinto <epti...@gmail.com>
Subject Read After Write Consistency in HDFS
Date Thu, 02 Sep 2010 17:35:05 GMT

I apologize if this topic has already been brought up, but I was unable to
find it by searching around.

We recently discovered an issue in one of our jobs where the output of one
job does not seem to be making it into another job. The first job is a
loader job that's just a map step: it asynchronously downloads external
data in multiple threads and then writes to HDFS directly (i.e. not using
the OutputCollector) via FileSystem and FSDataOutputStream. I believe we
did this because we had issues in this situation with writing using the
OutputCollector.

The job that consumes this data runs immediately afterwards, taking the
loader job's output directory as its input directory. Very rarely, though,
it looks like not all of the files are consumed, which we assume means they
had not yet propagated to HDFS. The volume of data being loaded is on the
order of 10 GB.

The fix we're working on is to append the total number of files (i.e. the
number of mappers) to each file name and then check that the actual number
of files matches the expected count. I still had a few questions about this
issue, though:

1) Has anyone else seen anything like this? Is read-after-write consistency
just not guaranteed on HDFS?
2) Could it be an issue because we're not using an OutputCollector?
3) Does anyone know an easy way to change the file name that the
OutputCollector uses? MultipleTextOutputFormat seems to take only a
key/value pair to create file names, whereas what we really want is the
JobConf so we can get the task number and the total number of tasks. If the
OutputCollector is also affected by this issue, then we have other jobs
we need to set up this kind of check for.
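For what it's worth, the name-building part of question 3 is simple once the two numbers are in hand. In the job itself they would come from the JobConf (the old-API configuration keys "mapred.task.partition" and "mapred.map.tasks"); the sketch below takes them as plain arguments so the formatting logic stands alone, and the "part-NNNNN-of-NNNNN" format is again a hypothetical convention.

```java
// Sketch of building self-describing part file names from the task number
// and the total number of map tasks. Inside a job, taskId and totalTasks
// would be read from the JobConf ("mapred.task.partition" and
// "mapred.map.tasks" in the old API); here they are passed in directly.
public class PartNames {
    public static String partName(int taskId, int totalTasks) {
        return String.format("part-%05d-of-%05d", taskId, totalTasks);
    }

    public static void main(String[] args) {
        // Task 2 of 10 map tasks.
        System.out.println(partName(2, 10));   // part-00002-of-00010
    }
}
```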
