hadoop-user mailing list archives

From Uthayan Suthakar <uthayan.sutha...@gmail.com>
Subject MapReduce job is not picking up appended data.
Date Mon, 26 Jan 2015 12:17:48 GMT
I have a Flume agent that streams data into an HDFS sink (appending to the
same file), and I can see the data in HDFS with "hdfs dfs -cat". However, when
I run a MapReduce job on the folder that contains the appended data, it only
picks up the first batch that was flushed (batchSize = 100) into HDFS. The rest
is not picked up, although I can cat the file and see it. When I run the
MapReduce job after the file is rolled (closed), it picks up all of the data.

Do you know why the MR job fails to find the rest of the data even though
it exists?

So this is what I'm trying to do:

1) Read a constant data flow from a message queue and write it into HDFS.
2) Rolling is configured by interval (1 hour), e.g. hdfs.rollInterval = 3600
3) The number of events written to the file before flushing to HDFS is set
to 100, e.g. hdfs.batchSize = 100
4) Append support is enabled at a lower level, e.g.
hdfs.append.support = true.

Snippets from Flume source:

    if (conf.getBoolean("hdfs.append.support", false) == true
        && hdfs.isFile(dstPath)) {
      outStream = hdfs.append(dstPath);
    } else {
      outStream = hdfs.create(dstPath);
    }

5) With that, all the configuration for appending data into HDFS is in place.
6) I tested Flume and could see the file hdfs://test/data/input/event1.tmp
being written into HDFS.
7) When I ran hdfs dfs -cat hdfs://test/data/input/event1.tmp, I could see all
the data being appended to the file, e.g. 500+ events.
8) However, when I ran a simple MR job to read the folder
hdfs://test/data/input, it only picked up the first 100 events, although
the file had 500+ events.

So it would appear that Flume is in fact appending data into HDFS, but the MR
job fails to pick up everything. Could this be a block caching issue or a
partition issue? Has anyone come across this?
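
One further hypothesis that may be worth checking (an assumption, not something established in this message): file-based MapReduce input formats plan their splits from the file length recorded in file-system metadata, and for a file that is still open for writing that recorded length can lag behind the bytes a streaming reader such as "hdfs dfs -cat" is able to see. The plain-Java sketch below only models that gap; it does not call Hadoop, and the class name, the method names, and the 50-byte event size are hypothetical values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLengthSketch {

    /** A byte range [start, start + length) handed to one map task. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Simplified split planning: the splits cover [0, reportedLength) and
     * nothing beyond it, no matter how many bytes are actually readable.
     */
    static List<Split> planSplits(long reportedLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < reportedLength; start += splitSize) {
            long length = Math.min(splitSize, reportedLength - start);
            splits.add(new Split(start, length));
        }
        return splits;
    }

    /** Total number of bytes that the planned splits will read. */
    static long bytesCovered(List<Split> splits) {
        long total = 0;
        for (Split s : splits) {
            total += s.length;
        }
        return total;
    }

    public static void main(String[] args) {
        long eventSize = 50;                      // hypothetical bytes per event
        long reportedLength = 100 * eventSize;    // length recorded after the first batch
        long readableLength = 500 * eventSize;    // bytes a streaming reader can see
        long splitSize = 128L * 1024 * 1024;      // hypothetical split size

        long covered = bytesCovered(planSplits(reportedLength, splitSize));
        System.out.println("events covered by splits: " + covered / eventSize);
        System.out.println("events readable by cat:   " + readableLength / eventSize);
    }
}
```

If this model matches what happens on the cluster, the job would see all events once the recorded length catches up, which fits the observation that everything is picked up after the file is rolled (closed). Comparing the length shown by "hdfs dfs -ls" for the .tmp file with the amount of data returned by "hdfs dfs -cat" would be one way to test the idea.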
