hadoop-hdfs-user mailing list archives

From Azuryy Yu <azury...@gmail.com>
Subject Re: MapReduce job is not picking up appended data.
Date Tue, 27 Jan 2015 08:30:30 GMT
Are you sure you can 'cat' the latest batch of the data on HDFS?
With Flume, the data is available only after the file is rolled, because
Flume only calls FileSystem.close() when it rolls the file. Until the file
is closed, the NameNode may still report a stale length for it, and
MapReduce computes its input splits from that reported length
(FileStatus.getLen()), so the job only reads up to that point even though
'cat' (which asks the DataNodes for the visible length of the last block)
can show more.
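If rolling more often is acceptable, you can make the sink close files on an
interval so the data becomes visible to MR jobs. A minimal sketch of the HDFS
sink configuration (the agent/sink names "a1"/"k1" and the path are
placeholders; adjust to your setup):

```
# hypothetical agent "a1" with HDFS sink "k1"
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://test/data/input
a1.sinks.k1.hdfs.rollInterval = 3600   # close and roll the file every hour
a1.sinks.k1.hdfs.rollSize = 0          # disable size-based rolling
a1.sinks.k1.hdfs.rollCount = 0         # disable event-count-based rolling
a1.sinks.k1.hdfs.batchSize = 100       # events written per flush
```

With rollSize and rollCount set to 0, only the time-based roll fires, which
matches the one-hour rolling described below.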

On Mon, Jan 26, 2015 at 8:17 PM, Uthayan Suthakar <
uthayan.suthakar@gmail.com> wrote:

> I have a Flume agent that streams data into an HDFS sink (appending to the
> same file), which I can "hdfs dfs -cat" and see from HDFS. However, when I
> run a MapReduce job on the folder that contains the appended data, it only
> picks up the first batch that was flushed (batchSize = 100) into HDFS. The
> rest is not picked up, although I can cat and see it. When I execute the
> MapReduce job after the file is rolled (closed), it picks up all the data.
> Do you know why the MR job is failing to find the rest of the batches even
> though they exist?
> So this is what I'm trying to do:
> 1) Read a constant flow of data from a message queue and write it into HDFS.
> 2) Rolling is configured by interval (1 hour), e.g. hdfs.rollInterval = 3600.
> 3) The number of events to write into the file before flushing into HDFS is
> set to 100, e.g. hdfs.batchSize = 100.
> 4) Append support is enabled at a lower level, e.g.
> hdfs.append.support = true.
> Snippets from the Flume source:
>   if (conf.getBoolean("hdfs.append.support", false) == true
>       && hdfs.isFile(dstPath)) {
>     outStream = hdfs.append(dstPath);
>   } else {
>     outStream = hdfs.create(dstPath);
>   }
> 5) Now, all configurations for appending data into HDFS are in place.
> 6) I tested Flume and I could see an hdfs://test/data/input/event1.tmp
> file get written into HDFS.
> 7) When I run hdfs dfs -cat hdfs://test/data/input/event1.tmp, I can see
> all the data that has been appended to the file, e.g. 500+ events.
> 8) However, when I executed a simple MR job to read the folder
> hdfs://test/data/input, it only picked up the first 100 events, although
> the file had over 500 events.
> So it would appear that Flume is in fact appending data into HDFS, but the
> MR job is failing to pick it all up. Perhaps a block caching issue or a
> partition issue? Has anyone come across this?
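One way to confirm this is the stale-length issue rather than caching or
partitioning: compare the length the NameNode reports for the .tmp file with
the bytes actually streamable from the DataNodes. A hedged sketch against the
Hadoop FileSystem API (the path is the one from the thread; this needs a
running cluster and is untested here):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LengthCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path p = new Path("hdfs://test/data/input/event1.tmp");
    FileSystem fs = p.getFileSystem(conf);

    // Length recorded at the NameNode -- this is what
    // FileInputFormat uses when computing input splits.
    FileStatus status = fs.getFileStatus(p);
    long reported = status.getLen();

    // Bytes actually streamable from the DataNodes
    // (this is roughly what 'hdfs dfs -cat' sees).
    long readable = 0;
    byte[] buf = new byte[8192];
    try (FSDataInputStream in = fs.open(p)) {
      int n;
      while ((n = in.read(buf)) > 0) {
        readable += n;
      }
    }

    System.out.println("NameNode length: " + reported);
    System.out.println("Readable bytes:  " + readable);
    // If readable > reported, the MR job will miss the appended
    // tail until the file is closed/rolled.
  }
}
```

If the readable byte count exceeds the reported length, the appended data is
there but invisible to split calculation until the file is rolled.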
