hadoop-hdfs-user mailing list archives

From Uthayan Suthakar <uthayan.sutha...@gmail.com>
Subject Re: MapReduce job is not picking up appended data.
Date Tue, 27 Jan 2015 12:45:35 GMT
Azuryy, I'm pretty sure that I could 'cat' the latest data. Please see below
for the evidence:

(1)
>>>Flume.conf:
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.batchSize = 10


>>>I sent 21 events and I could 'cat' and verify this:
$ hdfs dfs -cat /user/mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp | wc -l
21

>>>But when I submitted a MapReduce job on the above directory, it picked up
only 11 records (batchSize is 10, but it always processes one extra event):
Map-Reduce Framework:
Map input records=11


(2)
>>>I then decided to send 9 more events, and I could see that they were
appended to the file:
$ hdfs dfs -cat /user/wdtmon/atlas_xrd_mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp | wc -l
30

>>>However, when I executed the MapReduce job on the file, it still picked up
only those 11 events:
Map-Reduce Framework:
Map input records=11
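
For reference, the following is a minimal sketch (the class name is just an
example; only the standard Hadoop FileSystem API is used) that compares the
file length the NameNode reports, which FileInputFormat uses when it sizes
its input splits, with the number of bytes a streaming read actually returns:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LengthCheck {
    public static void main(String[] args) throws Exception {
      Path p = new Path(args[0]);   // e.g. the still-open .tmp file
      FileSystem fs = FileSystem.get(new Configuration());

      // Length recorded on the NameNode; FileInputFormat uses this value
      // when it computes input splits for a MapReduce job.
      long reportedLen = fs.getFileStatus(p).getLen();

      // Bytes actually returned by streaming the file, which is roughly
      // what 'hdfs dfs -cat' does.
      long readable = 0;
      try (FSDataInputStream in = fs.open(p)) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0) {
          readable += n;
        }
      }

      System.out.println("NameNode-reported length: " + reportedLen);
      System.out.println("Bytes readable via open(): " + readable);
    }
  }

If the two numbers differ while the .tmp file is still open, that would
explain why the job reads fewer records than 'cat' shows.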


Any idea what's going on?


On 27 January 2015 at 08:30, Azuryy Yu <azuryyyu@gmail.com> wrote:

> Are you sure you can 'cat' the latest batch of the data on HDFS?
> For Flume, the data is available only after the file is rolled, because
> Flume only calls FileSystem.close() when it rolls the file.
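>
> A minimal sketch of the difference (the class name and path below are just
> examples, assuming the Hadoop 2.x client API):
>
>   import java.util.EnumSet;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
>
>   public class FlushDemo {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo"));
>       out.write("hello\n".getBytes("UTF-8"));
>
>       // hflush(): new readers (e.g. 'hdfs dfs -cat') can see the bytes,
>       // but the file length stored on the NameNode is not updated.
>       out.hflush();
>
>       // hsync() with UPDATE_LENGTH (HDFS-specific) also persists the new
>       // length on the NameNode, so getFileStatus().getLen() reflects it.
>       if (out instanceof HdfsDataOutputStream) {
>         ((HdfsDataOutputStream) out).hsync(
>             EnumSet.of(HdfsDataOutputStream.SyncFlag.UPDATE_LENGTH));
>       }
>
>       // close(): finalizes the file; only after this is the full length
>       // guaranteed to be visible to a FileInputFormat-based job.
>       out.close();
>     }
>   }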
>
>
> On Mon, Jan 26, 2015 at 8:17 PM, Uthayan Suthakar <
> uthayan.suthakar@gmail.com> wrote:
>
>> I have a Flume agent which streams data into an HDFS sink (appending to
>> the same file), and I can "hdfs dfs -cat" the file and see the data in
>> HDFS. However, when I run a MapReduce job on the folder that contains the
>> appended data, it only picks up the first batch that was flushed
>> (batchSize = 100) into HDFS. The rest is not picked up, although I can
>> cat and see it. When I execute the MapReduce job after the file is
>> rolled (closed), it picks up all the data.
>>
>> Do you know why the MR job fails to find the rest of the data even
>> though it exists?
>>
>> So this is what I'm trying to do:
>>
>> 1) Read a constant flow of data from a message queue and write it into
>> HDFS.
>> 2) Rolling is configured by interval (1 hour), e.g. hdfs.rollInterval = 3600.
>> 3) The number of events written to the file before flushing into HDFS is
>> set to 100, e.g. hdfs.batchSize = 100.
>> 4) Append support is enabled at the lower level, e.g.
>> hdfs.append.support = true.
>>
>> Snippets from Flume source:
>>
>>     if (conf.getBoolean("hdfs.append.support", false) == true
>>         && hdfs.isFile(dstPath)) {
>>       outStream = hdfs.append(dstPath);
>>     } else {
>>       outStream = hdfs.create(dstPath);
>>     }
>>
>> 5) Now, all configurations for appending data into HDFS are in place.
>> 6) I tested Flume and could see an hdfs://test/data/input/event1.tmp file
>> being written into HDFS.
>> 7) When I run "hdfs dfs -cat hdfs://test/data/input/event1.tmp", I can
>> see all the data that has been appended to the file, e.g. 500+ events.
>> 8) However, when I executed a simple MR job to read the folder
>> hdfs://test/data/input, it only picked up the first 100 events, although
>> the file had over 500 events.
>>
>> So it would appear that Flume is in fact appending data into HDFS, but
>> the MR job is failing to pick up everything; perhaps a block caching
>> issue or a partitioning issue? Has anyone come across this issue?
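>>
>> To check what a job will actually read, one option (a hypothetical helper
>> class, using only the standard FileSystem API) is to print the length the
>> NameNode reports for each file in the input folder just before submitting
>> the job:
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.FileStatus;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   public class ListInputLengths {
>>     public static void main(String[] args) throws Exception {
>>       FileSystem fs = FileSystem.get(new Configuration());
>>       // args[0] is the job's input folder, e.g. hdfs://test/data/input
>>       for (FileStatus st : fs.listStatus(new Path(args[0]))) {
>>         // getLen() is the length the NameNode currently knows about;
>>         // input splits are computed from this value, not from what a
>>         // direct read of a still-open file may return.
>>         System.out.println(st.getPath() + " : " + st.getLen() + " bytes");
>>       }
>>     }
>>   }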
>>
>
>
