hadoop-common-user mailing list archives

From "Malcolm Matalka" <mmata...@millennialmedia.com>
Subject RE: Understanding file splits
Date Tue, 28 Oct 2008 19:20:03 GMT
I did a test where I created a free-standing Java application that just
opens one of the URIs and tries to read all of it, just as I do in the
RecordReader.  This worked fine and successfully read the entire
file.  The M/R job seems to be getting EOF at the end of the first block
though.
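
(For reference, the standalone test was essentially of this shape -- the
class name and buffer size here are illustrative, not the actual test code:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWholeFile {
        public static void main(String[] args) throws Exception {
            // args[0] is the same hdfs:// URI the RecordReader gets from its split
            Path path = new Path(args[0]);
            FileSystem fs = path.getFileSystem(new Configuration());
            FSDataInputStream in = fs.open(path);
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;              // keep reading until EOF
            }
            in.close();
            System.out.println("read " + total + " bytes");
        }
    }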

I am using hadoop 0.18.1 by the way.

-----Original Message-----
From: Malcolm Matalka [mailto:mmatalka@millennialmedia.com] 
Sent: Tuesday, October 28, 2008 11:41
To: core-user@hadoop.apache.org
Subject: RE: Understanding file splits

Thanks for the response, Owen.

As for the 'isSplittable' thing: the FAQ calls this function
'isSplittable', but in the API it is actually 'isSplitable'.  I am not
sure who to contact to fix the FAQ.  I am extending FileInputFormat in
this case, so it was actually returning true.
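
(For the archives, the method to override is spelled with one 't' in the
API; the fix I mention below is roughly this, inside my FileInputFormat
subclass -- just a sketch, not the exact code:)

    // override in the FileInputFormat subclass -- note the single 't'
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;   // fixed-length records: take one split per file
    }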

In this case the output I have given is all for the same file, and 'end'
is start + length.  'pos' is a variable I use to keep track of how many
bytes I have read: after every successful read I add the record length
(a fixed value) to pos, so pos represents the position of the last whole
record read.
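
(Sketch of the next() I am describing -- RECORD_LENGTH, fileIn, and pos
are my field names, and the LongWritable/BytesWritable types are just
for illustration, not necessarily what the job uses:)

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
        byte[] record = new byte[RECORD_LENGTH];
        try {
            fileIn.readFully(record, 0, RECORD_LENGTH);   // one fixed-size record
        } catch (EOFException eof) {                      // java.io.EOFException
            return false;                                 // no more whole records
        }
        key.set(pos);
        value.set(record, 0, RECORD_LENGTH);
        pos += RECORD_LENGTH;     // pos advances by one record per successful read
        return true;
    }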

I fixed my code to return false from isSplitable and I got 1 split per
input file; here is the output:
start: 0 end: 690244390 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00007
start: 0 end: 690349770 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00005
start: 0 end: 690385960 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00001
start: 0 end: 690433590 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00006
start: 0 end: 690462960 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00011
start: 0 end: 690557560 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00002
start: 0 end: 690585720 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00010
start: 0 end: 690651500 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00004
start: 0 end: 690687030 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00000
start: 0 end: 690730700 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00009
start: 0 end: 691173450 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00003
start: 0 end: 691200180 pos: 67108800 path: hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/conflated.20081016/part-00008

Note, I print this when I hit EOF on the input file, so this is where I
will return false from next() in the record reader.  I am creating the
input stream with:

        final Path file = split.getPath();
        path = file;
        start = split.getStart();
        end = start + split.getLength();

        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

I took that from LineRecordReader.
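
(Purely hypothetical: if I re-enable splitting later, the fixed record
length should make the "find the first whole record" step easy -- round
the split start up to a record boundary instead of scanning for a
delimiter, something like:)

        long start = split.getStart();
        // first record boundary at or after the split start (hypothetical)
        long firstRecord = ((start + RECORD_LENGTH - 1) / RECORD_LENGTH) * RECORD_LENGTH;
        fileIn.seek(firstRecord);
        pos = firstRecord;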

Any thoughts?

Thanks


-----Original Message-----
From: Owen O'Malley [mailto:omalley@apache.org] 
Sent: Tuesday, October 28, 2008 11:30
To: core-user@hadoop.apache.org
Subject: Re: Understanding file splits


On Oct 28, 2008, at 6:29 AM, Malcolm Matalka wrote:

> I am trying to write an InputFormat and I am having some trouble
> understanding how my data is being broken up.  My input is a previous
> hadoop job and I have added code to my record reader to print out the
> FileSplit's start and end position, as well as where the last record I
> read was located.  My records are all about 100 bytes, so fairly small.
> For one file I am seeing the following output:
>
>
>
> start: 0 end: 45101881 pos: 67108800
>
> start: 45101880 end: 90203762 pos: 67108810
>
> start: 90203761 end: 135305643 pos: 134217621
>
> start: 135305642 end: 180170980 pos: 180170902

It would help if you printed the FileSplits themselves, so that we can
see the file names. I don't know where the "pos" is coming from.
FileSplits only have offset and length.
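
Something along these lines in your getRecordReader() would show it
('genericSplit' here is just a name for the InputSplit parameter you are
handed):

    FileSplit fileSplit = (FileSplit) genericSplit;
    System.out.println("split: " + fileSplit.getPath()
                       + " start=" + fileSplit.getStart()
                       + " length=" + fileSplit.getLength());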

> Note, I have also specified in my InputFormat that isSplittable return
> false.

That isn't working. Otherwise, you would only have FileSplits that  
start at 0.

> I do not understand why there is overlap.  Note that on the second  
> one,
> I never appear to reach the end position.

The RecordReaders have to read more than their split because they can
only process whole records. So, in the case of text files and
TextInputFormat, the split is picked blindly. Then the readers for all
of the splits that don't start at offset 0 read until they reach a
newline. They start from there and read until the newline *past* the end
of the split. That way all of the data is processed and no partial
records are processed.
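
Schematically, the reader logic looks like this (not the actual
LineRecordReader source; readLine here is just a stand-in for its
line-reading helper):

    // the split boundaries [start, end) are picked blindly
    if (start != 0) {
        in.seek(start);
        readLine(in);               // discard the partial line; the previous
                                    // split's reader will read it in full
        start = in.getPos();
    }
    long pos = start;
    while (pos < end) {             // a record that *starts* before 'end' is ours,
        String line = readLine(in); //   even if it finishes past 'end'
        if (line == null) {
            break;                  // end of file
        }
        // emit (pos, line) as the next key/value pair
        pos = in.getPos();
    }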

-- Owen
