hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: InputSplit and RecordReader
Date Fri, 20 Aug 2010 13:26:50 GMT
right.

-Gang




----- Original Message ----
From: Mark <static.void.dev@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/8/20 (Fri) 1:58:29 AM
Subject: Re: InputSplit and RecordReader

On 8/19/10 7:18 PM, Gang Luo wrote:
> The size of an input split can differ from the block size. You can specify
> the max/min size of input splits.
>
> An InputSplit is actually metadata indicating the start point in a file, the
> length of the split, etc. It doesn't hold the actual data. A mapper, when
> assigned a split to process, will read the input as specified in the
> InputSplit. It can read across the split boundary if needed.
>
> -Gang
>
>
>
>
> ----- Original Message ----
> From: Mark <static.void.dev@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: 2010/8/19 (Thu) 9:47:56 PM
> Subject: InputSplit and RecordReader
>
> From what I understand, the InputSplit is a byte slice of a particular file
> which is then handed off to an individual mapper for processing. Is the size
> of the InputSplit equal to the Hadoop block size, i.e. 64/128 MB? If not,
> what is the size?
>
> Now the RecordReader takes in bytes from the InputSplit and transforms them
> into a record-oriented structure suitable for use within a mapper, i.e.
> key/value pairs, correct? The wiki says it is the RecordReader's job to
> respect record boundaries. How is this accomplished? Say I have an InputSplit
> which is 100 KB in size and each record is approximately 30 KB in size. What
> happens to the last 10 KB in this example? I believe I read somewhere that it
> will read past that boundary, but how is that possible if the RecordReader
> has only been presented with 100 KB?
>
> Can someone please clarify some of these issues for me? Thanks
>
>
>
>      
Ok, so that makes a little more sense. Basically an InputSplit says
"start at offset x and read about y bytes", and the RecordReader then
reads a little past that range to finish the last record. Is this
along the right lines?
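
To make the "start at offset x and read about y bytes" idea concrete, here is a minimal sketch of how a reader for newline-delimited records could honor record boundaries. It is a simplified model in plain Python, not Hadoop's API: the function name `read_records`, the sample data, and the split sizes are all hypothetical. The convention assumed here is that a split other than the first skips forward to the first record boundary after its start, and every split keeps reading until it has finished the record that begins at or before its nominal end.

```python
def read_records(data: bytes, start: int, length: int) -> list:
    """Return the newline-delimited records owned by the split (start, length).

    `data` stands in for the whole file: the split is only metadata, so the
    reader is free to read bytes beyond start + length when a record needs it.
    """
    end = start + length
    pos = start
    if start != 0:
        # The bytes up to the next newline belong to a record that the
        # previous split finishes, so skip past them.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # A record is owned by this split if it begins at or before `end`,
    # even when its last bytes lie beyond `end`.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


# Ten 10-byte records (100 bytes) cut into four 25-byte splits, so every
# split boundary falls in the middle of a record.
data = b"".join(b"record-%02d\n" % i for i in range(10))
for start in (0, 25, 50, 75):
    print(start, read_records(data, start, 25))
```

Under this convention each record is returned by exactly one split, although no split boundary lines up with a record boundary. That matches the intuition above: the split describes roughly where to read, and the reader adjusts both ends to whole records.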



      
