hadoop-hdfs-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Block vs FileSplit vs record vs line
Date Thu, 14 Mar 2013 11:42:05 GMT
Just to add to what Manish sir has said: HDFS blocks and MR file splits are
two different things. File splits are just a logical division of your data,
such that each split goes to a mapper for processing. Split creation depends
on the InputFormat you use. But it's not always the case that each split gets
an exclusive mapper. For example, if you process a huge CSV file with (say)
1 million rows, you won't get 1 million mappers, as that would add a lot of
overhead. The framework tries to do everything as efficiently as possible.
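
To make the block/split/row distinction concrete, here is a minimal Python sketch modelled on the split-size rule used by Hadoop's FileInputFormat (split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split). The function names, the 1000 MB file, and the 128 MB block size are assumptions for illustration only; the real code also handles non-splittable files, block locations, and records that cross split boundaries.

```python
# Simplified model of how FileInputFormat-style splits are sized.
# Illustrative only: names and numbers here are assumptions, not Hadoop's API.

SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    """Split size is the block size, clamped to [min_size, max_size]."""
    return max(min_size, min(max_size, block_size))


def file_splits(file_len, block_size, min_size=1, max_size=2**63 - 1):
    """Return a list of (offset, length) logical splits for one file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    remaining = file_len
    # Emit full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_len - remaining, split_size))
        remaining -= split_size
    # Whatever is left (possibly slightly over split_size) is the last split.
    if remaining > 0:
        splits.append((file_len - remaining, remaining))
    return splits


MB = 1024 * 1024
splits = file_splits(file_len=1000 * MB, block_size=128 * MB)
print(len(splits))  # 8
```

The row count never enters the calculation: under these assumptions a 1000 MB CSV yields 8 splits whether it holds a thousand rows or a million, and each mapper iterates over all the records in its split. Lowering the maximum split size below the block size (for example `max_size=64 * MB`) produces more, smaller splits without changing the HDFS blocks at all.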

Warm Regards,

On Thu, Mar 14, 2013 at 4:59 PM, Manish Bhoge <manishbhoge@rocketmail.com> wrote:

> Sai,
> Each file is divided into splits as per the map input format; each split is
> equal to a map. You rightly stated 1 split = 1 block = 1 map. A record can be
> a combination of blocks, as defined by the RecordReader code. One record can
> be a series of maps or splits or blocks.
> Hope this clears it up.
> Sent from HTC via Rocket! excuse typo.
>  ------------------------------
> From: Sai Sai <saigraph@yahoo.in>
> To: user@hadoop.apache.org <user@hadoop.apache.org>
> Subject: Re: Block vs FileSplit vs record vs line
> Sent: Thu, Mar 14, 2013 8:45:53 AM
> Just wondering if this is the right way to understand this:
> A large file is split into multiple blocks and each block is split into
> multiple file splits and each file split has multiple records and each
> record has multiple lines. Each line is processed by 1 instance of mapper.
> Any help is appreciated.
> Thanks
> Sai
