hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Reading from File
Date Wed, 27 Apr 2011 06:49:27 GMT
Hello Mark,

On Wed, Apr 27, 2011 at 12:19 AM, Mark question <markq2011@gmail.com> wrote:
> Hi,
>   My mapper opens a file and read records using next() . However, I want to
> stop reading if there is no memory available. What confuses me here is that
> even though I'm reading record by record with next(), hadoop actually reads
> them in dfs.block.size. So, I have two questions:

dfs.block.size is an HDFS property and has no rigid relationship with
InputSplits in Hadoop MapReduce. The block size serves only as a hint
when computing the offsets and lengths of the splits that the
RecordReaders then seek to and read from.
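To make the "hint" point concrete, here is a minimal sketch of the split-size arithmetic, mirroring the logic of FileInputFormat.computeSplitSize (plain Java, no Hadoop dependency; the min/max values stand in for the configured min/max split size properties):

```java
// Sketch of how a split size is derived: the block size is only a
// hint, bounded by the configured minimum and maximum split sizes.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 512L * 1024 * 1024; // dfs.block.size of 512 MB
        long minSize = 1L;                   // default minimum split size
        long maxSize = Long.MAX_VALUE;       // default maximum split size

        // With default min/max, the split size follows the block size...
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
        // ...but capping the max split size overrides the block-size hint.
        System.out.println(computeSplitSize(blockSize, minSize, 64L * 1024 * 1024));
    }
}
```

So a 512 MB block size does not by itself force 512 MB of anything into memory; it only shapes where splits begin and end.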

> 1. Is it true that even if I set dfs.block.size to 512 MB, then at least one
> block is loaded in memory for mapper to process (part of inputSplit)?

Blocks are not pre-loaded into memory, they are merely read off the FS
record by record (or buffer by buffer, if you please).

You shouldn't really have memory issues with any of the
Hadoop-provided RecordReaders as long as individual records fit well
into available Task JVM memory grants.
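The streaming behaviour can be illustrated without the Hadoop API at all. This sketch uses plain java.io (a ByteArrayInputStream standing in for a DFS block) to show the "buffer by buffer" pattern: only a fixed-size buffer is ever resident in memory, however large the underlying data is:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of buffered streaming reads: memory use is bounded by the
// buffer size, not by the size of the block being read.
public class BufferedReadSketch {
    public static void main(String[] args) throws IOException {
        byte[] fakeBlock = new byte[1 << 20]; // stand-in for a 1 MB block
        InputStream in = new ByteArrayInputStream(fakeBlock);

        byte[] buffer = new byte[64 * 1024]; // fixed 64 KB read buffer
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n; // a RecordReader would decode records from here
        }
        System.out.println(total); // all bytes seen, 64 KB at a time
    }
}
```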

> 2. How can I read multiple records from a sequenceFile at once and will it
> make a difference ?

Could you clarify what you are after here? Do you want to supply your
mappers with N records per map() call via a sequence file, or are you
only trying to avoid the memory issues described above?

If it is the former, it would be better to prepare your Sequence Files
with batched records up front, rather than writing a custom N-record
splitting InputFormat for the SequenceFiles (which would have to
inspect the file before job submission).
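The batching idea can be sketched independently of the Hadoop API: group the records into lists of N before writing, so that each SequenceFile value (and hence each map() call) carries N records. This is a plain-Java illustration, not the SequenceFile writer itself:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of record batching: each inner list would become a single
// SequenceFile value, delivering N records per map() call.
public class RecordBatcher {
    static <T> List<List<T>> batch(List<T> records, int n) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += n) {
            batches.add(new ArrayList<>(
                records.subList(i, Math.min(i + n, records.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3", "r4", "r5");
        System.out.println(batch(records, 2));
    }
}
```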

Have I understood your questions right?

Harsh J
