hadoop-mapreduce-user mailing list archives

From Rekha Joshi <rekha...@yahoo-inc.com>
Subject Re: HDFS and MapReduce and /tmp directory
Date Tue, 06 Apr 2010 05:10:23 GMT
Well, re-reading your message, it seems you are more interested in the exact steps of how HDFS
reads/writes data, directly or via MR.
From what I know, depending on your setup (single-node or otherwise), the HDFS storage mechanism,
replication, and the namenode-datanode interaction, there will be an intermediate step of writing
to local disk, a check against the data block size (64 MB by default), a checkpoint, and the data
would eventually be persisted. I have not tried to find out which HD exactly yet :)
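To make the block-size step above concrete, here is a small sketch (the 200 MB file size, the
path /user/you/input.txt, and the 64 MB default are illustrative assumptions; fsck is the
standard way to see where blocks physically land on datanodes):

```shell
# A file larger than the block size (dfs.block.size, 64 MB by default)
# is split into multiple blocks; e.g. a 200 MB file needs ceil(200/64) blocks:
FILE_MB=200
BLOCK_MB=64
NUM_BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "$NUM_BLOCKS blocks"    # prints "4 blocks"

# On a live cluster, fsck reports each block and the datanodes holding
# its replicas (path is illustrative):
# bin/hadoop fsck /user/you/input.txt -files -blocks -locations
```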

The copyFromLocal destination folder is the HDFS path you specify. If it is not a fully qualified
path, the file will be placed under your default HDFS directory.
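For example (the username alice and the file names below are hypothetical):

```shell
# Relative destination: resolved against your HDFS home directory,
# so for user "alice" it ends up at /user/alice/data.txt:
# bin/hadoop dfs -copyFromLocal local.txt data.txt
#
# Fully qualified destination: stored exactly where you say:
# bin/hadoop dfs -copyFromLocal local.txt /user/alice/input/data.txt

# The resolution rule for a relative path, sketched in shell:
USER_NAME=alice
REL_DEST=data.txt
echo "/user/${USER_NAME}/${REL_DEST}"    # prints "/user/alice/data.txt"
```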

The file.out.index file is used to find where the map output for a given reducer
is available, while file.out holds the map output itself. You might like to look at LocalDirAllocator
for finer details on allocation, disk writability, capacity, etc.
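As I understand it (the record layout below is my reading of the code, so treat it as an
assumption), each index entry holds three longs per reducer partition (start offset into
file.out, raw length, compressed length), so the index grows linearly with the number of reducers:

```shell
# Assumed layout: one 24-byte record (three 8-byte longs) per reducer
# partition in file.out.index, plus a trailing checksum.
REDUCERS=10
BYTES_PER_RECORD=24   # 3 longs x 8 bytes each
echo $(( REDUCERS * BYTES_PER_RECORD ))   # prints 240
```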

just 2 cents,

On 4/5/10 6:10 PM, "psdc1978" <psdc1978@gmail.com> wrote:

Yes, I know that, but this doesn't answer my questions.

On Mon, Apr 5, 2010 at 12:31 PM, Rekha Joshi <rekhajos@yahoo-inc.com> wrote:
To provide a quick byte: the /tmp folder (check hadoop.tmp.dir) is only used temporarily
by the MR process, and its contents are ideally cleaned up after the job has finished execution.
MR is a process which loads/stores data in HDFS. Most of your queries relate to knowing your
default HDFS location. You can find that with "hadoop dfs -ls". The path preceding .Trash is
your default HDFS location.
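For instance (the username alice and the listing below are illustrative, not actual output):

```shell
# bin/hadoop dfs -ls
# Found 2 items
# drwxr-xr-x  - alice supergroup  0 2010-04-05 12:00 /user/alice/.Trash
# drwxr-xr-x  - alice supergroup  0 2010-04-05 12:05 /user/alice/gutenberg
#
# The path before .Trash, /user/alice, is the default HDFS location;
# stripping the ".Trash" suffix in shell:
LISTED="/user/alice/.Trash"
echo "${LISTED%/.Trash}"    # prints "/user/alice"
```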


On 4/5/10 4:24 PM, "psdc1978" <psdc1978@gmail.com> wrote:


When I run a MapReduce example, I've noticed that some temporary directories are built in
the /tmp directory.

In my case, the following directory structure was created in the /tmp/hadoop directory during
the execution of the wordcount example:

|-- attempt_201004041803_0002_m_000000_0_0_m_0
|   |-- job.xml
|   |-- output
|   |   |-- file.out
|   |   `-- file.out.index
|   |-- pid
|   `-- split.dta

1 - In the map attempt task directory there are file.out and split.dta files. Is split.dta the
output produced by the map that will be fetched by the reducer?

2 - What are file.out and file.out.index?

3 - Is this data written by MR related to HDFS in any way?

4 - I'm a bit confused about the difference between the files that are written to the /tmp directory
during the execution of my example and the place where the files are written with the command
"bin/hadoop dfs -copyFromLocal".

a) When I execute the "bin/hadoop dfs -copyFromLocal <from> <to>" command, where's
the destination folder?

b) Is it in memory or physically on my HD?

c) If the files are written to the HD, in which directory are they?

d) What is the difference between the data written with the command -copyFromLocal and the
data written in the /tmp directory?

5 - The output of a reducer example comes in the form part_0000, which is written in gutenberg-output.
Where is this file? Is it on my HD?

Thank you,
