hadoop-hive-dev mailing list archives

From Gerrit van Vuuren <gvanvuu...@specificmedia.com>
Subject Re: Question about Hadoop task side-effect files//
Date Wed, 09 Jun 2010 07:24:10 GMT
Hi, 

Using frameworks like Pig and Hive already avoids this (unless you write your own stores/writers).
What these do is have each mapper or reducer (depending on whether your final data is written
from the map or the reduce stage) write to its own unique file on HDFS. Have a look at the
contents of a table in Hive, which is normally a folder on HDFS containing multiple files.
Inserting into a Hive table will just write another file to that folder.
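
If you're curious about the mechanism underneath, here is a minimal sketch (the class and
file names are made up, not from Hive or the book) using the org.apache.hadoop.mapreduce
API. Hadoop gives every task attempt its own working directory under the job's output
directory, and the FileOutputCommitter promotes that directory's contents (the familiar
part-00000, part-00001, ... files) to the final output only for the attempt that commits.
A retried or speculative attempt therefore can never clobber another attempt's files, and
side-effect files get the same protection if you write them into the attempt's work path:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical mapper that writes a side-effect file safely by placing it
// in the task attempt's working directory instead of the final output dir.
public class SideEffectMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // getWorkOutputPath returns a path unique to this task attempt,
        // e.g. <outdir>/_temporary/_attempt_..._m_000000_0/
        // Only the committed attempt's files are moved into <outdir>.
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        Path sideFile = new Path(workDir, "side-effect.dat"); // made-up name
        FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(sideFile);
        try {
            out.writeUTF("extra output written alongside the normal records");
        } finally {
            out.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, value); // normal map output is unaffected
    }
}

Two failed or speculative attempts each write their own copy of side-effect.dat in their
own attempt directories; the losing attempt's directory is simply discarded.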


----- Original Message -----
From: wuxy <wuxy@huawei.com>
To: hive-dev@hadoop.apache.org <hive-dev@hadoop.apache.org>
Sent: Wed Jun 09 07:08:22 2010
Subject: Question about Hadoop task side-effect files//


I found the following section at the end of Chapter 6 of the book "Hadoop:
The Definitive Guide":
--------------------
'Task side-effect files':
"Care needs to be taken to ensure that multiple instances of the same task
don't try to write to the same file. There are two problems to avoid: if a
task failed and was retried, then the old partial output would still be
present when the second task ran, and it would have to delete the old file
first. Second, with speculative execution enabled, two instances of the same
task could try to write to the same file simultaneously." 
-----------------------
In that description, "two instances of the same task could try to write to
the same file simultaneously" is a case that should be avoided.
Can anyone confirm this for me and, if possible, tell me the reason
behind it.

Thanks.

Steven Wu




