hadoop-mapreduce-issues mailing list archives

From "Gil Vernik (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes an object name
Date Sun, 05 Mar 2017 09:20:32 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gil Vernik updated MAPREDUCE-6854:
----------------------------------
    Description: 
Consider an example: a local file "/data/a.txt" needs to be copied to swift://container.service/data/a.txt

distcp first uploads "/data/a.txt" to swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0

Upon completion, distcp moves swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
to swift://container.mil01/data/a.txt
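
The two steps above can be sketched as follows. This is only an illustration, assuming a local filesystem as a stand-in for the Swift object store; the helper names current_temp_path and copy_via_temp are hypothetical and are not part of distcp.

```python
import os
import shutil

# Prefix taken from the example paths above.
TEMP_PREFIX = ".distcp.tmp."


def current_temp_path(target: str, attempt_id: str) -> str:
    """Build a temporary path next to the target, named only after the task attempt."""
    parent, _object_name = target.rsplit("/", 1)
    return f"{parent}/{TEMP_PREFIX}{attempt_id}"


def copy_via_temp(source: str, target: str, attempt_id: str) -> str:
    """Copy source into a temporary file, then rename it to the target.

    Hypothetical helper: the local filesystem stands in for the object store.
    Returns the temporary path that was used.
    """
    temp = current_temp_path(target, attempt_id)
    shutil.copyfile(source, temp)  # step 1: upload under the temporary name
    os.replace(temp, target)       # step 2: move to the final name
    return temp
```

Note that the temporary name carries no trace of "a.txt": only the task attempt ID distinguishes it.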
************************************
The temporary file naming convention assumes that each map task sequentially creates objects
named swift://container.mil01/.distcp.tmp.attempt_ID
and then renames them to their final names. Most Hadoop ecosystem components include the object
name as part of the temporary name; distcp does not.

This JIRA proposes adding a configuration key indicating that temporary objects should also include
the object name as part of their temporary file name.

For example,
"/data/a.txt" will be uploaded to
"swift://container.mil01/data/a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"

The name "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" does not affect flows in the access
drivers: "a.txt" is not treated as a sub-directory, so no special operations are
performed. The benefit is that other systems can expect names of the form "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"
and extract the value preceding ".distcp.tmp"
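
A minimal sketch of the proposed naming and of how another system might recover the object name. The flag include_object_name stands in for the proposed configuration key, whose actual name is not given here; temp_path and split_temp_name are hypothetical helpers, not distcp APIs.

```python
# Marker taken from the example names above.
TEMP_MARKER = ".distcp.tmp."


def temp_path(target: str, attempt_id: str, include_object_name: bool = False) -> str:
    """Build the temporary path for a target object.

    include_object_name is a hypothetical stand-in for the proposed
    configuration key: when set, the object name prefixes the temporary name.
    """
    if include_object_name:
        return f"{target}{TEMP_MARKER}{attempt_id}"
    parent, _object_name = target.rsplit("/", 1)
    return f"{parent}/{TEMP_MARKER}{attempt_id}"


def split_temp_name(path: str) -> tuple:
    """Return (object_name, attempt_id) parsed from a temporary path.

    object_name is empty when the name follows the current convention.
    """
    name = path.rsplit("/", 1)[-1]
    object_name, marker, attempt_id = name.rpartition(TEMP_MARKER)
    if not marker:
        raise ValueError(f"not a distcp temporary name: {name}")
    return object_name, attempt_id
```

With the flag set, split_temp_name returns "a.txt" as the object name for the example above; under the current convention it returns an empty string, which is the information gap this issue describes.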

  was:
Consider an example: a local file "/data/a.txt" needs to be copied to swift://container.service/data/a.txt

distcp first uploads "/data/a.txt" to swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0

Upon completion, distcp moves swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0
to swift://container.mil01/data/a.txt
************************************
The temporary file naming convention assumes that each map task sequentially creates objects
named swift://container.mil01/.distcp.tmp.attempt_ID
and then renames them to their final names. Most Hadoop ecosystem components include the object
name as part of the temporary name; distcp does not.

This JIRA proposes adding a configuration key indicating that temporary objects should also include
the object name as part of their temporary file name.

For example,
"/data/a.txt" will be uploaded to "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0/a.txt"
or
"swift://container.mil01/data/a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"

The name "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" does not affect flows in the drivers:
"a.txt" is not treated as a sub-directory, so no special operations are performed.
The benefit is that other systems can expect names of the form "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"
and extract the value preceding ".distcp.tmp"


> Each map task should create a unique temporary name that includes an object name
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6854
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>            Reporter: Gil Vernik



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


