crunch-dev mailing list archives

From "Attila Sasvari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-636) Make replication factor for temporary files configurable
Date Sat, 18 Feb 2017 13:48:44 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873169#comment-15873169 ]

Attila Sasvari commented on CRUNCH-636:
---------------------------------------

One approach to do this:
- in {{createTempPath()}} of {{DistributedPipeline}}: keep track of the temporary directories
created. We can add a new entry to the pipeline configuration for this, for example
"crunch.tmp.dirs" (a colon-separated set of directories).
- in {{MSCROutputHandler}}: introduce a new helper method to test whether we are dealing with
a temporary output directory. If so, set "dfs.replication" to the user-supplied "crunch.tmp.dir.replication".
MapReduce will then use this replication factor when producing output file(s) in the subsequent
"configureForMapReduce()". We also need to make sure that the original/default replication
factor is used for non-intermediate nodes. To do this, we can store it under something like
"dfs.replication.initial" the first time {{configure()}} of {{MSCROutputHandler}} is called,
and use that setting for leaf nodes.
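
To make the idea concrete, here is a rough sketch of the bookkeeping described above. It is not the patch and does not use the real Crunch or Hadoop classes: a plain {{Map}} stands in for the job configuration, and the class and method names ({{TmpReplicationSketch}}, {{trackTempDir}}, {{isTempOutput}}, {{configureReplication}}) are made up for illustration. It also assumes scheme-less paths, since a URI scheme would itself contain a colon, and falls back to 3 (the HDFS default) when "dfs.replication" is unset.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Map stands in for the Hadoop/Crunch configuration.
class TmpReplicationSketch {
    static final String TMP_DIRS = "crunch.tmp.dirs";
    static final String TMP_REPLICATION = "crunch.tmp.dir.replication";
    static final String DFS_REPLICATION = "dfs.replication";
    static final String INITIAL_REPLICATION = "dfs.replication.initial";

    // Step 1 (would live near createTempPath()): append the new temporary
    // directory to the colon-separated list kept in the configuration.
    static void trackTempDir(Map<String, String> conf, String dir) {
        String existing = conf.get(TMP_DIRS);
        boolean empty = existing == null || existing.isEmpty();
        conf.put(TMP_DIRS, empty ? dir : existing + ":" + dir);
    }

    // Step 2 helper: is this output path one of the tracked temp dirs,
    // or located underneath one of them?
    static boolean isTempOutput(Map<String, String> conf, String outputPath) {
        String dirs = conf.get(TMP_DIRS);
        if (dirs == null || dirs.isEmpty()) {
            return false;
        }
        for (String dir : dirs.split(":")) {
            if (outputPath.equals(dir) || outputPath.startsWith(dir + "/")) {
                return true;
            }
        }
        return false;
    }

    // Step 2 (would live near the output handler's configure()): remember the
    // original replication once, then pick the factor per output path.
    static void configureReplication(Map<String, String> conf, String outputPath) {
        if (!conf.containsKey(INITIAL_REPLICATION)) {
            // Assumption: 3 is used when dfs.replication was never set.
            conf.put(INITIAL_REPLICATION, conf.getOrDefault(DFS_REPLICATION, "3"));
        }
        String initial = conf.get(INITIAL_REPLICATION);
        if (isTempOutput(conf, outputPath)) {
            // Intermediate output: use the user-supplied factor if present.
            conf.put(DFS_REPLICATION, conf.getOrDefault(TMP_REPLICATION, initial));
        } else {
            // Leaf node output: restore the original factor.
            conf.put(DFS_REPLICATION, initial);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DFS_REPLICATION, "3");
        conf.put(TMP_REPLICATION, "1");

        trackTempDir(conf, "/tmp/crunch-123/p1");
        trackTempDir(conf, "/tmp/crunch-123/p2");

        configureReplication(conf, "/tmp/crunch-123/p2");
        System.out.println(conf.get(DFS_REPLICATION)); // prints 1

        configureReplication(conf, "/user/alice/output");
        System.out.println(conf.get(DFS_REPLICATION)); // prints 3
    }
}
```

The point of storing "dfs.replication.initial" only once is that later calls would otherwise see the already lowered "dfs.replication" and treat it as the default for leaf nodes.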

I will attach a patch as soon as possible.

> Make replication factor for temporary files configurable
> --------------------------------------------------------
>
>                 Key: CRUNCH-636
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-636
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>            Assignee: Attila Sasvari
>
> As of now, Crunch does not allow different replication factors for temporary files
> and non-temporary files (e.g. final output data of leaf nodes) at the same time. If a user
> has a large amount of data (say hundreds of gigabytes) to process, they might want to use
> a lower replication factor for the large temporary files passed between Crunch jobs.
> We could make this configurable via a new setting (e.g. {{crunch.tmp.dir.replication}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
