crunch-dev mailing list archives

From "Ryan Brush (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-481) Support independent output committers for multiple outputs
Date Thu, 05 Feb 2015 21:34:35 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Brush updated CRUNCH-481:
------------------------------
    Attachment: CRUNCH-481-hadoop-2-compat.patch

The exception I got above occurred because Kite's output committer uses the job ID to name
its temporary staging area, so multiple outputs that share the same job ID collided. (I'm
not very familiar with the committer logic, but for some reason this wasn't exposed when
running against Hadoop 1.)

I've attached a patch that works around this by "decorating" the ID in the Job instance that
is fabricated for each output with the output name itself. So the job names seen by the
output format would be job_12345_out0, job_12345_out1, and so on. This avoids the name
collision and works with both Hadoop 1 and 2 builds. All Crunch tests pass as well.
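
To illustrate the decoration idea, here is a minimal sketch. It is not the code from the
attached patch: it uses plain strings rather than Hadoop's Job and job ID classes, and the
names JobIdDecorator, decorate, and stagingDir, as well as the /tmp/staging path, are
assumptions made purely for illustration.

```java
// Sketch only: plain strings stand in for Hadoop's job ID objects.
class JobIdDecorator {

    // Hypothetical helper: append the output name to the job ID.
    static String decorate(String jobId, String outputName) {
        return jobId + "_" + outputName;
    }

    // Hypothetical staging layout keyed on the job ID, to show why shared IDs collide.
    static String stagingDir(String jobId) {
        return "/tmp/staging/" + jobId;
    }

    public static void main(String[] args) {
        String jobId = "job_12345";
        String[] outputs = {"out0", "out1"};

        for (String out : outputs) {
            // Undecorated: every output resolves to the same staging directory.
            System.out.println(out + " shared:    " + stagingDir(jobId));
            // Decorated: each output gets its own staging directory.
            System.out.println(out + " decorated: " + stagingDir(decorate(jobId, out)));
        }
    }
}
```

With the undecorated ID both outputs map to the same staging path, while the decorated IDs
give each output a distinct one.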

Is this a good approach? The alternative would be to change Kite to use something besides
the job ID for its temporary output location.

> Support independent output committers for multiple outputs
> ----------------------------------------------------------
>
>                 Key: CRUNCH-481
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-481
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Aniket Kulkarni
>            Assignee: Josh Wills
>            Priority: Minor
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-481-hadoop-2-compat.patch, CRUNCH-481.patch, CRUNCH-481.patch, CRUNCH-481.patch, CRUNCH-481c.patch
>
>
> I faced this issue while trying to write to Kite and HDFS in the same pipeline. A similar
> issue was logged for Kite[1][2].
> I was attempting to write a PCollection to Kite and a different PTable to HDFS as a text
> file. The write to Kite succeeded, however the write to HDFS only produced a _SUCCESS file
> with no text file.
> Commenting out the write to Kite seems to solve the issue and I can see the text file
> being written.
> [1] - https://issues.cloudera.org/browse/CDK-756
> [2] - http://mail-archives.apache.org/mod_mbox/crunch-dev/201401.mbox/%3CCAF-WD4QCUe0Toh3qewpDNnom3u786PVJLgH7T6Go_AbcTpLTaw@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
