Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Tue, 25 Jun 2013 19:11:19 +0000 (UTC)
From: "Florian Laws (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12654787.1372187377558.172814.1372187479941@arcas>
In-Reply-To: <JIRA.12654787.1372187377558@arcas>
References: <JIRA.12654787.1372187377558@arcas>
Subject: [jira] [Created] (CRUNCH-227) Write to sequence file ignores
 destination path.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Florian Laws created CRUNCH-227:
-----------------------------------

             Summary: Write to sequence file ignores destination path.
                 Key: CRUNCH-227
                 URL: https://issues.apache.org/jira/browse/CRUNCH-227
             Project: Crunch
          Issue Type: Bug
          Components: IO
    Affects Versions: 0.6.0, 0.7.0
         Environment: Hadoop 1.0.3
            Reporter: Florian Laws


I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.

The job runs successfully, but the output is written to a Crunch working directory instead of the path that I pass to To.sequenceFile().

This happens both when running the job locally and on my 1-node Hadoop test cluster, and with both Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).

When using pipeline.done() instead of pipeline.run(), the Crunch working directory is removed after execution, so in that case the output is not retained at all.


Code snippet:
---

public int run(String[] args) throws IOException {
  CommandLine cl = parseCommandLine(args);
  Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
  int docIdIndex = getColumnIndex(cl, "DocID");
  int ldaIndex = getColumnIndex(cl, "LDA");

  Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
  pipeline.setConfiguration(getConf());
  PCollection<String> lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
  PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
    new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
    tableOf(strings(), writables(NamedQuantizedVecWritable.class)));

  vectors.write(To.sequenceFile(output));

  PipelineResult res = pipeline.run();
  return res.succeeded() ? 0 : 1;
}
---

Log output from a local run.
Note how the intended output path "/tmp/foo.seq" is reported in the
execution plan but is not actually used.
---

2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore
2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq
2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x
2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP

2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"

2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/
2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now

2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output

2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.

---
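
To make the comparison in the log above explicit, here is a minimal sketch in plain Java (no Hadoop or Crunch dependency) that pulls out the path announced by FileTargetImpl and the path reported by FileOutputCommitter. The class name OutputPathCheck and its methods are hypothetical, and the patterns assume the exact log wording shown above. It only reads the log text; it says nothing about what happens to the files after the task commits.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the announced and the committed output path from a log. */
final class OutputPathCheck {
  // \s+ also matches line breaks, so log lines that were wrapped by a mail client still match.
  private static final Pattern PLANNED =
      Pattern.compile("Will write output files\\s+to new path:\\s+(\\S+)");
  private static final Pattern SAVED =
      Pattern.compile("Saved output of\\s+task\\s+'[^']+'\\s+to\\s+(\\S+)");

  private static Optional<String> firstGroup(Pattern pattern, String log) {
    Matcher m = pattern.matcher(log);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  /** Path from the first "Will write output files to new path" line, if any. */
  static Optional<String> plannedPath(String log) {
    return firstGroup(PLANNED, log);
  }

  /** Path from the first "Saved output of task ... to" line, if any. */
  static Optional<String> savedPath(String log) {
    return firstGroup(SAVED, log);
  }

  /** True only when both lines are present and name different paths. */
  static boolean pathsDiffer(String log) {
    Optional<String> planned = plannedPath(log);
    Optional<String> saved = savedPath(log);
    return planned.isPresent() && saved.isPresent()
        && !planned.get().equals(saved.get());
  }

  public static void main(String[] args) {
    // Two lines copied from the log above.
    String log =
        "2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files"
            + " to new path: /tmp/foo.seq\n"
            + "2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of"
            + " task 'attempt_local_0001_m_000000_0' to"
            + " /tmp/crunch-1128974463/p1/output\n";
    System.out.println("planned: " + plannedPath(log).orElse("(not found)"));
    System.out.println("saved:   " + savedPath(log).orElse("(not found)"));
    System.out.println("differ:  " + pathsDiffer(log));
  }
}
```

Only the first occurrence of each line is inspected, which is enough for a single-job run like this one; a pipeline with several jobs would need to collect all matches.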


This crude patch makes the output end up at the right place,
but breaks a lot of other tests.
---

--- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
+++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
@@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
   protected void configureForMapReduce(Job job, Class keyClass, Class valueClass,
       Class outputFormatClass, Path outputPath, String name) {
     try {
-      FileOutputFormat.setOutputPath(job, outputPath);
+      FileOutputFormat.setOutputPath(job, path);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }

---
