crunch-dev mailing list archives

From "Florian Laws (JIRA)"
Subject [jira] [Created] (CRUNCH-227) Write to sequence file ignores destination path.
Date Tue, 25 Jun 2013 19:11:19 GMT
Florian Laws created CRUNCH-227:

             Summary: Write to sequence file ignores destination path.
                 Key: CRUNCH-227
             Project: Crunch
          Issue Type: Bug
          Components: IO
    Affects Versions: 0.6.0, 0.7.0
         Environment: Hadoop 1.0.3
            Reporter: Florian Laws

I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom

The job runs successfully, but the output is written to a Crunch working directory
instead of the path that I specify in To.sequenceFile().

This happens when running the job both locally and on my 1-node Hadoop
test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).

When using pipeline.done() instead of, the Crunch working directory gets removed
after execution, so in that case the output is not retained at all.

Code snippet:

public int run(String[] args) throws IOException {
  CommandLine cl = parseCommandLine(args);
  Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
  int docIdIndex = getColumnIndex(cl, "DocID");
  int ldaIndex = getColumnIndex(cl, "LDA");

  Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
  PCollection<String> lines = pipeline.readTextFile((String)
  PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
    new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
    tableOf(strings(), writables(NamedQuantizedVecWritable.class)));


  PipelineResult res =;
  return res.succeeded() ? 0 : 1;
}

Log output from a local run.
Note how the intended output path "/tmp/foo.seq" is reported in the
execution plan but is not actually used.

2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
from SCDynamicStore
2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
to new path: /tmp/foo.seq
2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User
classes may not be found. See JobConf(Class) or
2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
with rwxr-xr-x
2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
/tmp/crunch-1128974463/p1/MAP as
2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
/tmp/crunch-1128974463/p1/MAP as

2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job

2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
available at: http://localhost:8080/
2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
is done. And is in the process of commiting
2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
is allowed to commit now

2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
task 'attempt_local_0001_m_000000_0' to

2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.


This crude patch makes the output end up at the right place,
but breaks a lot of other tests.

--- a/crunch-core/src/main/java/org/apache/crunch/io/impl/
+++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/
@@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
   protected void configureForMapReduce(Job job, Class keyClass, Class
       Class outputFormatClass, Path outputPath, String name) {
     try {
-      FileOutputFormat.setOutputPath(job, outputPath);
+      FileOutputFormat.setOutputPath(job, path);
     } catch (Exception e) {
       throw new RuntimeException(e);
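
To make the effect of the one-line change easier to see, here is a toy model in plain Java.
It is not Crunch code: the class ToyTarget and its two methods are hypothetical stand-ins,
and the sketch assumes that the outputPath argument passed in by the caller is the Crunch
working directory, while the path field holds the destination given by the user.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TargetPathSketch {

    // Hypothetical stand-in for a file target; NOT the real FileTargetImpl.
    static final class ToyTarget {
        private final Path path; // destination the user asked for

        ToyTarget(Path path) {
            this.path = path;
        }

        // Models the unpatched line: the job writes wherever the caller says.
        Path jobOutputBeforePatch(Path outputPath) {
            return outputPath;
        }

        // Models the patched line: the caller's argument is ignored
        // and the user's destination is used directly.
        Path jobOutputAfterPatch(Path outputPath) {
            return path;
        }
    }

    public static void main(String[] args) {
        ToyTarget target = new ToyTarget(Paths.get("/tmp/foo.seq"));
        // Assumed working directory, shaped like the one in the log above.
        Path working = Paths.get("/tmp/crunch-1128974463/p1");

        System.out.println("before patch: " + target.jobOutputBeforePatch(working));
        System.out.println("after patch:  " + target.jobOutputAfterPatch(working));
    }
}
```

In this model the patched method discards the path supplied by the caller. That may be
related to the other tests breaking, but the report does not establish the cause.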


