Return-Path: X-Original-To: apmail-crunch-dev-archive@www.apache.org Delivered-To: apmail-crunch-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C1EF5E9A0 for ; Tue, 25 Jun 2013 19:11:20 +0000 (UTC) Received: (qmail 65280 invoked by uid 500); 25 Jun 2013 19:11:20 -0000 Delivered-To: apmail-crunch-dev-archive@crunch.apache.org Received: (qmail 64957 invoked by uid 500); 25 Jun 2013 19:11:20 -0000 Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@crunch.apache.org Delivered-To: mailing list dev@crunch.apache.org Received: (qmail 64940 invoked by uid 500); 25 Jun 2013 19:11:20 -0000 Delivered-To: apmail-incubator-crunch-dev@incubator.apache.org Received: (qmail 64936 invoked by uid 99); 25 Jun 2013 19:11:20 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Jun 2013 19:11:20 +0000 Date: Tue, 25 Jun 2013 19:11:19 +0000 (UTC) From: "Florian Laws (JIRA)" To: crunch-dev@incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (CRUNCH-227) Write to sequence file ignores destination path. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Florian Laws created CRUNCH-227: ----------------------------------- Summary: Write to sequence file ignores destination path. Key: CRUNCH-227 URL: https://issues.apache.org/jira/browse/CRUNCH-227 Project: Crunch Issue Type: Bug Components: IO Affects Versions: 0.6.0, 0.7.0 Environment: Hadoop 1.0.3 Reporter: Florian Laws I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable. The job runs successfully, but the output is not written to the path that I specify in To.sequenceFile(), but instead to a Crunch working directory. This happens when running the job both locally and on my 1-node Hadoop test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5). When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution, in that case, the output is not retained at all. Code snippet: --- public int run(String[] args) throws IOException { CommandLine cl = parseCommandLine(args); Path output = new Path((String) cl.getValue(OUTPUT_OPTION)); int docIdIndex = getColumnIndex(cl, "DocID"); int ldaIndex = getColumnIndex(cl, "LDA"); Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class); pipeline.setConfiguration(getConf()); PCollection lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION)); PTable vectors = lines.parallelDo( new ConvertToSeqFileDoFn(docIdIndex, ldaIndex), tableOf(strings(), writables(NamedQuantizedVecWritable.class))); vectors.write(To.sequenceFile(output)); PipelineResult res = pipeline.run(); return res.succeeded() ? 0 : 1; } --- Log output from local run. Note how the intended output path "/tmp/foo.seq" is reported in the execution plan, is not actually used. --- 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)" 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/ 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 2013-06-25 16:19:45 LocalJobRunner:321 [INFO] 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output 2013-06-25 16:19:48 LocalJobRunner:321 [INFO] 2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done. --- This crude patch makes the output end up at the right place, but breaks a lot of other tests. --- --- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java +++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget { protected void configureForMapReduce(Job job, Class keyClass, Class valueClass, Class outputFormatClass, Path outputPath, String name) { try { - FileOutputFormat.setOutputPath(job, outputPath); + FileOutputFormat.setOutputPath(job, path); } catch (Exception e) { throw new RuntimeException(e); } --- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira