crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <josh.wi...@gmail.com>
Subject Re: Write to sequence file ignores destination path.
Date Tue, 25 Jun 2013 14:37:13 GMT
Hey Florian,

At first glance, it seems like a bug to me. I'm curious if the result is
any different if you swap in pipeline.done() for pipeline.run()?

J


On Tue, Jun 25, 2013 at 7:30 AM, Florian Laws <florian@florianlaws.de>wrote:

> Hi all,
>
> I'm trying to write a simple Crunch job that outputs a sequence file
> consisting of a custom Writable.
>
> The job runs successfully, but the output is not written to the path
> that I specify in To.sequenceFile(),
> but instead to a Crunch working directory.
>
> This happens when running the job both locally and on my 1-node Hadoop
> test cluster,
> and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today
> (38a97e5).
>
> Code snippet:
> ---
>
> public int run(String[] args) throws IOException {
>   CommandLine cl = parseCommandLine(args);
>   Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
>   int docIdIndex = getColumnIndex(cl, "DocID");
>   int ldaIndex = getColumnIndex(cl, "LDA");
>
>   Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
>   pipeline.setConfiguration(getConf());
>   PCollection<String> lines = pipeline.readTextFile((String)
> cl.getValue(INPUT_OPTION));
>   PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
>     new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
>     tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
>
>   vectors.write(To.sequenceFile(output));
>
>   PipelineResult res = pipeline.run();
>   return res.succeeded() ? 0 : 1;
> }
> ---
>
> Log output from local run.
> Note how the intended output path "/tmp/foo.seq" is reported in the
> execution plan,
> is not actually used.
> ---
>
> 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
> from SCDynamicStore
> 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
> 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
> to new path: /tmp/foo.seq
> 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User
> classes may not be found. See JobConf(Class) or
> JobConf#setJar(String).
> 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to
> process : 1
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
> MAP in
> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
> with rwxr-xr-x
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
> /tmp/crunch-1128974463/p1/MAP as
>
> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
> /tmp/crunch-1128974463/p1/MAP as
>
> /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
>
> 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job
> "com.issuu.mahout.utils.DbDumpToSeqFile:
> Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
>
> 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
> available at: http://localhost:8080/
> 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
> is done. And is in the process of commiting
> 2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
> is allowed to commit now
>
> 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
> task 'attempt_local_0001_m_000000_0' to
> /tmp/crunch-1128974463/p1/output
>
> 2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0'
> done.
>
> ---
>
>
> This crude patch makes the output end up at the right place,
> but breaks a lot of other tests.
> ---
>
> ---
> a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> +++
> b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
>    protected void configureForMapReduce(Job job, Class keyClass, Class
> valueClass,
>        Class outputFormatClass, Path outputPath, String name) {
>      try {
> -      FileOutputFormat.setOutputPath(job, outputPath);
> +      FileOutputFormat.setOutputPath(job, path);
>      } catch (Exception e) {
>        throw new RuntimeException(e);
>      }
>
> ---
>
>
> Am I doing something wrong, or is this a bug?
>
> Best,
>
> Florian
>

Mime
View raw message