crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-597) Unable to process parquet files using Hadoop
Date Thu, 24 Mar 2016 16:57:25 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-597:
------------------------------
    Attachment: CRUNCH-597.patch

Patch for this, which ended up being a bit more work than I had initially hoped. Barring any
objections to the upgrade (I think it's fair to say that it's time, and that it's okay to
do this for 0.14.0, but let me know if I'm wrong about that), I'll commit this tomorrow.
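
For context, the dependency change in a Maven build might look roughly like the sketch below. The coordinates and version here are illustrative assumptions, not taken from the patch itself; newer Apache Parquet releases moved to the org.apache.parquet groupId and package namespace, which is what makes the upgrade more than a version bump.

```xml
<!-- Hypothetical sketch: newer Apache Parquet artifacts live under the
     org.apache.parquet groupId (and package namespace); the version shown
     is illustrative, not necessarily what the patch uses. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.8.1</version>
</dependency>
```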

> Unable to process parquet files using Hadoop
> --------------------------------------------
>
>                 Key: CRUNCH-597
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-597
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core, IO
>    Affects Versions: 0.13.0
>            Reporter: Stan Rosenberg
>            Assignee: Josh Wills
>         Attachments: CRUNCH-597.patch
>
>
> The current version of parquet-hadoop results in the following stack trace while attempting to read from a parquet file.
> {code}
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
> 	at org.apache.crunch.impl.mr.run.CrunchRecordReader.initialize(CrunchRecordReader.java:140)
> 	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
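
The failure mode above can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins, not the real Hadoop or Parquet types; they mirror the relevant hierarchy (ParquetInputSplit extends FileSplit, which extends the generic split type) and the old reader's unconditional downcast, which throws when the framework hands over a plain FileSplit.

```java
// Hypothetical stand-ins for the real Hadoop/Parquet classes, reproducing
// the failure mode: a reader that blindly downcasts the generic split type.
class InputSplit {}
class FileSplit extends InputSplit {}          // what the framework actually passes
class ParquetInputSplit extends FileSplit {}   // what the old reader insists on

public class CastDemo {
    // Mirrors the shape of the old ParquetRecordReader.initialize: it receives
    // a generic InputSplit but casts it straight to ParquetInputSplit.
    static void initialize(InputSplit split) {
        ParquetInputSplit p = (ParquetInputSplit) split; // fails for a plain FileSplit
    }

    public static void main(String[] args) {
        try {
            initialize(new FileSplit()); // CrunchRecordReader hands over a FileSplit
            System.out.println("no exception");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
        }
    }
}
```

Newer Parquet record readers accept any InputSplit and extract the file information themselves, which is why the upgrade resolves the error.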
> Here is the relevant code snippet, which yields the above stack trace when executed locally:
> {code}
> Pipeline pipeline = new MRPipeline(Crunch.class, conf);
> PCollection<Pair<String, Observation>> observations =
>     pipeline.read(AvroParquetFileSource.builder(record).build(new Path(args[0])))
>         .parallelDo(new TranslateFn(), Avros.tableOf(Avros.strings(), Avros.specifics(Observation.class)));
> for (Pair<String, Observation> pair : observations.materialize()) {
>   System.out.println(pair.second());
> }
> PipelineResult result = pipeline.done();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
