spark-issues mailing list archives

From "Nira Amit (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
Date Sun, 05 Mar 2017 22:30:32 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896581#comment-15896581 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

I found a problem in my schema and managed to load my custom type. So the answer to my original
question is basically no, there is nothing like 
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
for loading custom types into RDDs with the Java API. We have to create all the wrapper classes
and implement our own RecordReader.

I think this should be documented somewhere.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>
>                 Key: SPARK-19656
>                 URL: https://issues.apache.org/jira/browse/SPARK-19656
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from Avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}
> public static class MyCustomAvroReader
>         extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
>             InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
>             throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>                 sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>                         MyCustomInputFormat.class, MyCustomAvroKey.class,
>                         NullWritable.class,
>                         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `records.first()._1.datum()` actually returns a `GenericData$Record` at runtime, not a `MyCustomClass` instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?
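
A note on why the code in the description compiles and then fails only at runtime: Java generics are erased, so a declaration like `AvroKey<MyCustomClass>` does not by itself force the reader to build `MyCustomClass` instances. The compiler inserts the cast at the point where `datum()` is assigned to a `MyCustomClass` variable, and that is where the `ClassCastException` surfaces. Below is a minimal plain-Java sketch of that mechanism. The names `Wrapper`, `GenericRecordLike`, and `readNext` are hypothetical stand-ins (for `AvroKey`, `GenericData.Record`, and a record reader that produces generic records); they are not Avro or Spark APIs, and the sketch does not show how Avro chooses which record type to build.

```java
public class ErasureDemo {
    // Hypothetical stand-in for a generic wrapper such as AvroKey<T>.
    static class Wrapper<T> {
        private final T datum;
        Wrapper(T datum) { this.datum = datum; }
        T datum() { return datum; }
    }

    // Hypothetical stand-in for the custom record class.
    static class MyCustomClass {
        String getSomeCustomField() { return "value"; }
    }

    // Hypothetical stand-in for a schema-driven generic record.
    static class GenericRecordLike { }

    // Simulates a reader whose declared type promises MyCustomClass
    // but which actually wraps a generic record.
    @SuppressWarnings("unchecked")
    static Wrapper<MyCustomClass> readNext() {
        Wrapper<?> raw = new Wrapper<Object>(new GenericRecordLike());
        // Unchecked cast: the type argument is erased, so nothing is verified here.
        return (Wrapper<MyCustomClass>) raw;
    }

    // Returns the field value, or the string "ClassCastException" if the cast fails.
    static String tryRead() {
        Wrapper<MyCustomClass> key = readNext(); // succeeds, no check at this point
        try {
            // The compiler-inserted cast to MyCustomClass happens on this assignment.
            MyCustomClass first = key.datum();
            return first.getSomeCustomField();
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRead());
    }
}
```

In this sketch the failure appears on the assignment from `datum()`, not where the wrapper is created or returned, which matches the exception being raised on the `MyCustomClass first = ...` line in the description.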



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

