spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
Date Sun, 05 Mar 2017 13:48:33 GMT


Sean Owen commented on SPARK-19656:

PS I should be concrete about why I think the original code doesn't work -- it doesn't compile,
because you're using newAPIHadoopFile whereas the example you follow uses hadoopFile. If you
adjusted that, then I think you'd get back an Avro GenericRecord, as expected: Avro stores its
own record type in the file, not your objects, so you need to get() your data out of it.
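A minimal Java sketch of that adjustment, assuming spark-core and the avro-mapred (hadoop2) artifact are on the classpath, and reusing the hypothetical MyClassInAvroFile class and file path from the issue:

```java
// Sketch only: assumes spark-core and avro-mapred (hadoop2 classifier) on the
// classpath, plus the hypothetical MyClassInAvroFile class and file path
// from the issue.
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AvroRddSketch {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "avro-rdd-sketch");
        // hadoopFile (old mapred API), mirroring the Scala example in the issue.
        // The raw Class casts work around Java's generic class literals.
        JavaPairRDD<AvroWrapper<MyClassInAvroFile>, NullWritable> records =
                sc.hadoopFile("file:/path/to/datafile.avro",
                        (Class<AvroInputFormat<MyClassInAvroFile>>) (Class<?>) AvroInputFormat.class,
                        (Class<AvroWrapper<MyClassInAvroFile>>) (Class<?>) AvroWrapper.class,
                        NullWritable.class);
        // Without a reader schema the datum is a GenericRecord at runtime,
        // so read fields with get("fieldName") rather than casting.
        Object datum = records.first()._1.datum();
        System.out.println(datum);
        sc.stop();
    }
}
```

AvroInputFormat implements the old-API InputFormat<AvroWrapper<T>, NullWritable>, which is why it pairs with hadoopFile rather than newAPIHadoopFile.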

But that's an issue in your code. I think the reason this went to DataFrame / Dataset is that
there is first-class support for Avro there, where your types get unpacked. That's the better
way to do this anyway, although there shouldn't be much reason you can't do this with RDDs if
you want to.
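The DataFrame route could be sketched like this, assuming Spark 2.0.x with the external com.databricks:spark-avro package (Avro was not a built-in data source at that time) and a hypothetical file path:

```java
// Sketch only: assumes Spark 2.0.x with the external com.databricks:spark-avro
// package on the classpath, and a hypothetical file path.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroDataFrameSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("avro-df-sketch")
                .getOrCreate();
        // Each Avro record is unpacked into a Row whose columns follow the
        // Avro schema, so no custom key or wrapper classes are needed.
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.avro")
                .load("file:/path/to/datafile.avro");
        df.printSchema();
        spark.stop();
    }
}
```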

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>                 Key: SPARK-19656
>                 URL:
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Nira Amit
> If I understand correctly, in Scala it's possible to load custom objects from Avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}
> public static class MyCustomAvroReader extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>                 sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>                         MyCustomInputFormat.class, MyCustomAvroKey.class,
>                         NullWritable.class,
>                         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` actually returns a `GenericData$Record` at runtime, not a `MyCustomClass` instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?

This message was sent by Atlassian JIRA
