flink-user mailing list archives

From Lukas Kircher <lukas.kirc...@uni-konstanz.de>
Subject Problems reading Parquet input from HDFS
Date Mon, 24 Apr 2017 16:19:04 GMT
Hello,

I am trying to read Parquet files from HDFS and am running into problems. I use Avro for the
schema. Here is a basic example:

import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Wrap AvroParquetInputFormat in Flink's Hadoop compatibility layer.
    Job job = Job.getInstance();
    HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
        new AvroParquetInputFormat(), Void.class, Customer.class, job);
    FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(
        "/tmp/tpchinput/01/customer_parquet"));

    // Project onto a single column, c_custkey.
    Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
    List<Schema.Field> fields = Arrays.asList(
        new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
    );
    projection.setFields(fields);
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
    dataset.print();
}
If I submit this to the job manager I get the following stack trace:

java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
	at misc.Misc.main(Misc.java:29)

The problem is that I use the parquet-avro dependency (which provides AvroParquetInputFormat)
in version 1.9.0, which relies on avro 1.8.0, while flink-core itself relies on avro in
version 1.7.7. Just FYI, the dependency tree looks like this:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
[INFO] ...:1.0-SNAPSHOT
[INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
[INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
[INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
[INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
[INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
[INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
[INFO]    \- org.apache.avro:avro:jar:1.8.0:compile

Fixing the above NoSuchMethodError just leads to further problems. Downgrading parquet-avro
to an older version creates other conflicts, as there is no parquet-avro release that uses
avro 1.7.7 like Flink does.
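As a side note, the projection schema itself can be built without calling the Schema.Field
constructor whose signature changed between Avro 1.7.7 and 1.8.0, e.g. via SchemaBuilder
(untested sketch; this only sidesteps that one constructor, not the underlying classpath
conflict):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ProjectionSketch {
    // Builds the single-column projection via SchemaBuilder, which exists
    // in both Avro 1.7.7 and 1.8.0, instead of invoking the Schema.Field
    // constructor that changed between the two versions.
    static Schema customerProjection() {
        return SchemaBuilder.record("Customer")
            .fields()
            .requiredInt("c_custkey")
            .endRecord();
    }

    public static void main(String[] args) {
        System.out.println(customerProjection());
    }
}
```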

Is there a way around this or can you point me to another approach to read Parquet data from
HDFS? How do you normally go about this?
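For reference, one workaround I am considering is relocating Avro inside my own fat jar with
the maven-shade-plugin, so that parquet-avro uses its own avro 1.8.0 copy and Flink keeps
1.7.7 (untested sketch; the plugin version and shaded package name are placeholders):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Rewrite references to org.apache.avro in my classes and in
               parquet-avro so they hit the bundled 1.8.0, not Flink's 1.7.7. -->
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```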

Thanks for your help,
Lukas



