flink-user mailing list archives

From: Flavio Pompermaier <pomperma...@okkam.it>
Subject: Re: Problems reading Parquet input from HDFS
Date: Fri, 28 Apr 2017 16:38:03 GMT
Hi Lukas,
a colleague of mine issued a PR to update the code of that example (
https://github.com/FelixNeutatz/parquet-flinktacular/pull/2).

We updated Avro to 1.8.1 and the example worked fine (we haven't tested it on
the cluster yet).
Let me know if it helps.
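If it's useful, the same pin can be expressed in the consuming project's
pom.xml. This is only a sketch assuming a plain Maven build; 1.8.1 is the
version we moved to:

```xml
<!-- Sketch: force a single Avro version so flink-core and parquet-avro
     resolve to the same artifact. Adjust the version to your build. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.8.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```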

Best,
Flavio

On Tue, Apr 25, 2017 at 1:53 PM, Lukas Kircher <
lukas.kircher@uni-konstanz.de> wrote:

> Thanks for your suggestions.
>
> @Flavio
> This is very similar to the code I use and yields basically the same
> problems. The examples are based on flink-1.0-SNAPSHOT and avro-1.7.6,
> which is more than three years old. Do you have a working setup with newer
> versions of Avro and Flink?
>
> @Jörn
> I tried to do that, but I can't see how to get around the
> AvroParquetInputFormat (see below). I can pass a schema for projection as a
> string, but then I get a NullPointerException because no ReadSupport class
> is available in ParquetInputFormat. There is a constructor that
> instantiates ParquetInputFormat with a class extending ReadSupport, but I
> haven't found a suitable one to pass to it. Do you know of a way around
> this?
>
>
>   public static void main(String[] args) throws Exception {
>       ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
>       Job job = Job.getInstance();
>       HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(new ParquetInputFormat(), Void.class,
>           Customer.class, job);
>       FileInputFormat.addInputPath((JobConf) job.getConfiguration(), new org.apache.hadoop.fs.Path(
>           "/tmp/tpchinput/01/customer_parquet"));
>       job.getConfiguration().set("parquet.avro.projection",
>           "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[{\"name\":\"c_custkey\",\"type\":\"int\"}]}");
>       env.createInput(hif).print();
>   }
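For what it's worth, parquet-avro ships AvroReadSupport, which extends
ReadSupport and should be a suitable class for that constructor. The
following is an untested sketch along the lines of the snippet above (class
and method names are from parquet 1.9.0; the input path and projection
string are the ones already used in this thread), so treat it as a starting
point rather than a verified fix:

```java
// Untested sketch: wire the raw ParquetInputFormat to AvroReadSupport
// instead of going through AvroParquetInputFormat.
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class RawParquetRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        Job job = Job.getInstance();

        // AvroReadSupport extends ReadSupport, so it can be passed to the
        // ParquetInputFormat constructor that takes a read-support class.
        ParquetInputFormat<GenericRecord> pif =
            new ParquetInputFormat<>(AvroReadSupport.class);

        HadoopInputFormat<Void, GenericRecord> hif =
            new HadoopInputFormat<>(pif, Void.class, GenericRecord.class, job);
        FileInputFormat.addInputPath(job, new Path("/tmp/tpchinput/01/customer_parquet"));

        // The projection can still be passed as a schema string;
        // AvroReadSupport reads the same "parquet.avro.projection" property.
        job.getConfiguration().set("parquet.avro.projection",
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[{\"name\":\"c_custkey\",\"type\":\"int\"}]}");

        env.createInput(hif).print();
    }
}
```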
>
>
> I am pretty sure I am missing something very basic. Let me know if you
> need any additional information.
>
> Thanks ...
>
>
>
> On 24 Apr 2017, at 20:51, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>
> I started from this guide:
>
> https://github.com/FelixNeutatz/parquet-flinktacular
>
> Best,
> Flavio
>
> On 24 Apr 2017 6:36 pm, "Jörn Franke" <jornfranke@gmail.com> wrote:
>
>> Why not use a Parquet-only format? Not sure why you need an
>> AvroParquetInputFormat.
>>
>> On 24. Apr 2017, at 18:19, Lukas Kircher <lukas.kircher@uni-konstanz.de>
>> wrote:
>>
>> Hello,
>>
>> I am trying to read Parquet files from HDFS and am having problems. I use
>> Avro for the schema. Here is a basic example:
>>
>> public static void main(String[] args) throws Exception {
>>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>
>>     Job job = Job.getInstance();
>>     HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(new AvroParquetInputFormat(), Void.class,
>>         Customer.class, job);
>>     FileInputFormat.addInputPath((JobConf) job.getConfiguration(), new org.apache.hadoop.fs.Path(
>>         "/tmp/tpchinput/01/customer_parquet"));
>>     Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
>>     List<Schema.Field> fields = Arrays.asList(
>>         new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
>>     );
>>     projection.setFields(fields);
>>     AvroParquetInputFormat.setRequestedProjection(job, projection);
>>
>>     DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
>>     dataset.print();
>> }
>>
>> If I submit this to the job manager I get the following stack trace:
>>
>> java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
>>     at misc.Misc.main(Misc.java:29)
>>
>> The problem is that I use the parquet-avro dependency (which provides
>> AvroParquetInputFormat) in version 1.9.0, which relies on avro 1.8.0.
>> flink-core itself relies on avro 1.7.7. FYI, the dependency tree looks
>> like this:
>>
>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
>> [INFO] ...:1.0-SNAPSHOT
>> [INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
>> [INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
>> [INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
>> [INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
>> [INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
>> [INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
>> [INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
>>
>> Fixing the above NoSuchMethodError just leads to further problems.
>> Downgrading parquet-avro to an older version creates other conflicts, as
>> there is no release that uses avro 1.7.7 like Flink does.
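One general way around this kind of clash, if aligning versions doesn't work
out, is to shade and relocate Avro inside the job jar so it cannot collide
with the 1.7.7 copy on Flink's classpath. A sketch, not verified against
this particular setup; the relocated package name is an arbitrary choice:

```xml
<!-- Sketch: relocate Avro in the user jar with the maven-shade-plugin so
     the job's Avro 1.8.x never collides with Flink's 1.7.7. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```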
>>
>> Is there a way around this or can you point me to another approach to
>> read Parquet data from HDFS? How do you normally go about this?
>>
>> Thanks for your help,
>> Lukas
>>
>>
>>
>>
>
