flink-user mailing list archives

From Fabian Hueske <fhue...@apache.org>
Subject Re: TPC-H Benchmark
Date Mon, 22 Sep 2014 10:25:00 GMT
Hi Alex,

these jobs are implemented to read text data from HDFS.
Text is a very inefficient (yet very portable and easy-to-use) format for
reading relational data.
Several formats are much better suited to relational data, such as
Hive's ORC or Parquet (also in the Apache Incubator).

The performance problems with text files are manifold:
- Data representation is not native but must be parsed (CPU intensive).
- Data representation is inefficient (an integer might need several
characters where 4 bytes would suffice).
- All data must be read, even columns that are not used by the query.
- Filters cannot be pushed down into the reader for early filtering.
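
To make the first two points concrete, here is a minimal sketch in plain Python (not Flink code). The sample values and the '|' separator are assumptions chosen for illustration and are not taken from the benchmark data. It encodes the same integers once as delimited text and once as fixed 4-byte values, then decodes both:

```python
import struct

# Hypothetical sample values standing in for an integer column (not TPC-H data).
values = [7, 1234, 987654, 2000000000]

# Text representation: one decimal string per value, separated by '|'.
text_bytes = "|".join(str(v) for v in values).encode("ascii")

# Native representation: each value as a fixed 4-byte little-endian signed int.
binary_bytes = struct.pack("<%di" % len(values), *values)

# Reading text means splitting the line and parsing the digits of every field.
parsed_from_text = [int(field) for field in text_bytes.decode("ascii").split("|")]

# Reading the binary form is a fixed-width decode with no digit parsing.
parsed_from_binary = list(struct.unpack("<%di" % len(values), binary_bytes))

print(len(text_bytes), len(binary_bytes))  # prints "24 16"
```

Small values can be shorter as text, but large values and the separators add up, and every field still has to be parsed before it can be used.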

You could port the jobs to read ORC or Parquet instead. Either use
Hadoop's InputFormats (Flink supports those) or port them to Flink
InputFormats (which are very similar to Hadoop's). Using Hadoop's formats
might add a little overhead but will be easier.
That said, it is not uncommon for I/O to be the bottleneck in data
processing systems.
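
As a rough illustration of what a columnar format with per-chunk statistics can save, here is a toy sketch in plain Python. It is not how ORC or Parquet are actually implemented, and the column names, values, and chunk size are all hypothetical. Each column is stored separately with min/max values per chunk, so a scan can skip unused columns and whole chunks that cannot match the filter:

```python
# Toy column-oriented table: each column is stored separately and split into chunks.
# All names (orderkey, quantity, comment) and values are hypothetical.
CHUNK_SIZE = 3

rows = [
    (1, 5, "a"), (2, 8, "b"), (3, 2, "c"),
    (4, 40, "d"), (5, 45, "e"), (6, 50, "f"),
]

def to_chunks(column):
    """Split one column into chunks and record min/max statistics per chunk."""
    chunks = []
    for i in range(0, len(column), CHUNK_SIZE):
        data = column[i:i + CHUNK_SIZE]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks

names = ["orderkey", "quantity", "comment"]
table = {name: to_chunks([row[idx] for row in rows]) for idx, name in enumerate(names)}

def scan(table, project, filter_col, lower_bound):
    """Return the projected column for rows where filter_col > lower_bound."""
    result, chunks_read = [], 0
    for chunk_no, stats in enumerate(table[filter_col]):
        # Filter push-down: skip the whole chunk if its max cannot pass the filter.
        if stats["max"] <= lower_bound:
            continue
        chunks_read += 1
        # Column pruning: only the projected column is touched, never "comment".
        projected = table[project][chunk_no]["data"]
        for filter_value, value in zip(stats["data"], projected):
            if filter_value > lower_bound:
                result.append(value)
    return result, chunks_read

orderkeys, chunks_read = scan(table, "orderkey", "quantity", 30)
print(orderkeys, chunks_read)  # prints "[4, 5, 6] 1"
```

With a text file, the same query would have to read and parse every field of every row, including the unused comment column.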

Let us know if you need any help.

Cheers, Fabian


2014-09-22 12:12 GMT+02:00 Alexandros Papadopoulos <alex.pap.cs@gmail.com>:

> Hello all,
>
>   I am trying to run some relational queries on Flink over YARN.
> I found two repos (https://github.com/stratosphere/stratosphere-tpch,
> https://github.com/project-flink/flink-perf ) with Java and Scala
> implementations of some of the benchmark queries.
> When running some of them with scale factor 64, reading the dataset seems
> to be the bottleneck.
> Since I am new to the Flink community: is there any way to implement those
> queries more efficiently?
> Also, are there any results of this benchmark for Flink on YARN?
>
> Thanks in advance,
>
> Alex
>
