flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: TPC-H Benchmark
Date Mon, 22 Sep 2014 14:14:49 GMT
Hi Alex,

The "stratosphere-tpch" programs are written against our old Scala API, and
we haven't really fine-tuned them, so they may not be optimally implemented.

We haven't benchmarked Flink explicitly on YARN, but I don't expect the
results to differ from non-YARN setups. We use YARN only to deploy our
JobManager and TaskManagers and then run everything as we do with direct
installations.
The execution is exactly the same for YARN and non-YARN setups.

On Mon, Sep 22, 2014 at 12:25 PM, Fabian Hueske <fhueske@apache.org> wrote:

> Hi Alex,
> these jobs are implemented so that they read text data from HDFS.
> Text is a very inefficient (yet very portable and easy-to-use) format for
> reading relational data.
> Several formats are much better suited to reading relational data, such
> as Hive's ORC or Parquet (also in Apache Incubation).
> The performance problems with text files are manifold:
> - Data representation is not native but must be parsed (CPU intensive)
> - Data representation is inefficient (an integer might need several
>   characters where 4 bytes would suffice)
> - All data must be read, even columns that are not used by the query
> - Filters cannot be pushed down for early filtering
> You could port the jobs to use an ORC or Parquet format. Either use
> Hadoop's InputFormats (Flink supports those) or port them to Flink
> InputFormats (which are very similar to Hadoop's). Using Hadoop's formats
> might have a little overhead but will be easier...
> Having said that, it is not uncommon that I/O is the bottleneck in data
> processing systems.
> Let us know if you need any help.
> Cheers, Fabian
> 2014-09-22 12:12 GMT+02:00 Alexandros Papadopoulos <alex.pap.cs@gmail.com>:
>> Hello all,
>>   I am trying to run some relational queries on Flink over YARN.
>> I found two repos (https://github.com/stratosphere/stratosphere-tpch,
>> https://github.com/project-flink/flink-perf) with Java and Scala
>> implementations of some of the benchmark queries.
>> When running some of them with scale factor 64, reading the dataset
>> seems to be the bottleneck.
>> Since I'm new to the Flink community: is there any way to implement those
>> queries more efficiently?
>> Also, are there any results of this benchmark for Flink on YARN?
>> Thanks in advance,
>> Alex
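
Two of the points in Fabian's list (text must be parsed, and an integer may need more characters than the 4 bytes a binary encoding uses) can be illustrated with a minimal Python sketch. It uses only the standard library and is unrelated to Flink's actual input formats. The sample rows and all function names are assumptions made up for this illustration.

```python
import struct

# Hypothetical sample data: three rows with two integer columns each.
rows = [(1, 17), (1234567, 36), (2000000000, 8)]

def encode_text(rows):
    """Encode rows as pipe-delimited text lines (one row per line)."""
    return "".join(f"{a}|{b}\n" for a, b in rows).encode("ascii")

def encode_binary(rows):
    """Encode every value as a fixed 4-byte little-endian signed integer."""
    return b"".join(struct.pack("<ii", a, b) for a, b in rows)

def decode_text(data):
    """Split each line into fields and parse every field from characters."""
    return [tuple(int(field) for field in line.split("|"))
            for line in data.decode("ascii").splitlines()]

def decode_binary(data):
    """Read values at fixed 8-byte row offsets; no character parsing needed."""
    return [struct.unpack_from("<ii", data, offset)
            for offset in range(0, len(data), 8)]

text = encode_text(rows)
binary = encode_binary(rows)

# Both encodings round-trip to the same rows; only size and decode work differ.
print("text bytes:  ", len(text))
print("binary bytes:", len(binary))
```

For small values the text form can actually be shorter (the value 1 is a single character), which is why the list says an integer "might" need several characters. The last two points in the list, reading unused columns and filter push-down, depend on a columnar layout with metadata and are not covered by this sketch.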
