hawq-user mailing list archives

From Dmitry Buzolin <Dmitry.Buzo...@theice.com>
Subject RE: Very poor Hawq HDFS perfromance
Date Fri, 13 Jan 2017 03:02:24 GMT
Hi Zhanwei,

Thanks for the points. Indeed, I see that half of the files under /hawq_default are zero-length,
and a great many of the others are only 2, 4, or 8MB:

-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/89
-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/9
-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/90

So maybe HAWQ spends most of its time reading those small files, and that is where the CPU
spins. This is also consistent with the near absence of I/O in HDFS. Is there a way to control
this behavior in HAWQ? A 500GB dataset is not a small one for a 10-node cluster, so there has to
be a way to make this distribution more efficient (i.e. without so many zero-length and small
files). Or should I start looking at how the TPC-DS test generates tables? I am using the default
HAWQ configuration and the TPC-DS test from here: https://github.com/pivotalguru/TPC-DS

Thanks,
Dmitry.
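
One way to put numbers on the "half of the files are 0 length" observation is to bucket the sizes reported by a recursive HDFS listing. The following is only a sketch: it assumes the usual eight-column `hdfs dfs -ls -R` layout shown in the listing above, and the bucket boundaries (8MB, and the 128MB block size mentioned in the thread) and all function names are illustrative choices, not anything HAWQ provides.

```python
from collections import Counter

MB = 1024 * 1024
# Illustrative bucket limits in bytes; 128 MB is the block size quoted in the thread.
BUCKETS = [(0, "empty"), (8 * MB, "<= 8MB"), (128 * MB, "<= 128MB")]


def parse_size(line):
    """Return the file size in bytes from one `hdfs dfs -ls` line, or None."""
    fields = line.split()
    # Assumed layout: perms, replication, owner, group, size, date, time, path.
    # Header lines ("Found N items") and directories are skipped.
    if len(fields) < 8 or fields[0].startswith("d"):
        return None
    try:
        return int(fields[4])
    except ValueError:
        return None


def bucket(size):
    """Map a size in bytes to the label of the first bucket that holds it."""
    for limit, label in BUCKETS:
        if size <= limit:
            return label
    return "> 128MB"


def histogram(lines):
    """Count files per size bucket over an iterable of listing lines."""
    counts = Counter()
    for line in lines:
        size = parse_size(line)
        if size is not None:
            counts[bucket(size)] += 1
    return counts


# The three listing lines quoted in this message, used as sample input.
SAMPLE = [
    "-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/89",
    "-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/9",
    "-rw-------   3 gpadmin gpadmin          0 2017-01-11 17:40 /hawq_default/16385/16508/166351/90",
]
print(dict(histogram(SAMPLE)))
```

For a real run, the output of `hdfs dfs -ls -R /hawq_default` could be passed to `histogram` line by line (for example from `sys.stdin`) in place of `SAMPLE`.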

From: Zhanwei Wang [mailto:wangzw@apache.org]
Sent: Thursday, January 12, 2017 9:10 PM
To: user@hawq.incubator.apache.org
Subject: Re: Very poor Hawq HDFS perfromance

Hi Dmitry

1) According to the information you provided, your query performance is limited by the CPU,
which is why you see low HDFS throughput.
2) If you use partitioned tables, the number of files increases.
3) Every HAWQ table is distributed over all segments, so a small table produces small
files on HDFS.
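
Points 2 and 3 can be turned into a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that each table partition is written as one file per segment; the real HAWQ file layout may differ. The function name and the sample numbers are hypothetical, apart from the 72 postgres processes per node and the 9 nodes reported in the original message, and treating each of those processes as one file writer is itself an assumption.

```python
MB = 1024 * 1024


def estimate_files(table_bytes, segments, partitions=1):
    """Rough estimate of file count and average file size for one table.

    Assumption (not confirmed by the thread): one file per segment per partition.
    """
    files = segments * max(partitions, 1)
    avg_bytes = table_bytes / files
    return files, avg_bytes


# Hypothetical example: a 1 GB table spread over 72 writers on each of 9 nodes.
files, avg_bytes = estimate_files(1024 * MB, segments=72 * 9)
print(f"{files} files, about {avg_bytes / MB:.2f} MB each")
```

Under these assumptions even a 1 GB table ends up as several hundred files of under 2MB each, and adding partitions multiplies the file count further.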


Best Regards

Zhanwei Wang
wangzw@apache.org



On January 13, 2017, at 9:42 AM, Dmitry Buzolin <Dmitry.Buzolin@theice.com> wrote:

Hi All,

I see a very strange picture when running the HAWQ TPC-DS benchmark.
The data generation phase for the 500GB data set showed 1.9GB/sec throughput on our 9-node Hadoop cluster.
The table analyze phase showed 3.2GB/sec throughput. However, the test itself shows very poor HDFS performance:

  *   test run as: ./rollout.sh 100 false tpcds true 5 true true true true true true true true true 1
  *   ~90MB/sec for reads and writes clusterwide. I've seen 1.9GB/sec during the data load and table analyze phases.
  *   72 postgres processes on each datanode, each consuming 80% - 90% of CPU and 0% MEM, doing very little I/O.
  *   35036 files on HDFS at 2MB per file. Is this normal? Our block size is 128MB.
  *   top output on one of the nodes shows 0% memory allocated for Postgres, but the processes are heavily busy:
Cpu(s):  0.0%us, 77.4%sy,  0.9%ni, 21.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264403536k total, 172780840k used, 91622696k free,  2986328k buffers
Swap:  4194300k total,        0k used,  4194300k free, 155959360k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
486160 gpadmin   39  19  872m  28m  10m D 88.7  0.0   4:32.52 postgres
486926 gpadmin   39  19  872m  28m  10m R 87.4  0.0   4:34.75 postgres
486405 gpadmin   39  19  872m  28m  10m R 86.4  0.0   4:23.99 postgres
487162 gpadmin   39  19  872m  28m  10m R 80.4  0.0   4:30.14 postgres
486761 gpadmin   39  19  872m  28m  10m R 78.8  0.0   4:28.41 postgres
486256 gpadmin   39  19  872m  28m  10m D 76.5  0.0   4:30.63 postgres

Please suggest explanations why this happens.


________________________________

This message may contain confidential information and is intended for specific recipients
unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient
of this message, please delete it and notify the sender. This message may not represent the
opinion of Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and does
not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the
recipient of this message is expected to provide safeguards from viruses and pursue alternate
means of communication where privacy or a binding message is desired.


