kudu-user mailing list archives

From "fengbaoli@uce.cn" <fengba...@uce.cn>
Subject spark on kudu performance!
Date Mon, 11 Jun 2018 12:52:28 GMT
I am following the development documentation on the Kudu official website and using Spark to analyze Kudu data (the Kudu version is 1.6.0).

The official example code is:

import org.apache.kudu.spark.kudu._
val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table")).kudu
// Query using the Spark API...
df.select("id").filter("id >= 5").show()

My questions are:

(1) My table holds about 1.8 billion rows. If I use the official example code, the DataFrame df is created over the entire table and the filter is applied afterwards. Does this mean all 1.8 billion rows are loaded into memory on every query? The performance is very poor. (A sketch of what I am doing follows below.)

(2) Alternatively, if I create time-based range partitions on the 1.8-billion-row table and then scan the relevant partitions directly with the underlying Java API, would each query load only the data in the specified partitions instead of all 1.8 billion rows? (A sketch of what I mean follows below.)
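Here is roughly what I have in mind for (2), written against the Kudu Java client from Scala (a sketch only; the "event_time" column and the timestamp bounds are placeholders, and my assumption is that predicates on the range partition column let Kudu skip tablets outside the range):

import org.apache.kudu.client.KuduClient
import org.apache.kudu.client.KuduPredicate
import org.apache.kudu.client.KuduPredicate.ComparisonOp

// Connect and open the table (same master and table name as in the Spark example).
val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
val table = client.openTable("kudu_table")
val timeCol = table.getSchema.getColumn("event_time")

// Scan only rows inside one time range; if the range partition key is the time
// column, Kudu should be able to prune tablets outside this range.
val scanner = client.newScannerBuilder(table)
  .addPredicate(KuduPredicate.newComparisonPredicate(timeCol, ComparisonOp.GREATER_EQUAL, 1528675200000L))
  .addPredicate(KuduPredicate.newComparisonPredicate(timeCol, ComparisonOp.LESS, 1528761600000L))
  .setProjectedColumnNames(java.util.Arrays.asList("id"))
  .build()

while (scanner.hasMoreRows) {
  val results = scanner.nextRows()
  while (results.hasNext) {
    val row = results.next()
    // process row.getLong("id") or other projected columns here
  }
}
client.shutdown()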

Please give me some suggestions, thanks!

Big Data Center     Feng Baoli