hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] bhasudha edited a comment on issue #689: [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries
Date Mon, 06 Jan 2020 22:53:07 GMT
URL: https://github.com/apache/incubator-hudi/pull/689#issuecomment-571349958
 
 
   I rebased onto the latest master and verified the Hive queries in the Docker Demo using the new patch. All queries in the Demo work as expected, and incremental queries leverage the optimizations in this patch when hive.fetch.task.conversion is disabled (as desired).
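For reference, a sketch of how the fetch-task conversion can be disabled for a Hive session so incremental queries go through HoodieInputFormat.listStatus — the JDBC URL is a placeholder for your environment, not something from this PR:

```shell
# Connect to HiveServer2 with fetch-task conversion disabled for this session.
# The JDBC URL below is a placeholder; substitute your own HiveServer2 endpoint.
beeline -u jdbc:hive2://localhost:10000 \
  --hiveconf hive.fetch.task.conversion=none \
  -e "select count(*) from tableA where datestr = '2019-12-10';"
```

The same effect can be had inside an open session with `set hive.fetch.task.conversion=none;`.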
   
   I was also able to run tests using spark.sql() against some production tables (both MOR and COW types). I used --conf spark.sql.hive.convertMetastoreParquet=false so that the Hive SerDe is used instead of Spark's built-in Parquet reader. Below is a sample of the queries I tested. The results match between the pre-fix and post-fix hudi-spark-bundle jars.
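A sketch of how such a session might be launched; the bundle jar path is hypothetical, and only the convertMetastoreParquet flag comes from the comment above:

```shell
# Launch spark-shell with the Hudi bundle on the classpath.
# The jar path is a placeholder; point it at your pre-fix or post-fix build.
spark-shell \
  --jars /path/to/hudi-spark-bundle.jar \
  --conf spark.sql.hive.convertMetastoreParquet=false
```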
   
   Snapshot queries
   =============
   simple count:
   spark.sql("select count(*) from tableA where datestr = '2019-12-10'").show()
   
   non-Hudi with Hudi table join:
   spark.sql("select m.col1 as colA, t.col2 as colB from table1 m left join table2 as t on t._row_key = m.col1 and t.datestr >= '2016-01-01' join table3 c on m.col4 = c.col5 where c.col6 = 'XYZ'").show()
   
   non-Hudi with non-Hudi table join:
   spark.sql("select o.col1, count(distinct e.col2) from tableA o join tableB e on o.id = e.id where to_date(e.col1) >= date_sub(current_date, 10) or to_date(e.col3) >= date_sub(current_date, 10) group by 1 order by 2 desc").show()
   
   Hudi with Hudi table join:
   spark.sql("select t.id, count(t.load) as total_count from tableT t left join tableO o on t.id = o.id and o.datestr > '2019-12-28' and not o.isactive where t.datestr > '2019-12-28' and not t.isactive group by 1 order by 1, 2 desc").show()
   
   group by, order by, and rank:
   spark.sql("select * from ( select *, rank() over ( partition by rg order by total_items desc ) as row_number from ( select rg, usr, count(*) as total_items from tableA where date(datestr) >= date('2019-10-11') and date(datestr) < date('2019-10-16') and event = 'complete' and SUBSTRING_INDEX(rg, '.', 1) = 'adhoc' group by 1, 2 order by 1, count(*) desc ) ) where row_number <= 1").show()
   
   Incremental queries
   ==============
   spark.sql("select name, count(*) from tableA where event_status = 'complete' and `_hoodie_commit_time` > '20200101235440' group by 1").show()
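On the Hive side, incremental consumption is driven by the hoodie.&lt;table&gt;.consume.* session properties. A hedged sketch, reusing the tableA name and checkpoint timestamp from the query above (the JDBC URL and max.commits value are assumptions, not from this PR):

```shell
# Run an incremental query through HiveServer2.
# consume.mode=INCREMENTAL and consume.start.timestamp select only commits
# after the checkpoint; max.commits=-1 is assumed here to mean "all new commits".
beeline -u jdbc:hive2://localhost:10000 \
  --hiveconf hoodie.tableA.consume.mode=INCREMENTAL \
  --hiveconf hoodie.tableA.consume.start.timestamp=20200101235440 \
  --hiveconf hoodie.tableA.consume.max.commits=-1 \
  -e "select name, count(*) from tableA where event_status = 'complete' and \`_hoodie_commit_time\` > '20200101235440' group by 1;"
```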
