hive-user mailing list archives

From "Sun, Keith" <ai...@ebay.com>
Subject Re: Why the filter push down does not reduce the read data record count
Date Fri, 23 Feb 2018 15:48:56 GMT
I got your point, and thanks for the pointer to the slides.


So Parquet filtering is not as simple as I assumed; I will try tuning it following the deck.


Thanks !

________________________________
From: Furcy Pin <pin.furcy@gmail.com>
Sent: Friday, February 23, 2018 3:37:52 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count

Hi,

Unless your table is partitioned or bucketed by myid, Hive generally has to read through
all the records to find the ones that match your predicate.
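
For instance (just a sketch, with placeholder table and column names based on your example),
a table bucketed and sorted by myid would let Hive narrow an equality lookup down to a
single bucket instead of scanning everything:

-- Hypothetical DDL: bucketing by myid lets Hive prune an equality
-- lookup to one bucket file instead of reading the whole table
CREATE TABLE mytable (
  myid    BIGINT,
  payload STRING
)
CLUSTERED BY (myid) SORTED BY (myid) INTO 256 BUCKETS
STORED AS PARQUET;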

In other words, Hive tables are generally not indexed for single-record retrieval the way
you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding bloom filters on
specific columns of a table (https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/),
which can work as a kind of index.
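
For ORC, that is just a matter of table properties, something like this (again a sketch,
table and column names are placeholders):

-- ORC bloom filter on myid; fpp is the target false positive rate
CREATE TABLE mytable_orc (
  myid    BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'myid',
  'orc.bloom.filter.fpp'     = '0.05'
);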
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala, Presto...) and
its version, it may or may not be able to leverage certain storage optimizations.
For example, Spark still does not support the Hive bucketed table optimization, though it
might come in the upcoming Spark 2.3.
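
The pushdown itself is usually governed by a couple of session settings; off the top of my
head (please double-check the names against your versions):

-- Hive: enable predicate pushdown and pushing filters into the file format
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;
-- Spark-SQL has its own switch for Parquet:
SET spark.sql.parquet.filterPushdown=true;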


I'm much less familiar with Parquet, so if anyone has links to good documentation on Parquet
fine-tuning (or, even better, a comparison with ORC features), that would be really helpful.
By googling, I found these slides (https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide)
in which someone at Netflix seems to have tried the same kind of optimization as you in Parquet.
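
If I understood the deck correctly, the key idea is that Parquet can skip whole row groups
using min/max statistics, but only if the data within each file is sorted on the filtered
column. A rewrite along these lines (a sketch, table names are placeholders) should make
the row-group statistics on myid selective:

-- Rewrite the data sorted by myid so each row group covers a narrow
-- range of myid values, allowing the reader to skip most row groups
INSERT OVERWRITE TABLE mytable_sorted
SELECT *
FROM mytable
DISTRIBUTE BY myid
SORT BY myid;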





On 23 February 2018 at 12:02, Sun, Keith <aisun@ebay.com> wrote:

Hi,


Why does Hive still read so many records even with filter pushdown enabled, when the returned
dataset is very small (4k out of 30 billion records)?
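
The query is basically a point lookup of this shape (table name hypothetical here):

SELECT *
FROM mytable     -- hypothetical name
WHERE myid = 223;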


The "RECORDS_IN" counter of Hive which still showed the 30billion count and also the output
in the map reduce log like this :

org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 100000


BTW, I am using Parquet as the storage format, and the filter pushdown did work, as I see
this in the log:


AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(myid, 223)


Thanks,

Keith


