hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Subject Re: Optimize hive external tables with serde
Date Tue, 21 Oct 2014 21:12:05 GMT
1. The gzip files are not splittable, so gzip itself will make the queries slower.
2. As a reference for JSON serdes , here is a example from my blog http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/
3. Need to see your query first to try and optimize it
4. Even if you have datewise partitions and u have 5 years of data i.e. about 1825 partitions. 
 -- Trying to do a select count(*) without where clause might make hive crawl.

      From: Ja Sam <ptrstpppp@gmail.com>
 To: user@hive.apache.org 
 Sent: Tuesday, October 21, 2014 10:37 AM
 Subject: Optimize hive external tables with serde
   
Part 1: my enviromentI have following files uploaded to Hadoop:   
   - The are plain text
   - Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}   
   - code are numbers from 1 to 3000,
   - customerId are total up to 4 millions, daily up to 0.5 millon
   - All files are gzip
   - In hive I created external table with custom JSON serde (let's call it CUSTOMER_DATA)
   - All files from each date is stored in separate directory - and I use it as partitions
in Hive tables
Most queries which I do are filtering by date, code and customerId. I have also a second
file with format (let's call it CUSTOMER_ATTRIBUTES]: [customerId] [attribute_1] [attribute_2]
... [attribute_n]which contains data for all my customers, so rows are up to 4 millions.I
query and filter my data in following way:   
   - Filtering by date - partitions do the job here using WHERE partitionDate IN (20141020,20141020)
   - Filtering by code using statement like for example `WHERE code IN (1,4,5,33,6784)
   - Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with condition query like SELECT
customerId FROM CUSTOMER_DATA JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: questionIs there any efficient way how can I optimize my queries. I read about indexes
and buckets by I don't know if I can use them with external tables and if they will optimize
my queries.

  
Mime
View raw message