hive-user mailing list archives

From Ja Sam <ptrstp...@gmail.com>
Subject Optimize hive external tables with serde
Date Tue, 21 Oct 2014 17:37:19 GMT
*Part 1: my environment*

I have the following files uploaded to Hadoop:

   1. They are plain text
   2. Each line contains JSON like:

{code:[int], customerId:[string], data:{[something more here]}}

   1. code is a number from 1 to 3000
   2. There are up to 4 million distinct customerId values in total, and
   up to 0.5 million per day
   3. All files are gzipped
   4. In Hive I created an external table with a custom JSON serde (let's
   call it CUSTOMER_DATA)
   5. All files for each date are stored in a separate directory, and I
   use these directories as partitions in the Hive table
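
To make the setup concrete, the table definition is roughly of the following shape. This is only a sketch: the SerDe class name, the HDFS paths, the type of the data column and the type of the partitionDate column are placeholders or assumptions, not my exact DDL.

```sql
-- Sketch only: SerDe class, paths and column types are placeholders.
CREATE EXTERNAL TABLE CUSTOMER_DATA (
  code INT,
  customerId STRING,
  `data` MAP<STRING, STRING>   -- assumed type; the real structure of "data" is not shown here
)
PARTITIONED BY (partitionDate INT)           -- assumed INT, since I filter with IN (20141020, ...)
ROW FORMAT SERDE 'com.example.JsonSerDe'     -- hypothetical class name for the custom JSON serde
STORED AS TEXTFILE                           -- gzipped text files are read transparently
LOCATION '/path/to/customer_data';           -- hypothetical base directory

-- One partition per date directory (hypothetical path).
ALTER TABLE CUSTOMER_DATA ADD PARTITION (partitionDate = 20141020)
LOCATION '/path/to/customer_data/20141020';
```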

Most of my queries filter by date, code and customerId. I also have a
second file (let's call it CUSTOMER_ATTRIBUTES) with the format:
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
It contains data for all my customers, so it has up to 4 million rows.

I query and filter my data in the following way:

   1. Filtering by date - partitions do the job here, using WHERE
   partitionDate IN (20141020,20141020)
   2. Filtering by code, using a statement such as WHERE code IN
   (1,4,5,33,6784)
   3. Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a
   query like SELECT CUSTOMER_DATA.customerId FROM CUSTOMER_DATA JOIN
   CUSTOMER_ATTRIBUTES ON
   (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE
   CUSTOMER_ATTRIBUTES.attribute_1=[something]
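
Put together, a typical query combining the three filters above looks roughly like this. The literal values and the table aliases are placeholders for illustration only.

```sql
-- Sketch of a typical query; all literal values are placeholders.
SELECT d.customerId
FROM CUSTOMER_DATA d
JOIN CUSTOMER_ATTRIBUTES a
  ON (a.customerId = d.customerId)
WHERE d.partitionDate IN (20141020)     -- 1. partition filter by date
  AND d.code IN (1, 4, 5, 33)           -- 2. filter by code
  AND a.attribute_1 = 'some_value';     -- 3. filter on the joined attributes table
```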

*Part 2: question*

Is there an efficient way to optimize my queries? I have read about
indexes and buckets, but I don't know whether I can use them with external
tables or whether they would speed up my queries.
