hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ptrstpppp <>
Subject Re: Optimize hive external tables with serde
Date Wed, 22 Oct 2014 08:02:36 GMT
ad 1) My files are not bigger than Block Size. To be precise all data from
one day are up to 2GB gzipped. Unzipped they are ~15GB. The are split in
one folder into files less then block size (block size in my case is 128MB,
files are ~100MB). I can transform them to other format if you think it
will speed up my queries, but as I understand the documentation, such
structure should be same as split-able files.
If I add one more partition dimension with code, so I have partition by
date and code, single files is 3-10MB. Because unique codes I have 3000

ad 2)   I read your code and I use same serde only newer one. Should I
check something more?

Ad 4) Most  queries will be about last 3-6 months

Ad 3) My example queries are:
a) Code heatmap - how many users used some "code"
COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
COUNT(*) AS NumberOfEntries
  WHERE 1=1
   AND code in (1,3,54,300,222,111)
   AND partitionDate > '20140922' AND partitionDate <= '20141022'
  GROUP BY code
) A
COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
COUNT(*) AS NumberOfAllEntries
  WHERE 1=1
   AND code in (1,3,54,300,222,111)
   AND partitionDate > '20140922' AND partitionDate <= '20141022'
) B

b) Select users which use some code, but have some special attributes in
second table:
SELECT json_table.customerId
 FROM json_table
 JOIN attribute_table ON (json_table.customerId=attribute_table.customerId)
 AND attribute_table.attribute_X1='1'
 AND attribute_table.attribute_X2='1'
 AND json_table.code in (1,3,54,300,222,111);

c) Mix of above.

On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian <> wrote:

> 1. The gzip files are not splittable, so gzip itself will make the queries
> slower.
> 2. As a reference for JSON serdes , here is a example from my blog
> 3. Need to see your query first to try and optimize it
> 4. Even if you have datewise partitions and u have 5 years of data i.e.
> about 1825 partitions.
>    -- Trying to do a select count(*) without where clause might make hive
> crawl.
>   ------------------------------
>  *From:* Ja Sam <>
> *To:*
> *Sent:* Tuesday, October 21, 2014 10:37 AM
> *Subject:* Optimize hive external tables with serde
> *Part 1: my enviroment*
> I have following files uploaded to Hadoop:
>    1. The are plain text
>    2. Each line contains JSON like:
> {code:[int], customerId:[string], data:{[something more here]}}
>    1. code are numbers from 1 to 3000,
>    2. customerId are total up to 4 millions, daily up to 0.5 millon
>    3. All files are gzip
>    4. In hive I created external table with custom JSON serde (let's call
>    5. All files from each date is stored in separate directory - and I
>    use it as partitions in Hive tables
> Most queries which I do are filtering by date, code and customerId. I
> have also a second file with format (let's call it CUSTOMER_ATTRIBUTES]:
> [customerId] [attribute_1] [attribute_2] ... [attribute_n] which contains
> data for all my customers, so rows are up to 4 millions.
> I query and filter my data in following way:
>    1. Filtering by date - partitions do the job here using WHERE
>    partitionDate IN (20141020,20141020)
>    2. Filtering by code using statement like for example `WHERE code IN
>    (1,4,5,33,6784)
>    3. Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with condition
>    query like  SELECT customerId FROM CUSTOMER_DATA JOIN
>    CUSTOMER_ATTRIBUTES.attribute_1=[something]
> *Part 2: question*
> Is there any efficient way how can I optimize my queries. I read about
> indexes and buckets by I don't know if I can use them with external tables
> and if they will optimize my queries.

View raw message