hive-user mailing list archives

From: Sanjay Subramanian <>
Subject: Re: Optimize hive external tables with serde
Date: Wed, 22 Oct 2014 15:40:30 GMT
The non-gzipped version will be faster. One of the use cases for a GZIP file is when your input records are long, multi-line formats. In the following example, all contents between the begin and end markers have to go to the same mapper. If you do NOT gzip such a file (containing multiple records), then HDFS will split it into 128 MB blocks (or whatever you have set the HDFS block size to be), and it may happen that the contents between a begin and end marker get split across blocks, and you will get errors processing the records.
[begin_20100922_12345678_serverid_foofla01]
blah blah blah blah blah blah blah blah blah blah blah blah
[end_20100922_12345678_serverid_foofla01]
So if your records are not multi-line, don't use GZIP. I have used LZO, which is fast, but getting LZO to work correctly needs a little bit of work on your cluster. It's not difficult, just a few steps to do. You can refer to my documentation on the Hive site.
I rewrote your query. (I created an initial subset jt1 with the shared filters and reused it on both sides of the join; the GROUP BY has to stay on the per-code subquery for the query to be valid.) I did not understand two things in your query:
The join clause   ---   ON (1=1) ?
The where clause  ---   1=1 ?
a) Code heatmap - how many users used some "code":

WITH jt1 AS (
  SELECT *
  FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
)
SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM
  (SELECT
     jt1.code AS code,
     COUNT(DISTINCT jt1.customerId) AS NumberOfDistinctCustomers,
     COUNT(*) AS NumberOfEntries
   FROM jt1
   GROUP BY jt1.code) A
JOIN
  (SELECT
     COUNT(DISTINCT jt1.customerId) AS NumberOfAllDistinctCustomers,
     COUNT(*) AS NumberOfAllEntries
   FROM jt1) B
ON (1=1);
b) Select users which use some code, but have some special attributes in a second table:
NOTE: You are querying this with no partition filter, so if json_table has 500-1000+ partitions then this query might be really slow or may not run at all. One way you can solve this is to write a bash script that loops through a set of partitions, store the results in an intermediate Hive managed table, and then have another query aggregate that (see the sketch after the rewritten query below).
SELECT jt.customerId
FROM
  (SELECT *
   FROM json_table
   WHERE 1=1
     AND code IN (1,3,54,300,222,111)) jt
JOIN
  (SELECT *
   FROM attribute_table
   WHERE attribute_X1='1'
     AND attribute_X2='1') at
ON jt.customerId = at.customerId;
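To make the partition-by-partition idea concrete, here is a rough, hypothetical sketch; the staging table name and the ${DT} variable are illustrative, not from the thread. The idea is to stage the filtered rows date by date, then run the join over the much smaller staged table. (Stage raw rows rather than per-partition counts, since COUNT(DISTINCT ...) results cannot be summed across partitions.)

-- One-time: a managed staging table, partitioned like the source (hypothetical names)
CREATE TABLE json_table_stage (code INT, customerId STRING)
PARTITIONED BY (partitionDate STRING);

-- The bash loop would submit this once per date, substituting ${DT}:
INSERT OVERWRITE TABLE json_table_stage PARTITION (partitionDate='${DT}')
SELECT code, customerId
FROM json_table
WHERE partitionDate = '${DT}'
  AND code IN (1,3,54,300,222,111);

-- Final pass: the join now scans only the staged subset
SELECT jt.customerId
FROM json_table_stage jt
JOIN attribute_table at ON jt.customerId = at.customerId
WHERE at.attribute_X1='1'
  AND at.attribute_X2='1';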

From: ptrstpppp <>
To: Sanjay Subramanian <>
Sent: Wednesday, October 22, 2014 1:02 AM
Subject: Re: Optimize hive external tables with serde
ad 1) My files are not bigger than the block size. To be precise, all data from one day is up to 2 GB gzipped; unzipped it is ~15 GB. It is split within one folder into files smaller than the block size (the block size in my case is 128 MB, the files are ~100 MB). I can transform them to another format if you think it will speed up my queries, but as I understand the documentation, such a structure should behave the same as splittable files. If I add one more partition dimension with code, so that I partition by date and code, single files are 3-10 MB because of the number of unique codes I have.
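For illustration, the date+code partitioning described above would look roughly like the sketch below; this is a guess at the schema from the thread, with placeholder paths, not the real DDL.

-- Hypothetical two-level partitioning: one directory per (date, code) pair
CREATE EXTERNAL TABLE json_table_by_code (
  customerId STRING,
  data STRING
)
PARTITIONED BY (partitionDate STRING, code INT)
LOCATION '/path/to/data';                         -- placeholder

-- Each day then needs one ADD PARTITION per code seen that day, e.g.:
ALTER TABLE json_table_by_code
  ADD PARTITION (partitionDate='20141022', code=54)
  LOCATION '/path/to/data/20141022/54';           -- placeholder layout

With up to 3000 codes per day this multiplies the partition count and leaves many small 3-10 MB files, which tends to be harder on HDFS and the metastore than fewer, larger files.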

ad 2) I read your code and I use the same serde, only a newer version. Should I check anything more?
ad 4) Most queries will be about the last 3-6 months.
ad 3) My example queries are:

a) Code heatmap - how many users used some "code":

SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM
  (SELECT
     code,
     COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
     COUNT(*) AS NumberOfEntries
   FROM json_table
   WHERE 1=1
     AND code IN (1,3,54,300,222,111)
     AND partitionDate > '20140922'
     AND partitionDate <= '20141022'
   GROUP BY code) A
JOIN
  (SELECT
     COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
     COUNT(*) AS NumberOfAllEntries
   FROM json_table
   WHERE 1=1
     AND code IN (1,3,54,300,222,111)
     AND partitionDate > '20140922'
     AND partitionDate <= '20141022') B
ON (1=1);
b) Select users which use some code, but have some special attributes in a second table:

SELECT json_table.customerId
FROM json_table
JOIN attribute_table
  ON (json_table.customerId = attribute_table.customerId)
  AND attribute_table.attribute_X1='1'
  AND attribute_table.attribute_X2='1'
  AND json_table.code IN (1,3,54,300,222,111);
c) Mix of above.

On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian <> wrote:

1. The gzip files are not splittable, so gzip itself will make the queries slower.
2. As a reference for JSON serdes, here is an example from my blog
3. Need to see your query first to try and optimize it
4. Even if you have date-wise partitions, with 5 years of data you have about 1825 partitions.
 -- Trying to do a SELECT COUNT(*) without a WHERE clause might make Hive crawl; a partition-pruned count (sketched below) stays cheap.
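For instance, restricting the count to a partition range lets Hive prune everything else; a minimal illustration reusing the table and column names from this thread:

-- Scans roughly three weeks of partitions instead of all ~1825
SELECT COUNT(*)
FROM json_table
WHERE partitionDate > '20141001'
  AND partitionDate <= '20141022';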

From: Ja Sam <>
Sent: Tuesday, October 21, 2014 10:37 AM
Subject: Optimize hive external tables with serde
Part 1: my environment

I have the following files uploaded to Hadoop:
   - They are plain text
   - Each line contains JSON like: {code:[int], customerId:[string], data:{[something more here]}}
   - codes are numbers from 1 to 3000
   - customerIds total up to 4 million, with up to 0.5 million daily
   - All files are gzipped
   - In Hive I created an external table with a custom JSON serde (let's call it CUSTOMER_DATA)
   - All files from each date are stored in a separate directory, and I use it as partitions in Hive tables (a sketch of such a table follows this list)
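A rough sketch of the table layout being described; the serde class, column types, and paths are placeholders, not the actual ones used:

-- Hypothetical DDL matching the description above
CREATE EXTERNAL TABLE CUSTOMER_DATA (
  code INT,
  customerId STRING,
  data STRING                                  -- the nested JSON object; types are a guess
)
PARTITIONED BY (partitionDate STRING)
ROW FORMAT SERDE 'com.example.MyJsonSerDe'     -- placeholder for the custom JSON serde
LOCATION '/data/customer_data';                -- placeholder

-- One partition per daily directory:
ALTER TABLE CUSTOMER_DATA
  ADD PARTITION (partitionDate='20141020')
  LOCATION '/data/customer_data/20141020';     -- placeholder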
Most queries which I run filter by date, code, and customerId. I also have a second file (let's call it CUSTOMER_ATTRIBUTES) with the format: [customerId] [attribute_1] [attribute_2] ... [attribute_n], which contains data for all my customers, so it has up to 4 million rows. I query and filter my data in the following ways:
   - Filtering by date - partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
   - Filtering by code, using a statement like WHERE code IN (1,4,5,33,6784)
   - Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a condition like WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question

Is there any efficient way to optimize my queries? I read about indexes and buckets, but I don't know if I can use them with external tables and whether they will optimize my queries.
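For reference, the bucketing being asked about is declared in the DDL. A hypothetical sketch, with placeholder names; whether it actually helps depends on the join and query patterns:

-- Hypothetical: a table bucketed by the key joined against CUSTOMER_ATTRIBUTES
CREATE EXTERNAL TABLE CUSTOMER_DATA_BUCKETED (
  code INT,
  customerId STRING,
  data STRING
)
PARTITIONED BY (partitionDate STRING)
CLUSTERED BY (customerId) INTO 32 BUCKETS      -- bucket count is arbitrary here
LOCATION '/data/customer_data_bucketed';       -- placeholder

One caveat: bucketing only takes effect for data written through Hive (with hive.enforce.bucketing enabled in older versions); declaring CLUSTERED BY over pre-existing external files does not bucket them.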

