hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raj Hadoop <hadoop...@yahoo.com>
Subject Re: Help in debugging Hive Query
Date Thu, 25 Jul 2013 18:12:43 GMT
Hi Sanjay,
 
Thanks for taking the time to write all the details. I did a silly mistake. The data type
for visit_page_num, i created it as string. The string was causing issues when I am using
the max function. A type cast to int in the query worked for me.
 
Regards,
Raj


________________________________
From: Sanjay Subramanian <Sanjay.Subramanian@wizecommerce.com>
To: "user@hive.apache.org" <user@hive.apache.org> 
Sent: Thursday, July 25, 2013 1:41 PM
Subject: Re: Help in debugging Hive Query



The query is correct but since u r creating a managed table , that is possibly creating some
issue and the records are not all getting created

This is what I would propose

CHECKPOINT  1 : Is this query running at all ?
===================================
Use this option in BOLD and run the QUERY ONLY (without any table creation) to log errors
and pipe to a log file by using nohup or some other way that u prefer
hive -hiveconf hive.root.logger=INFO,console -e

select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data
group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;


CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these additional options
===========================================================================
Required params:
------------------------
SET mapreduce.job.maps=500; 
SET mapreduce.job.reduces=8; 
SET mapreduce.tasktracker.map.tasks.maximum=12; 
SET mapreduce.tasktracker.reduce.tasks.maximum=8; 
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; 
SET mapreduce.map.output.compress=true; 


Optional params:
-----------------------
If u r using compression in output , use the following ; u can change the LzoCodec to whatever
u r using for compression 
SET hive.exec.compress.intermediate=true; 
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; 
SET mapreduce.output.fileoutputformat.compress=true; 


Thanks

Sanjay

From: Raj Hadoop <hadoopraj@yahoo.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Raj Hadoop <hadoopraj@yahoo.com>
Date: Thursday, July 25, 2013 5:00 AM
To: Hive <user@hive.apache.org>
Subject: Help in debugging Hive Query


All,

I am trying to determine visits for customer from omniture weblog file using Hive.

Table: omniture_web_data
Columns: visid_high,visid_low,evar23,visit_page_num

Sample Data:
visid_high,visid_low,evar23,visit_page_num
999,888,1003,10
999,888,1003,14
999,888,1003,6
999,777,1003,12
999,777,1003,20

I want to calculate for each Customer Number ( evar23 is  Customer Number ) , total visits.
visid_high and visid_low determines a unique visit.
For each distinct visitor, calculate sum of maximum visit_page_num. In above example

14 + 20 = 34 should be the total visits for the customer 1003.

I am trying to run the following queries - Method 1 is almost the same as Method 2. Except
in Method 1 I only choose a particualr customer number 1003. In method 2 , i generalized to
all.

In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting the same result
as Method 1. 

Any suggestions on how to trouble shoot. ALso, any alternative approaches.

// Method 1
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data where evar23='1003')
a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data
where evar23='1003' group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

///// Result of Method 1

1003    34

// Method 2

create table temp123 as
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data
group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

select * from temp123 where evar23='1003';

// The Result of Method 2 is not the same as Method 1. It is showing a different number.



Thanks,
Raj

 

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s)
and may contain confidential and privileged information. Any unauthorized review, use, disclosure
or distribution is prohibited. If you are not the intended recipient, please contact the sender
by reply email and destroy all copies of the original message along with any attachments,
from your computer system. If you are the intended recipient, please be advised that the content
of this message is subject to access, review and disclosure by the sender's Email System Administrator.
Mime
View raw message