hive-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Custom Mapper and Reducer vs HiveQL in terms of Performance
Date Fri, 13 Jul 2012 00:57:06 GMT
Raihan,

There is no need to implement a custom mapper or reducer. If you are
experiencing performance issues, you might consider using bucketed
tables and doing a bucketed map join or a sorted merge map join. A good
overview of join performance can be found in these slides from Facebook:
https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf
but basically you need to choose a good strategy depending on your data.
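
As a rough sketch of what a bucketed, sorted setup for such a join could look like (the column names `id` and `payload`, the `_bucketed` table names, and the bucket count of 32 are hypothetical placeholders, not taken from your schema; the settings shown are the ones used by Hive releases of this era and may differ in later versions):

```sql
-- Hypothetical schema: only the names Table1 and Table2 come from the thread.
-- Both tables are bucketed and sorted on the join key, with the same bucket count.
CREATE TABLE table1_bucketed (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS;

CREATE TABLE table2_bucketed (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS;

-- Make the inserts honor the bucketing and sorting declared above.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE table1_bucketed SELECT id, payload FROM table1;
INSERT OVERWRITE TABLE table2_bucketed SELECT id, payload FROM table2;

-- Enable the bucketed map join and its sorted merge variant.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Report keys present in both tables whose values disagree.
SELECT /*+ MAPJOIN(t2) */
       t1.id,
       t1.payload AS expected_payload,
       t2.payload AS actual_payload
FROM table1_bucketed t1
JOIN table2_bucketed t2 ON (t1.id = t2.id)
WHERE t1.payload <> t2.payload;
```

Note that this inner join only catches rows that exist in both tables with different values. Rows missing from one side would need a separate outer-join query, and whether any of this beats your current query depends on your data.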

Regards,
Esteban.





--
Cloudera, Inc.




On Thu, Jul 12, 2012 at 2:18 PM, Raihan Jamal <jamalraihan@gmail.com> wrote:

> Sending this again, as I haven't received any reply. Any personal
> experience will be appreciated.
>
>
>
> *Raihan Jamal*
>
>
>
> On Mon, Jul 9, 2012 at 3:37 PM, Raihan Jamal <jamalraihan@gmail.com> wrote:
>
>> *Problem Statement:*
>>
>> I need to compare two tables, Table1 and Table2, which both store the same
>> thing. Table1 is the main table, so Table2 has to be compared against it.
>> After comparing, I need to produce a report of the discrepancies found in
>> Table2. Both tables hold a lot of data, at TB scale. Currently I have
>> written HiveQL to do the comparisons and get the data back.
>>
>> So my question is which is better in terms of PERFORMANCE: writing a CUSTOM
>> MAPPER and REDUCER to do this kind of job, or will the HiveQL that I wrote
>> be fine, given that I will be joining these two tables on millions of
>> records? As far as I know, HiveQL internally (behind the scenes) generates
>> optimized map-reduce jobs, submits them for execution, and gets back the
>> results.
>>
>>
>> *Raihan Jamal*
>>
>>
>
