hive-user mailing list archives

From Firas Abuzaid <fabuz...@stanford.edu>
Subject Re: Tuning Triangle Joins on Hive
Date Tue, 05 Aug 2014 20:10:21 GMT
Thanks, that's very helpful!


On Sat, Aug 2, 2014 at 12:47 PM, Lefty Leverenz <leftyleverenz@gmail.com>
wrote:

>> How do indexes work in Hive?
>
> See the Indexes design doc
> <https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive
> wiki, although it hasn't been updated.
>
> -- Lefty
>
>
> On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <
> chandra.reddy2005@gmail.com> wrote:
>
>> How do indexes work in Hive? I thought file formats like ORC have
>> indexes in each block, but not a separate index that can help query
>> performance.
>> Thanks,
>> Chandra
>>
>>
>> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <devopam@gmail.com> wrote:
>>
>>> Please try the following approach and let me know if you are not getting
>>> better performance:
>>>
>>> 1. Ensure indexes are present on the dst and src columns in the
>>> respective tables.
>>> 2. First create a subset by joining r2 and r3 (i.e., r3.src > r2.src) in
>>> a physical table, and then create an index on its new src column as well.
>>> 3. Join this to r1.
>>>
>>> If this approach works well, then try out WITH SELECT ... using the
>>> same approach; the only difference is that no physical intermediate
>>> table will be created.
>>>
>>> Hope it helps..
>>>
>>> regards
>>> Dev
>>>
>>>
>>>
>>>
>>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuzaid@stanford.edu>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>>> wondering if we can get any better performance. Here's the query we're
>>>> running:
>>>>
>>>> SELECT count(*)
>>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>>
>>>> We're currently passing the following tuning parameters as well:
>>>>
>>>> set mapred.map.tasks=120;
>>>> set mapred.reduce.tasks=120;
>>>> set mapred.tasktracker.map.tasks.maximum=8;
>>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>>> set mapred.child.java.opts=-Xmx5120m;
>>>>
>>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>>> most of our time is spent on garbage collection. We have about 30 machines
>>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>>> can improve performance? Thanks in advance!
>>>>
>>>
>>>
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary
>>>
>>
>>
>
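The two-step plan suggested in the thread (join r2 with r3 into an intermediate table, index its src column, then join the result to r1) can be sketched on a toy graph. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in engine rather than Hive, the table name `edges`, the intermediate table `r23`, the index name `r23_src`, and the sample edge list are all hypothetical, and it says nothing about how either form performs on a real cluster. It only checks that the rewritten join counts the same triangles as the original query.

```python
import sqlite3

# Hypothetical toy edge list (not the dataset from the thread).
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

# The triangle query from the original post, run against the toy table.
direct_count = conn.execute("""
    SELECT count(*)
    FROM edges r1
    JOIN edges r2 ON (r1.dst = r2.src)
    JOIN edges r3 ON (r2.dst = r3.src AND r3.dst = r1.src)
    WHERE r1.src < r2.src AND r2.src < r3.src
""").fetchone()[0]

# Step 2 of the suggestion: materialize the r2-r3 pairs that already satisfy
# r2.src < r3.src, keeping r2.src as "src" and r3.dst as "dst".
# Duplicate (src, dst) rows are kept on purpose, since each one is a distinct path.
conn.execute("""
    CREATE TABLE r23 AS
    SELECT r2.src AS src, r3.dst AS dst
    FROM edges r2
    JOIN edges r3 ON (r2.dst = r3.src)
    WHERE r2.src < r3.src
""")
conn.execute("CREATE INDEX r23_src ON r23 (src)")

# Step 3: join the intermediate table back to r1 to close the triangle.
two_step_count = conn.execute("""
    SELECT count(*)
    FROM edges r1
    JOIN r23 ON (r1.dst = r23.src AND r23.dst = r1.src)
    WHERE r1.src < r23.src
""").fetchone()[0]

print(direct_count, two_step_count)
```

Both counts should agree on any edge list, since the second form only moves the r2.src < r3.src filter into the intermediate table. Whether materializing that table helps in Hive depends on how large the set of two-hop paths turns out to be, which the thread does not report.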

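Since the original post says most of the time goes to garbage collection, it may be worth checking the memory arithmetic implied by the quoted settings. The sketch below uses only numbers from the thread. The assumptions are that every map and reduce slot on a node is occupied at the same time and that each child JVM grows to its full -Xmx. The thread does not establish either, so treat the result as a worst case rather than a diagnosis.

```python
# Values copied from the settings and cluster description quoted in the thread.
map_slots_per_node = 8      # mapred.tasktracker.map.tasks.maximum
reduce_slots_per_node = 8   # mapred.tasktracker.reduce.tasks.maximum
heap_per_task_mb = 5120     # -Xmx5120m in mapred.child.java.opts
node_ram_gb = 45            # RAM per machine

# Assumption: all slots are busy at once and every child JVM reaches its full heap.
total_slots = map_slots_per_node + reduce_slots_per_node
worst_case_heap_gb = total_slots * heap_per_task_mb / 1024

# True when the configured heaps alone could exceed the machine's physical RAM.
oversubscribed = worst_case_heap_gb > node_ram_gb

print(total_slots, worst_case_heap_gb, oversubscribed)
```

Under those assumptions the configured heaps add up to more than the 45GB available per machine, so the slot counts and heap size may be worth revisiting together.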