hive-user mailing list archives

From Stephen Sprague <sprag...@gmail.com>
Subject Re: Improving self join time
Date Thu, 20 Mar 2014 17:28:00 GMT
I agree with your assessment of the inner query. Why stop there, though?
Doesn't the outer query fetch the ids of the tags that the inner query
identified?
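
To make the suggestion concrete, the two pieces discussed in this thread can be combined into one query: use the inner GROUP BY/HAVING to keep only element values that actually repeat, then self-join only on those values so the join never touches unique elements. This is an untested sketch, assuming the `elements(id INT, element STRING)` table described later in the thread:

```sql
-- Sketch (untested): restrict the self-join to duplicated values only.
-- Assumes a table elements(id INT, element STRING) as in the thread.
SELECT
    a.id,
    b.id AS id2,
    a.element
FROM (
    -- inner query: element values that occur more than once
    SELECT element
    FROM elements
    GROUP BY element
    HAVING count(*) > 1
) d
JOIN elements a ON (a.element = d.element)
JOIN elements b ON (b.element = d.element)
WHERE a.id < b.id;   -- emit each pair once
```

If a few element values are extremely common, the join output can still be quadratic in those values' row counts; the pre-filter only removes the cost of joining on values that appear exactly once.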



On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey <storey.jeff@gmail.com> wrote:

> I don't think this quite fits here... I think the inner query will give me a
> list of duplicate elements and their counts, but it loses the information
> as to which id had these elements.
>
> I'm trying to find which pairs of ids have any duplicate tags.
>
>
> On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague <spragues@gmail.com> wrote:
>
>> hmm. would this not fall under the general problem of identifying
>> duplicates?
>>
>> Would something like this meet your needs? (untested)
>>
>> select  -- outer query finds the ids for the duplicates
>>     a.key
>> from (  -- inner query lists duplicate values
>>     select
>>         value,
>>         count(*) as cnt
>>     from foo
>>     group by value
>>     having count(*) > 1
>> ) z
>> join foo a on (a.value = z.value)
>> ;
>>
>> table foo is your elements table,
>> key is your id,
>> value is your element
>>
>>
>> On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey <storey.jeff@gmail.com> wrote:
>>
>>> I have a table with 10 million rows and 2 columns - id (int) and element
>>> (string). I am trying to do a self join that finds any ids where the
>>> element values are the same, and my query looks like:
>>>
>>> select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
>>> JOIN elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
>>>
>>> I tested this at a smaller scale and it works well. The problem is that
>>> with 10 million rows, the output becomes very large. I let it run for 90
>>> minutes and it was up to 80GB of disk space and still going. The original
>>> input data was only 500MB.
>>>
>>> Is this something I can optimize in hive? Or should I be considering a
>>> different approach to the problem instead?
>>>
>>> Any guidance here would be helpful. Thank you.
>>>
>>
>>
>
