hive-dev mailing list archives

From "Jeff Hammerbacher (JIRA)" <>
Subject [jira] Updated: (HIVE-135) need more accurate way of tracking memory consumption on map side aggregates
Date Sun, 14 Dec 2008 23:09:44 GMT


Jeff Hammerbacher updated HIVE-135:

    Component/s: Query Processor

Adding to "Query Processor" component.

> need more accurate way of tracking memory consumption on map side aggregates
> ----------------------------------------------------------------------------
>                 Key: HIVE-135
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
> from email thread:
> Just trying it out - I am confused by one thing:
> hive> set;
> set;
> hive> explain from mytable u insert overwrite directory '/user/jssarma/tmp_agg' select u.a, avg(size(u.b)) group by u.a;
> Everything looks good. Now I submit this query and this is what I see on the tracker:
> Map input records 87,912,961 0 87,912,961 
> Map output records 87,912,960 0 87,912,960
> This doesn't make sense. With map-side aggregates, we should be getting a vastly reduced number of rows emitted from the mapper.
> I am wondering whether we should rethink our flushing logic. The freeMemory() call is not reliable, since it doesn't account for garbage that the GC has not yet cleaned out. Perhaps we should switch to an explicit setting for the amount of memory for hash tables (we know the size of each hash table entry and the overall size, and should be able to estimate reasonably). From what Dhruba reported, there's no way to call the garbage collector and wait for it to complete (to get a more accurate report of free memory), so the whole route of obtaining free memory seems a little hosed.
> By way of comparison, Hadoop also estimates memory usage in sorting. There, the sort run is stored in a sequential stream, and it simply compares the size of the stream to the maximum allowed sort memory usage (which is a configuration option).
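
To make the proposal in the description concrete, here is a minimal sketch of a map-side hash aggregation that flushes on an explicit memory budget instead of consulting freeMemory(). It is not Hive's actual operator code: the class name BoundedAvgAggregator, the maxBytes and bytesPerEntry parameters, and the tab-separated output rows are all hypothetical, and the per-entry size is assumed to be a fixed estimate supplied by the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: partial aggregation for avg(), flushed on an explicit byte budget. */
class BoundedAvgAggregator {
    /** Partial state for avg(): running sum and row count for one group key. */
    static final class Partial {
        long sum;
        long count;
    }

    private final long maxBytes;       // hypothetical configured budget for the hash table
    private final long bytesPerEntry;  // assumed fixed size estimate for one entry
    private final Map<String, Partial> table = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    BoundedAvgAggregator(long maxBytes, long bytesPerEntry) {
        if (maxBytes <= 0 || bytesPerEntry <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.maxBytes = maxBytes;
        this.bytesPerEntry = bytesPerEntry;
    }

    /** Folds one input row into the partial aggregate for its key. */
    void add(String key, long value) {
        Partial p = table.get(key);
        if (p == null) {
            // Decide from the estimated table size alone; the JVM heap is never queried.
            if ((table.size() + 1) * bytesPerEntry > maxBytes) {
                flush();
            }
            p = new Partial();
            table.put(key, p);
        }
        p.sum += value;
        p.count++;
    }

    /** Emits every partial aggregate as "key<TAB>sum<TAB>count" and empties the table. */
    void flush() {
        for (Map.Entry<String, Partial> e : table.entrySet()) {
            Partial p = e.getValue();
            emitted.add(e.getKey() + "\t" + p.sum + "\t" + p.count);
        }
        table.clear();
    }

    /** Rows emitted so far; a real mapper would write these to its output collector. */
    List<String> emitted() {
        return emitted;
    }
}
```

With this scheme the number of emitted rows depends only on the key distribution and the configured budget, so a run like the one above (one output row per input row) would point to a budget that is too small or a grouping key with very high cardinality, rather than to garbage-collection timing.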

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
