hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-5144) HashTableSink allocates empty new Object[] arrays & OOMs - use a static emptyRow instead
Date Fri, 23 Aug 2013 17:22:52 GMT

     [ https://issues.apache.org/jira/browse/HIVE-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-5144:
--------------------------

    Attachment: HIVE-5144.01.patch

With the attached patch, memory usage drops from 199 MB per million rows to approximately 99 MB per million rows.

{code}
2013-08-23 05:14:06	Processing rows:	1900000	Hashtable size:	1899999	Memory usage:	197394288	percentage:	0.391
...
OK
	2475
Dr.	40003
Mrs.	16612
Ms.	16617
Mr.	23590
Miss	16368
Sir	23394
{code}
                
> HashTableSink allocates empty new Object[] arrays & OOMs - use a static emptyRow instead
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-5144
>                 URL: https://issues.apache.org/jira/browse/HIVE-5144
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>         Environment: Ubuntu LXC + -Xmx4096m client opts
>            Reporter: Gopal V
>            Assignee: Gopal V
>            Priority: Minor
>         Attachments: HIVE-5144.01.patch
>
>
> The map-join hashtable sink in the local task creates an in-memory hashtable with the following code.
> {code}
>  Object[] value = JoinUtil.computeMapJoinValues(row, joinValues[alias],
> ...
>  MapJoinRowContainer rowContainer = tableContainer.get(key);
>     if (rowContainer == null) {
>       rowContainer = new MapJoinRowContainer();
>       rowContainer.add(value);
> {code}
> But for a query where joinValues[alias].size() == 0, this allocates a new empty Object[] for every row, which is a large number of unnecessary allocations. These would be better served by a copy-on-write default value container and a single pre-allocated zero-length Object[], which is effectively immutable (a zero-length array is the only kind of array in Java that cannot be modified).
> The query tested is roughly the following, which scans all of customer_demographics in the hash-sink:
> {code}
> select c_salutation, count(1)
>  from customer
>       JOIN customer_demographics ON customer.c_current_cdemo_sk = customer_demographics.cd_demo_sk
>  group by c_salutation
>  limit 10
> ;
> {code}
> When running with current trunk, the query fails with an OOM with 512 MB of RAM.
> {code}
> 2013-08-23 05:11:26	Processing rows:	1400000	Hashtable size:	1399999	Memory usage:	292418944	percentage:	0.579
> Execution failed with exit status: 3
> Obtaining error information
> {code}
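
The shared empty row idea described above can be sketched roughly as follows. This is a minimal illustration, not the contents of HIVE-5144.01.patch: the class EmptyRowSketch, its methods, and the List-based row representation are all hypothetical stand-ins for Hive's JoinUtil and MapJoinRowContainer. It covers only the pre-allocated zero-length array, not the copy-on-write default container.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the map-join hash table build; not Hive's actual classes. */
final class EmptyRowSketch {
    /** A zero-length array has no slots to overwrite, so one instance can be shared by all rows. */
    static final Object[] EMPTY_ROW = new Object[0];

    /** Returns the shared empty row when no value columns are requested, else a fresh array. */
    static Object[] computeValues(List<Object> row, int[] valueColumns) {
        if (valueColumns.length == 0) {
            return EMPTY_ROW; // no per-row allocation
        }
        Object[] values = new Object[valueColumns.length];
        for (int i = 0; i < valueColumns.length; i++) {
            values[i] = row.get(valueColumns[i]);
        }
        return values;
    }

    /** Builds a key -> rows table, taking the join key from column keyColumn of each row. */
    static Map<Object, List<Object[]>> build(List<List<Object>> rows, int keyColumn, int[] valueColumns) {
        Map<Object, List<Object[]>> table = new HashMap<>();
        for (List<Object> row : rows) {
            table.computeIfAbsent(row.get(keyColumn), k -> new ArrayList<>())
                 .add(computeValues(row, valueColumns));
        }
        return table;
    }
}
```

In this sketch every key still gets its own ArrayList, which is the allocation a copy-on-write default value container would avoid; only the per-row Object[] allocation is removed here.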

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
