asterixdb-notifications mailing list archives

From "Taewoo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1556) Hash Table used by External hash group-by doesn't conform to the budget.
Date Thu, 08 Sep 2016 21:36:21 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475096#comment-15475096 ]

Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------

Another discussion with [~buyingyi] regarding the hash table size estimation:

This is Yingyi's idea: rather than letting the system admin set a parameter, the compiler can
provide a more reasonable number using a worst-case scenario.

Given the group-memory budget, and given that each tuple in the data table consists of at
least 9 bytes (tuple offset, field offset, and type tag) + X bytes (real data payload), the
compiler can assign a memory budget to the hash table. The details are:

Assume each tuple in the data table has only one field:

4 bytes for tuple offset
4 bytes for field offset
X bytes for payload
1 byte for type tag

If the data table occupies 32 MB, the hash table needs the following size:
{code}
Min(32M/(8+X+1), 2^(8X)) * 8 + Min(32M/(8+X+1), 2^(8X)) * 32
{code}

1 byte: 256 * 40 = 10 KB
2 bytes: 65,536 * 40 = 2.6 MB
3 bytes: (32M/12) * 40 = 106 MB
4 bytes: (32M/13) * 40 = 98 MB
5 bytes: (32M/14) * 40 = 91 MB
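
The per-payload sizes above can be reproduced with a small script. This is only a sketch of the formula in this comment, not AsterixDB code: the function name `hash_table_bytes` is hypothetical, and it assumes M means 2^20 bytes and that each hash table entry costs 8 + 32 = 40 bytes, as in the formula.

```python
MB = 1024 * 1024  # assumption: "32M" in the formula means 32 * 2^20 bytes


def hash_table_bytes(data_table_bytes, payload_bytes):
    """Worst-case hash table size for one-field tuples (hypothetical helper)."""
    # Tuple layout from the comment: 4 (tuple offset) + 4 (field offset)
    # + X (payload) + 1 (type tag).
    tuple_bytes = 4 + 4 + payload_bytes + 1
    # The data table cannot hold more tuples than fit in its budget ...
    max_tuples = data_table_bytes // tuple_bytes
    # ... and an X-byte payload cannot take more than 2^(8X) distinct values.
    distinct_values = 2 ** (8 * payload_bytes)
    entries = min(max_tuples, distinct_values)
    # 8 bytes + 32 bytes per entry, as in the formula.
    return entries * 8 + entries * 32


if __name__ == "__main__":
    for x in range(1, 6):
        size = hash_table_bytes(32 * MB, x)
        print(f"{x}-byte payload: {size} bytes ({size / MB:.2f} MB)")
```

For small payloads the number of distinct values is the limit, while from 3 bytes onward the number of tuples that fit in the data table is the limit, which is why the size peaks at 3 bytes and then shrinks.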

So 106 MB (the 3-byte case) is the maximal value. Using the 4-byte figure, the ratio of the
hash table is 98 / (32 + 98) = 0.75. This ratio does not change even if we change the budget.
So, for any one-field tuple, we can assign 75% of the group-memory budget to the hash table.

Similarly, for the multiple-field tuple cases:
2 fields (in the grouping result):
4 bytes for tuple offset
8 bytes for field offset
2X bytes for payload
2 bytes for type tag

58/(32+58) = 0.64

3 fields (in the grouping result):
4 bytes for tuple offset
16 bytes for field offset
3X bytes for payload
3 bytes for type tag

36/(32+36) = 0.53

We can calculate this ratio. In conclusion, we can set the ratio based on the number of fields
and the group-memory budget.
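
As a rough check, the three ratios can be recomputed from the tuple layouts listed in this comment. The sketch below is illustrative only: `hash_table_ratio` and `FIELD_OFFSET_BYTES` are hypothetical names, and it assumes a 4-byte payload per field (X = 4) and 40 bytes per hash table entry, which is the combination that reproduces the 98, 58, and 36 MB figures. It also assumes the number of tuples, not the number of distinct values, is the limiting factor.

```python
# Field-offset bytes per tuple, copied from the layouts in the comment
# (1 field: 4, 2 fields: 8, 3 fields: 16).
FIELD_OFFSET_BYTES = {1: 4, 2: 8, 3: 16}


def hash_table_ratio(fields, payload_bytes=4, entry_bytes=40):
    """Share of the group-memory budget for the hash table (hypothetical helper)."""
    # Tuple layout: 4 (tuple offset) + field offsets + payload + 1 type tag per field.
    tuple_bytes = 4 + FIELD_OFFSET_BYTES[fields] + fields * payload_bytes + fields
    # With N tuples, the data table takes N * tuple_bytes and the hash table
    # N * entry_bytes, so N (and therefore the budget) cancels out of the ratio.
    return entry_bytes / (entry_bytes + tuple_bytes)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} field(s): ratio = {hash_table_ratio(n):.3f}")
```

Because the tuple count cancels out, the ratio depends only on the per-tuple and per-entry sizes, which matches the observation that changing the budget does not change the ratio.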
 

> Hash Table used by External hash group-by doesn't conform to the budget.
> ------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>            Priority: Critical
>              Labels: soon
>         Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf, 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (> 2), the system generates an out-of-memory exception.
> Since a fuzzy-join is written in 30-40 lines of AQL code and this AQL is translated into a massive number of operators (more than 200 operators in the plan for a 3-way fuzzy join), it could generate an out-of-memory exception.
> /// Update: as the discussion went on, we found that the hash table in the external hash group-by doesn't conform to the frame limit. So, an out-of-memory exception happens during the execution of an external hash group-by operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
