asterixdb-notifications mailing list archives

From "Yingyi Bu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1556) Hash Table used by External hash group-by doesn't conform to the budget.
Date Thu, 08 Sep 2016 21:58:20 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475150#comment-15475150 ]

Yingyi Bu commented on ASTERIXDB-1556:
--------------------------------------

Just to add two things here:

1. The following formula applies to group-by, because the hash table cannot hold duplicate
entries for a unique key.
{noformat}
 Min(32M/(8+X+1), 2^(8X)) * 8 + Min(32M/(8+X+1), 2^(8X)) * 32 
{noformat}

For join, it should be:
{noformat}
   8 * 32M/(8+X+1)  +   32 * 32M/(8+X+1)
{noformat}
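
To make the two formulas concrete, here is a minimal sketch that just evaluates them. It assumes that "32M" means 32 MiB, that X is the per-tuple payload size in bytes, and that the number of entries is rounded down to an integer. The constants 8, 1, and 32 are copied as-is from the formulas, and the function names are made up for illustration.

```python
# Assumption: "32M" in the formulas is read as 32 MiB.
BUDGET_BYTES = 32 * 1024 * 1024


def max_entries(x, group_by):
    """Upper bound on hash table entries for a payload of x bytes (x assumed)."""
    # Entries that fit in the budget: 32M / (8 + X + 1), rounded down.
    by_budget = BUDGET_BYTES // (8 + x + 1)
    if group_by:
        # Group-by keeps one entry per unique key, so the count is also
        # capped by the number of distinct x-byte values, 2^(8X).
        return min(by_budget, 2 ** (8 * x))
    # Join has no such cap, since duplicate keys are allowed.
    return by_budget


def hash_table_bytes(x, group_by):
    """Evaluate entries * 8 + entries * 32 from the formulas above."""
    n = max_entries(x, group_by)
    return n * 8 + n * 32


if __name__ == "__main__":
    for x in (1, 2, 4, 8, 64, 256):
        print(x, hash_table_bytes(x, True), hash_table_bytes(x, False))
```

Under these assumptions, the group-by value stays small for 1- and 2-byte payloads because of the 2^(8X) cap, and the join value shrinks as X grows.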

Usually, the payload (of the build branch) of a join is larger than that of a group-by, and
hence the hash table memory limit is less of a problem.

2. In most cases, a field in the data payload is at least 4 bytes, except in a few cases:
-- int8/boolean/null/missing: not a problem for grouping, as the number of unique keys is limited.
-- int16: not a problem for grouping, as the number of unique keys is limited.
-- string, which includes header bytes and content bytes: if a UTF8 string is less than
4 bytes overall, it's not a problem for grouping, as the number of unique keys is limited.

Therefore, assuming each field has at least 4 bytes for the payload should be safe for the
group-by operation.  Of course, we can tune that at a finer level if we do more analysis when
we have the type information, e.g., count/avg aggregate functions return an int64 (8 additional
bytes for the payload).

So, in summary, the larger the tuples in the data table are, the smaller the memory footprint
the hash table requires.









> Hash Table used by External hash group-by doesn't conform to the budget.
> ------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>            Priority: Critical
>              Labels: soon
>         Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf, 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply the multi-way fuzzy-join ( > 2), the system generates an out-of-memory exception.
> Since a fuzzy-join is created using 30-40 lines of AQL codes and this AQL is translated into massive number of operators (more than 200 operators in the plan for a 3-way fuzzy join), it could generate out-of-memory exception.
> /// Update: as the discussion goes, we found that hash table in the external hash group by doesn't conform to the frame limit. So, an out of memory exception happens during the execution of an external hash group by operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
