asterixdb-notifications mailing list archives

From "Taewoo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1556) Hash Table used by External hash group-by doesn't conform to the budget.
Date Wed, 10 Aug 2016 02:04:20 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414598#comment-15414598
] 

Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------

The summary of a Skype call with [~dtabass], [~tillw], [~buyingyi], and [~javierjia]:

#1. The following proposal, written after another meeting, is OK.
1) After each insertion, the Data Table (DT) reports the number of frames it uses (D) and the
Hash Table (HT) reports the number of frames it uses (H).
2) If the insertion fails, we spill a partition of the Data Table to disk to free space in
the Data Table.
3) If the insertion succeeds, we calculate D + H. If D + H >= F (F being the frame budget),
we spill a partition of the Data Table to disk.
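
The three steps above can be sketched as a small decision rule. This is only an illustration of the agreed policy, not code from the AsterixDB codebase: the class name SpillPolicy, the method shouldSpill, and the reading of F as a frame budget passed in by the caller are all assumptions.

```java
/**
 * Hypothetical sketch of the spill rule from item #1.
 * None of these names are AsterixDB/Hyracks APIs.
 */
final class SpillPolicy {
    private final int frameBudget; // F: assumed to be the total frame budget of the operator

    SpillPolicy(int frameBudget) {
        if (frameBudget <= 0) {
            throw new IllegalArgumentException("frame budget must be positive");
        }
        this.frameBudget = frameBudget;
    }

    /**
     * Decides whether a Data Table partition should be spilled after an insertion attempt.
     *
     * @param inserted        whether the insertion succeeded
     * @param dataTableFrames D: frames reported by the Data Table
     * @param hashTableFrames H: frames reported by the Hash Table
     */
    boolean shouldSpill(boolean inserted, int dataTableFrames, int hashTableFrames) {
        if (!inserted) {
            // Step 2: a failed insertion always triggers a spill to free space.
            return true;
        }
        // Step 3: a successful insertion triggers a spill once D + H reaches F.
        // The sum is computed as long so that it cannot overflow int.
        return (long) dataTableFrames + hashTableFrames >= frameBudget;
    }
}
```

With a budget of 10 frames, shouldSpill(true, 4, 5) returns false and shouldSpill(true, 5, 5) returns true, because the check uses >= as in step 3.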

#2. The following is not OK. If the hash table becomes huge, we need to resolve the
problem within the hash table itself. For instance, we can keep track of how much space is
wasted due to hash slot migration and come up with a garbage collection method.
-4) If H >= 0.8 * F, we spill entire partitions of DT to disk and reset the Hash Table.
The 80% threshold can be changed. This step is required to keep the Hash Table from occupying
the entire space.-

#3. The Hash Table size (= the number of possible hash values in the Hash Table) should come
from the system-admin configuration file, and the compiler should observe this setting. If
this value is not provided in the file, it's OK to use the default value specified in the
codebase. Also, a sanity check between hash_table_size and groupmemory should be done, i.e.,
the space occupied by hash entries in the Hash Table cannot exceed the specified budget
(groupmemory).
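
A minimal sketch of the fallback and the sanity check described in #3 follows. Everything in it is illustrative: the class HashTableConfigCheck, the per-slot byte cost passed as a parameter, and the placeholder default are assumptions, not the actual AsterixDB configuration code or its real default value.

```java
/**
 * Hypothetical sketch of the hash_table_size vs. groupmemory check from item #3.
 * The default below is a placeholder, not the value used in the AsterixDB codebase.
 */
final class HashTableConfigCheck {
    static final long PLACEHOLDER_DEFAULT_HASH_TABLE_SIZE = 1_000_003L;

    /** Uses the configured value if present, otherwise falls back to the default. */
    static long resolveHashTableSize(Long configuredSize) {
        return configuredSize != null ? configuredSize : PLACEHOLDER_DEFAULT_HASH_TABLE_SIZE;
    }

    /**
     * Rejects a configuration whose hash entries alone would exceed the budget.
     *
     * @param hashTableSize    number of possible hash values (slots)
     * @param bytesPerSlot     assumed fixed cost of one hash entry, in bytes
     * @param groupMemoryBytes the groupmemory budget, in bytes
     */
    static void validate(long hashTableSize, int bytesPerSlot, long groupMemoryBytes) {
        if (hashTableSize <= 0 || bytesPerSlot <= 0 || groupMemoryBytes <= 0) {
            throw new IllegalArgumentException("all sizes must be positive");
        }
        // Equivalent to hashTableSize * bytesPerSlot > groupMemoryBytes for positive
        // integers, but written as a division so the product cannot overflow.
        if (hashTableSize > groupMemoryBytes / bytesPerSlot) {
            throw new IllegalArgumentException(
                    "hash_table_size " + hashTableSize + " needs more than the groupmemory budget of "
                            + groupMemoryBytes + " bytes");
        }
    }
}
```

Running the check when the configuration is loaded, rather than when the operator starts, would surface a bad setting before any query is executed.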


#4. A query sanity check should be done. After a physical plan is computed, we can calculate
the maximum memory usage of the operators in each stage. A query that cannot be executed
under the global memory setting can then be detected and safely rejected up front rather
than failing with an out-of-memory exception.
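
One way to picture the check in #4 is shown below. It is a sketch under stated assumptions, not the planned implementation: it assumes that operators in the same stage hold their memory at the same time (so their budgets add up), that stages do not overlap, and that a plan can be reduced to a list of per-operator memory figures per stage. The class PlanMemoryCheck and its methods are hypothetical names.

```java
import java.util.List;

/**
 * Hypothetical sketch of the query sanity check from item #4.
 * A plan is modeled as a list of stages; each stage lists the maximum
 * memory (in bytes) that each of its operators may use.
 */
final class PlanMemoryCheck {

    /** Returns the largest per-stage total, assuming operators in a stage run concurrently. */
    static long peakStageMemory(List<List<Long>> stages) {
        long peak = 0;
        for (List<Long> stage : stages) {
            long stageTotal = 0;
            for (long operatorMemory : stage) {
                stageTotal += operatorMemory;
            }
            peak = Math.max(peak, stageTotal);
        }
        return peak;
    }

    /** A plan is considered feasible if its most demanding stage fits the global budget. */
    static boolean isFeasible(List<List<Long>> stages, long globalMemoryBudget) {
        return peakStageMemory(stages) <= globalMemoryBudget;
    }
}
```

A plan with more than 200 operators, such as the 3-way fuzzy join in this issue, is the kind of case where the per-stage total could exceed the global setting even though each operator respects its own budget.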



> Hash Table used by External hash group-by doesn't conform to the budget.
> ------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>         Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf, 3wayjoin.pdf, 3wayjoin.rtf,
3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join ( > 2),
the system generates an out-of-memory exception. 
> Since a fuzzy-join is written in 30-40 lines of AQL code and this AQL is translated
into a massive number of operators (more than 200 operators in the plan for a 3-way fuzzy join),
it could generate an out-of-memory exception.
> /// Update: as the discussion went on, we found that the hash table in the external hash
group-by doesn't conform to the frame limit. So, an out-of-memory exception happens during the
execution of an external hash group-by operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
