tajo-dev mailing list archives

From "Hyunsik Choi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TAJO-292) Too many intermediate partition files
Date Thu, 05 Dec 2013 07:00:38 GMT

    [ https://issues.apache.org/jira/browse/TAJO-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839885#comment-13839885 ]

Hyunsik Choi commented on TAJO-292:

One thing is missing: this patch should also handle symmetric repartition joins. The approach
is very similar to your group-by work.

For that, please take a look at the code below. It chooses the smaller table and derives
the number of partitions from that table's volume. After this patch is applied, the maximum
number of partitions is limited to the number of worker slots. So, you need to choose the
larger table as the base table for calculating the number of tasks.

{code:title=502 line in SubQuery.java}
// for inner
        ExecutionBlock inner = childs.get(1);
        long innerVolume = getInputVolume(subQuery.masterPlan, subQuery.context, inner);
        LOG.info("Outer volume: " + Math.ceil((double)outerVolume / 1048576));
        LOG.info("Inner volume: " + Math.ceil((double)innerVolume / 1048576));

        long smaller = Math.min(outerVolume, innerVolume);

        int mb = (int) Math.ceil((double)smaller / 1048576);
        LOG.info("Smaller Table's volume is approximately " + mb + " MB");
        // determine the number of task
        int taskNum = (int) Math.ceil((double)mb /
        LOG.info("The determined number of join partitions is " + taskNum);
        return taskNum;
{code}
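
The change suggested above could look roughly like the following. This is only a sketch, not the actual patch: the class name JoinPartitionSketch, the method determineJoinPartitions, and the parameters volumePerTaskMb and workerSlots are hypothetical names introduced for illustration. In particular, the divisor used in the quoted SubQuery.java snippet is cut off above, so volumePerTaskMb is a stand-in for whatever value is really used there. The sketch picks the larger input as the base volume and then caps the result at the number of worker slots.

```java
final class JoinPartitionSketch {
    private static final long BYTES_PER_MB = 1048576L;

    /**
     * Hypothetical helper: derives the number of join partitions from the
     * LARGER of the two input volumes, capped at the number of worker slots.
     *
     * @param outerVolume     outer input size in bytes
     * @param innerVolume     inner input size in bytes
     * @param volumePerTaskMb assumed target volume per task in MB (illustrative)
     * @param workerSlots     assumed number of available worker slots (illustrative)
     */
    public static int determineJoinPartitions(long outerVolume, long innerVolume,
                                              int volumePerTaskMb, int workerSlots) {
        if (volumePerTaskMb <= 0 || workerSlots <= 0) {
            throw new IllegalArgumentException("volumePerTaskMb and workerSlots must be positive");
        }
        // Use the larger table as the base, as suggested in the comment.
        long larger = Math.max(outerVolume, innerVolume);
        long mb = (long) Math.ceil((double) larger / BYTES_PER_MB);

        // Number of tasks the volume alone would call for.
        long byVolume = (long) Math.ceil((double) mb / volumePerTaskMb);

        // Cap at the worker slots, but always return at least one partition.
        return (int) Math.max(1L, Math.min(byVolume, (long) workerSlots));
    }

    public static void main(String[] args) {
        long outer = 10240L * BYTES_PER_MB; // 10 GB outer input
        long inner = 256L * BYTES_PER_MB;   // 256 MB inner input
        System.out.println(determineJoinPartitions(outer, inner, 128, 32));
    }
}
```

With a 10 GB outer input, a 256 MB inner input, an assumed 128 MB per task, and 32 worker slots, the volume alone would call for 80 partitions, and the cap reduces that to 32. Basing the calculation on the smaller table instead would give only 2 partitions here, which is why the larger table is the better base once the cap is in place.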

> Too many intermediate partition files
> -------------------------------------
>                 Key: TAJO-292
>                 URL: https://issues.apache.org/jira/browse/TAJO-292
>             Project: Tajo
>          Issue Type: Bug
>          Components: repartitioning
>    Affects Versions: 0.2-incubating
>            Reporter: Hyunsik Choi
>            Assignee: Jinho Kim
>            Priority: Critical
>             Fix For: 0.8-incubating
>         Attachments: TAJO-292.patch, TAJO-292_2.patch
> Unlike before, the number of partitions is currently determined by the volume size and
> the number of distinct keys. This can cause unnecessary overhead. We need to improve
> the partition number determiner to take the number of cluster nodes into account.
