tajo-issues mailing list archives

From "Hyunsik Choi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TAJO-900) Reducing memory usage during query processing
Date Thu, 03 Jul 2014 03:07:24 GMT

     [ https://issues.apache.org/jira/browse/TAJO-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyunsik Choi updated TAJO-900:
------------------------------

    Description: 
Currently, Tajo uses tuple structures implemented as Java objects, which internally use Datum
objects. The current Tuple structure resides in JVM heap space. As a result, it is hard to control
memory usage, and it is impossible to predict garbage collection behavior. This problem usually
becomes severe when Tajo processes very large data sets on a relatively small cluster with many
grouping or join keys.

I've run various tests and made a prototype to show that this problem can be eliminated.

The main ideas are as follows:
 * Do not use the Datum class in expression evaluation. Instead, we should use Java primitive
type values.
   ** It will significantly reduce object creation and memory usage.
 * Redesign Tuple using direct memory allocation (DirectByteBuffer or Unsafe.allocateMemory).
   ** It allows each worker to control memory usage during in-memory operations like sort
and hash aggregation/joins.
   ** It enables column values to be stored in adjacent memory, improving cache locality.
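
To make the two ideas above concrete, here is a minimal sketch of a fixed-width row block kept
outside the JVM heap. It is not the Tajo prototype: the class name OffHeapRowBlock, the
two-column schema (long id, double price), and all method names are assumptions made for
illustration. It uses ByteBuffer.allocateDirect and reads and writes primitives only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical fixed-width row block stored outside the JVM heap (illustration only). */
final class OffHeapRowBlock {
    // Assumed schema: (long id, double price), so each row occupies 16 adjacent bytes.
    private static final int ROW_WIDTH = Long.BYTES + Double.BYTES;

    private final ByteBuffer buffer;
    private final int capacityRows;
    private int rowCount = 0;

    OffHeapRowBlock(int capacityRows) {
        this.capacityRows = capacityRows;
        // allocateDirect reserves native memory, so stored rows are not heap objects.
        this.buffer = ByteBuffer.allocateDirect(capacityRows * ROW_WIDTH)
                                .order(ByteOrder.nativeOrder());
    }

    /** Appends one row; returns false when the block is full so the caller can react. */
    boolean append(long id, double price) {
        if (rowCount == capacityRows) {
            return false;
        }
        int offset = rowCount * ROW_WIDTH;
        buffer.putLong(offset, id);                   // absolute write, no boxing
        buffer.putDouble(offset + Long.BYTES, price);
        rowCount++;
        return true;
    }

    long getId(int row) {
        return buffer.getLong(checkRow(row) * ROW_WIDTH);
    }

    double getPrice(int row) {
        return buffer.getDouble(checkRow(row) * ROW_WIDTH + Long.BYTES);
    }

    int rowCount() {
        return rowCount;
    }

    /** Bytes occupied by appended rows; known exactly because rows are fixed-width. */
    long usedBytes() {
        return (long) rowCount * ROW_WIDTH;
    }

    /** Sums the price column using primitives only; no per-value object is created. */
    double sumPrice() {
        double sum = 0.0;
        for (int row = 0; row < rowCount; row++) {
            sum += buffer.getDouble(row * ROW_WIDTH + Long.BYTES);
        }
        return sum;
    }

    private int checkRow(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return row;
    }
}
```

Because every row has a fixed width, the memory held by a block is known exactly at any
time, which is what would let a worker decide when an in-memory operation has to stop growing.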

To achieve the above, we should do the following:
 * Implement an alternative (i.e., runtime bytecode generation) to the EvalNode framework in
order to avoid the use of Datum and Java objects.
 * Design a new tuple data structure using direct memory allocation.
 * Refactor existing operators so that they are controlled according to current memory usage.
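
As a rough illustration of the last item, the sketch below shows a grouping operator that
checks a memory budget before admitting a new group and spills when the budget is exhausted.
Everything here is hypothetical and not taken from Tajo: the names MemoryBudget and
BudgetedGroupCounter, the assumed flat cost of 64 bytes per group, and the spill step, which
only discards state and counts the event. For brevity it uses an on-heap HashMap; the
proposal in this issue would keep that state in directly allocated memory instead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-worker memory budget (illustration only). */
final class MemoryBudget {
    private final long limitBytes;
    private long usedBytes = 0;

    MemoryBudget(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves bytes if they fit under the limit; returns false otherwise. */
    boolean tryReserve(long bytes) {
        if (usedBytes + bytes > limitBytes) {
            return false;
        }
        usedBytes += bytes;
        return true;
    }

    void release(long bytes) {
        usedBytes = Math.max(0, usedBytes - bytes);
    }

    long usedBytes() {
        return usedBytes;
    }
}

/** Hypothetical count-per-key operator that consults the budget before growing. */
final class BudgetedGroupCounter {
    // Assumed flat cost per group; a real operator would measure its own state size.
    static final long BYTES_PER_GROUP = 64;

    private final MemoryBudget budget;
    private final Map<Long, Long> counts = new HashMap<>();
    private int spills = 0;

    BudgetedGroupCounter(MemoryBudget budget) {
        this.budget = budget;
    }

    void add(long key) {
        if (!counts.containsKey(key) && !budget.tryReserve(BYTES_PER_GROUP)) {
            spill(); // free memory, then retry the reservation once
            if (!budget.tryReserve(BYTES_PER_GROUP)) {
                throw new IllegalStateException("budget cannot hold a single group");
            }
        }
        counts.merge(key, 1L, Long::sum);
    }

    /** Placeholder spill: a real operator would write partial aggregates to disk first. */
    private void spill() {
        budget.release(counts.size() * BYTES_PER_GROUP);
        counts.clear();
        spills++;
    }

    int groupCount() {
        return counts.size();
    }

    int spillCount() {
        return spills;
    }
}
```

The point of the design is that the operator never grows past what the budget grants, so
memory pressure becomes an explicit decision rather than a side effect of garbage collection.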

This is an umbrella issue. I'll create subtasks, and I've already started some issues. I'll
use this jira to track them.

  was:
Currently, Tajo uses tuple structures implemented as Java objects, which internally use Datum
objects. The current Tuple structure resides in JVM heap space. As a result, it is hard to control
memory usage, and it is impossible to predict garbage collection behavior. This problem usually
becomes severe when Tajo processes very large data sets on a relatively small cluster with many
grouping or join keys.

I've run various tests and made a prototype to show that this problem can be eliminated.

The main ideas are as follows:
 * Do not use the Datum class in expression evaluation. Instead, we should use Java primitive
type values.
   ** It will significantly reduce object creation and memory usage.
 * Redesign Tuple using direct memory allocation (DirectByteBuffer or Unsafe.allocateMemory).
   ** It allows each worker to control memory usage during in-memory operations like sort
and hash aggregation/joins.
   ** It enables column values to be stored in adjacent memory, improving cache locality.

To achieve the above, we should do the following:
 * Implement an alternative (i.e., runtime bytecode generation) to the EvalNode framework in
order to avoid the use of Datum and Java objects.
 * Design a new tuple data structure using direct memory allocation.
 * Refactor the in-memory sort, hash aggregation, and hash join operators.

This is an umbrella issue. I'll create subtasks, and I've already started some issues. I'll
use this jira to track them.


> Reducing memory usage during query processing
> ---------------------------------------------
>
>                 Key: TAJO-900
>                 URL: https://issues.apache.org/jira/browse/TAJO-900
>             Project: Tajo
>          Issue Type: Improvement
>          Components: physical operator, storage
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.9.0
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
