spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-7075) Project Tungsten: Improving Physical Execution and Memory Management
Date Thu, 07 May 2015 09:20:01 GMT

     [ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7075:
-------------------------------
    Description: 
Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or network,
but by CPU and memory. This project focuses on three areas to improve the memory and CPU
efficiency of Spark applications, pushing performance closer to the limits of the underlying
hardware.

1. Memory Management and Binary Processing: leveraging application semantics to manage memory
explicitly and eliminate the overhead of the JVM object model and garbage collection
2. Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
3. Code generation: using code generation to exploit modern compilers and CPUs

*High Level Goals for Memory Management and Binary Processing*
- Avoiding non-transient Java objects (storing them in binary format instead), which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means we spill less.
- Better memory accounting (size in bytes) rather than relying on heuristics
- For operators that understand data types (in the case of DataFrames and SQL), working directly
against the binary format in memory, i.e. with no serialization/deserialization
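
To illustrate the binary-processing goals above, here is a minimal sketch in Python (chosen only for brevity). The fixed-width row layout and the names ROW, pack_rows and sum_scores are assumptions made for this example; they are not Spark's actual row format or API. Records live back to back in one buffer, so the memory used is known exactly in bytes, and an aggregation reads a field at a known offset instead of rebuilding one object per record.

```python
import struct

# Hypothetical fixed-width row layout: an 8-byte signed integer followed by
# an 8-byte float, little-endian. This is not Spark's actual row format.
ROW = struct.Struct("<qd")


def pack_rows(rows):
    """Store (id, score) tuples back to back in a single contiguous buffer."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return buf


def sum_scores(buf):
    """Aggregate the score column by reading it straight out of the buffer."""
    total = 0.0
    for offset in range(0, len(buf), ROW.size):
        # Read only the 8-byte float that sits 8 bytes into each row.
        (score,) = struct.unpack_from("<d", buf, offset + 8)
        total += score
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
print(len(buf))         # exact footprint: 3 rows * 16 bytes = 48
print(sum_scores(buf))  # 4.0
```

The point of the sketch is the accounting: len(buf) is the exact footprint, with no per-object headers to estimate.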

*High Level Goals for Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle
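
One way such a sort could be organized, shown as a sketch only: keep a short fixed-size key prefix next to each record index and compare prefixes first, so that most comparisons touch a compact array rather than following a reference to the full record. The description does not say which technique is used, so this choice is an assumption, as are the name sort_by_key_prefix and the prefix length. Python cannot demonstrate the cache effect itself; the sketch only shows the data layout and comparison order.

```python
from functools import cmp_to_key


def sort_by_key_prefix(records, prefix_len=4):
    """Sort (key, value) records by key, comparing short key prefixes first."""
    # Each entry pairs a fixed-size key prefix with the record's index.
    entries = [(key[:prefix_len], i) for i, (key, _value) in enumerate(records)]

    def compare(a, b):
        # Fast path: the prefixes alone decide the order.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Slow path: equal prefixes, so look up the full keys.
        full_a, full_b = records[a[1]][0], records[b[1]][0]
        return (full_a > full_b) - (full_a < full_b)

    entries.sort(key=cmp_to_key(compare))
    return [records[i] for _prefix, i in entries]


records = [(b"spark-core", 1), (b"sparkling", 2), (b"shuffle", 3), (b"sql", 4)]
print(sort_by_key_prefix(records))
```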

*High Level Goals for Code Generation*
- Faster expression evaluation and SQL operators
- Faster serializer
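
As a rough illustration of why code generation can speed up expression evaluation, the sketch below evaluates the same expression two ways: by walking a tree for every row, and by generating source once and compiling it into a function. The tuple-based tree format and the names interpret, to_source and compile_expr are hypothetical and unrelated to Spark's own expression classes.

```python
# Hypothetical expression tree: ("col", name), ("lit", value),
# or (op, left, right) where op is "+" or "*".
OPS = {"+", "*"}


def interpret(expr, row):
    """Evaluate by walking the tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind!r}")
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if kind == "+" else left * right


def to_source(expr):
    """Translate the tree into one Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind!r}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate and compile the code once; the result is reused per row."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


expr = ("+", ("*", ("col", "price"), ("col", "qty")), ("lit", 1))
row = {"price": 3, "qty": 4}
evaluate = compile_expr(expr)
print(to_source(expr))
print(interpret(expr, row), evaluate(row))
```

The tree walk repeats its dispatch on every row, while the generated function pays that cost once at compile time.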


Several parts of Project Tungsten leverage the DataFrame model, which gives us more semantics
about the application. We will also retrofit the improvements onto Spark’s RDD API whenever
possible.


  was:
Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or network,
but by CPU and memory. This project focuses on three areas to improve the memory and CPU
efficiency of Spark applications, pushing performance closer to the limits of the underlying
hardware.

1. Memory Management and Binary Processing: leveraging application semantics to manage memory
explicitly and eliminate the overhead of the JVM object model and garbage collection
2. Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
3. Code generation: using code generation to exploit modern compilers and CPUs

*High Level Goals for Memory Management*
- Avoiding non-transient Java objects (storing them in binary format instead), which reduces GC overhead.
- Minimizing memory usage through a denser in-memory data format, which means we spill less.
- Better memory accounting (size in bytes) rather than relying on heuristics

*High Level Goals for Cache-aware Computation*
- Faster sorting and hashing for aggregations, joins, and shuffle

*High Level Goals for Code Generation*
- Faster expression evaluation and SQL operators
- Faster serializer


Several parts of Project Tungsten leverage the DataFrame model, which gives us more semantics
about the application. We will also retrofit the improvements onto Spark’s RDD API whenever
possible.



> Project Tungsten: Improving Physical Execution and Memory Management
> --------------------------------------------------------------------
>
>                 Key: SPARK-7075
>                 URL: https://issues.apache.org/jira/browse/SPARK-7075
>             Project: Spark
>          Issue Type: Epic
>          Components: Block Manager, Shuffle, Spark Core, SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)