spark-issues mailing list archives

From "Reynold Xin (JIRA)" <>
Subject [jira] [Updated] (SPARK-7075) Project Tungsten (Spark 1.5 Phase 1)
Date Thu, 06 Aug 2015 18:16:16 GMT


Reynold Xin updated SPARK-7075:
    Epic Name: Tungsten Phase 1  (was: Tungsten 1.5)

> Project Tungsten (Spark 1.5 Phase 1)
> ------------------------------------
>                 Key: SPARK-7075
>                 URL:
>             Project: Spark
>          Issue Type: Epic
>          Components: Block Manager, Shuffle, Spark Core, SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
> Based on our observations, the majority of Spark workloads are bottlenecked not by I/O or
> network, but by CPU and memory. This project focuses on three areas to improve the memory
> and CPU efficiency of Spark applications and push performance closer to the limits of the
> underlying hardware.
> *Memory Management and Binary Processing*
> - Avoiding non-transient Java objects (storing them in binary format instead), which reduces GC overhead
> - Minimizing memory usage through a denser in-memory data format, which means we spill less
> - More accurate memory accounting (actual size in bytes) rather than relying on heuristics
> - For operators that understand data types (in the case of DataFrames and SQL), work
> directly against binary format in memory, i.e. have no serialization/deserialization
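
To make the binary-format idea concrete, here is a minimal sketch in Python. It is not Tungsten's actual row format; the two-field layout (`ROW`) and the helper names `pack_rows` and `sum_scores` are assumptions made up for illustration. It shows rows packed into one contiguous buffer, a column aggregated by reading bytes in place, and a memory footprint that is known exactly rather than estimated.

```python
import struct

# Hypothetical fixed-width row layout (an assumption for this sketch):
# 8-byte signed integer id followed by an 8-byte float score, no padding.
ROW = struct.Struct("<qd")
SCORE_OFFSET = 8  # the score field starts right after the 8-byte id


def pack_rows(rows):
    """Store all rows in one contiguous bytearray instead of one object per row."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return buf


def sum_scores(buf):
    """Aggregate the score column by reading it straight out of the buffer."""
    total = 0.0
    for offset in range(0, len(buf), ROW.size):
        # Only the needed field is decoded; no row object is materialized.
        (score,) = struct.unpack_from("<d", buf, offset + SCORE_OFFSET)
        total += score
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
print(len(buf))         # exact footprint in bytes, no heuristic needed
print(sum_scores(buf))
```

Because the buffer length is the memory used by the data, accounting becomes a matter of reading `len(buf)` instead of estimating object sizes.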
> *Cache-aware Computation*
> - Faster sorting and hashing for aggregations, joins, and shuffle
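
The message does not say how sorting is made cache-aware. One common technique, assumed here purely for illustration, is to sort small fixed-size (key prefix, record index) entries and to look at the full record only when two prefixes tie. The sketch below shows that comparison logic; `key_prefix` and `prefix_sort` are hypothetical names, and plain Python will not show the cache benefit itself, only the structure.

```python
import struct
from functools import cmp_to_key


def key_prefix(key: bytes) -> int:
    """First 8 bytes of the key as a big-endian unsigned integer, zero padded."""
    return struct.unpack(">Q", key[:8].ljust(8, b"\x00"))[0]


def prefix_sort(records):
    """Return record indices ordered by key, comparing prefixes first."""
    # Each entry is (prefix, index): compact and stored side by side,
    # so most comparisons never touch the records themselves.
    entries = [(key_prefix(key), i) for i, (key, _) in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Prefixes tie: fall back to the full keys stored in the records.
        ka, kb = records[a[1]][0], records[b[1]][0]
        return (ka > kb) - (ka < kb)

    entries.sort(key=cmp_to_key(compare))
    return [i for _, i in entries]


records = [(b"pear", 3), (b"applesauna", 1), (b"applesauce", 2), (b"fig", 4)]
order = prefix_sort(records)
print([records[i][0] for i in order])
```

The two keys starting with `applesau` share a prefix, so they are the only pair that requires reading the full keys.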
> *Code Generation*
> - Faster expression evaluation and DataFrame/SQL operators
> - Faster serializer
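
As a rough illustration of why code generation speeds up expression evaluation, the sketch below contrasts walking an expression tree for every row with compiling the tree once into a single function. It is a toy in Python, not Spark's code generator; the tuple-based tree encoding and the names `interpret`, `to_source` and `generate` are assumptions invented for this example.

```python
# Expression tree as nested tuples (a made-up encoding for this sketch):
# ("col", name), ("lit", value), or (op, left, right).
OPS = {"add": "+", "mul": "*", "gt": ">"}


def interpret(expr, row):
    """Evaluate by walking the tree for every row (the slow baseline)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    if kind == "gt":
        return left > right
    raise ValueError(f"unknown operator: {kind}")


def to_source(expr):
    """Translate the tree into a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    return f"({to_source(expr[1])} {OPS[kind]} {to_source(expr[2])})"


def generate(expr):
    """Compile the tree once; the result is called per row with no tree walk."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


expr = ("gt", ("add", ("col", "price"), ("lit", 2)),
        ("mul", ("col", "qty"), ("lit", 3)))
row = {"price": 10, "qty": 3}
evaluate = generate(expr)
print(to_source(expr))
print(evaluate(row), interpret(expr, row))
```

The generated function removes the per-row dispatch on node type, which is the same cost that generated code avoids in a query engine.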
> Several parts of Project Tungsten leverage the DataFrame model, which gives us more semantic
> information about the application. We will also retrofit the improvements onto Spark’s RDD API
> whenever possible.
