spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-22367) Separate the serialization of class and object for iteraor
Date Mon, 06 Nov 2017 08:12:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-22367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sean Owen resolved SPARK-22367.
-------------------------------
    Resolution: Won't Fix

> Separate the serialization of class and object for iteraor
> ----------------------------------------------------------
>
>                 Key: SPARK-22367
>                 URL: https://issues.apache.org/jira/browse/SPARK-22367
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Xianyang Liu
>
> Becuase they are all the same class for an iterator.  So there is no need write class
information for every record in the iterator. We only need write the class information once
at the serialization beginning, also only need read the class information once for deserialization.
> In this patch, we separate the serialization of class and object for an iterator serialized
by Kryo. This can improve the performance of the serialization and deserialization, and save
the space.
> Test case:
> ```scala
>     val conf = new SparkConf().setAppName("Test for serialization")
>     val sc = new SparkContext(conf)
>     val random = new Random(1)
>     val data = sc.parallelize(1 to 1000000000).map { i =>
>       Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
>     }.persist(StorageLevel.OFF_HEAP)
>     var start = System.currentTimeMillis()
>     data.count()
>     println("First time: " + (System.currentTimeMillis() - start))
>     start = System.currentTimeMillis()
>     data.count()
>     println("Second time: " + (System.currentTimeMillis() - start))
> ```
> Test result:
> The size of serialized:
> before: 34.3GB
> after: 17.5GB
> | before(cal+serialization)| before(deserialization)| after(cal+serialization)| after(deserialization)
|
> | ------| ------ | ------ | ------ | 
> | 63869| 21882|  45513| 15158|
> | 59368| 21507|  51683| 15524|
> | 66230| 21481|  62163| 14903|
> | 62399| 22529|  52400| 16255|



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message