spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
Date Tue, 07 Apr 2015 12:38:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483103#comment-14483103 ]

Sean Owen commented on SPARK-6738:
----------------------------------

To be clear, I am asking how big the data being spilled is in memory. The GC state isn't relevant.
That is, is the data just compressing 10x on serialization into the files you see? That would
not be crazy.
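
The gap between an in-memory size estimate and the serialized, compressed size can be illustrated with a small sketch. This is a Python analogue only, not Spark's estimator or its spill path: the `records` data, `in_memory_estimate`, and `spilled_size` are hypothetical names made up for the illustration, and `sys.getsizeof` is just a rough stand-in for a heap size estimate (it counts shared objects more than once).

```python
import pickle
import sys
import zlib


def in_memory_estimate(records):
    """Rough footprint of a list of (str, int) tuples, in bytes.

    Adds up the list, each tuple, and each tuple field. Shared objects are
    counted repeatedly, so this is an upper-bound style approximation.
    """
    total = sys.getsizeof(records)
    for rec in records:
        total += sys.getsizeof(rec)
        total += sum(sys.getsizeof(field) for field in rec)
    return total


def spilled_size(records):
    """Bytes produced by serializing and then compressing the records."""
    return len(zlib.compress(pickle.dumps(records)))


# Hypothetical, highly repetitive data: many records, few distinct values.
records = [("key-%d" % (i % 10), i % 100) for i in range(100_000)]

estimate = in_memory_estimate(records)
on_disk = spilled_size(records)
print("in-memory estimate:      %d bytes" % estimate)
print("serialized + compressed: %d bytes" % on_disk)
print("ratio:                   %.1fx" % (estimate / on_disk))
```

How large the ratio gets depends entirely on how repetitive the data is, so the number printed here says nothing about the job in this issue.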

> EstimateSize  is difference with spill file size
> ------------------------------------------------
>
>                 Key: SPARK-6738
>                 URL: https://issues.apache.org/jira/browse/SPARK-6738
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>
> ExternalAppendOnlyMap spills 2.2 GB of data to disk:
> {code}
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far)
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> But the file size is only 2.2M.
> {code}
> ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
> total 2.2M
> -rw-r----- 1 spark users 2.2M Apr  7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> The GC log shows that the JVM heap usage is less than 1 GB.
> {code}
> 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
> 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
> 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
> 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
> 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
> 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
> 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
> {code}
> The estimateSize value differs hugely from the spill file size.
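
The figures in the report can be checked with a little arithmetic. The sketch below parses the GC lines quoted above and compares the peak heap occupancy with the reported spill estimate and file size. It assumes each GC line reads as occupancy before -> occupancy after (heap capacity), and that "2.2 GB" and "2.2M" are binary units; `parse_gc` is a hypothetical helper written for this example.

```python
import re

# GC lines copied from the report.
GC_LOG = """\
2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
"""

GC_LINE = re.compile(r"\[GC (\d+)K->(\d+)K\((\d+)K\)")


def parse_gc(log):
    """Return a (before_kb, after_kb, capacity_kb) tuple for each GC line."""
    return [tuple(int(g) for g in m.groups()) for m in GC_LINE.finditer(log)]


events = parse_gc(GC_LOG)
peak_before_mb = max(before for before, _, _ in events) / 1024
peak_after_mb = max(after for _, after, _ in events) / 1024

# Sizes from the report, read as binary units (an assumption).
estimated_spill_mb = 2.2 * 1024
spill_file_mb = 2.2

print("peak heap before GC: %.1f MB" % peak_before_mb)
print("peak heap after GC:  %.1f MB" % peak_after_mb)
print("estimated spill:     %.1f MB" % estimated_spill_mb)
print("estimate / file:     %.0fx" % (estimated_spill_mb / spill_file_mb))
```

Under those assumptions the heap never holds as much as the estimated 2.2 GB before a collection, and the estimate is about three orders of magnitude larger than the file on disk.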



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

