From "Ryan Williams (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
Date Wed, 12 Nov 2014 02:27:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207568#comment-14207568 ]

Ryan Williams commented on SPARK-3630:
--------------------------------------

I'm seeing many Snappy {{FAILED_TO_UNCOMPRESS(5)}} and {{PARSING_ERROR(2)}} errors. I just
built Spark yesterday off of [227488d|https://github.com/apache/spark/commit/227488d], so
I expected that to have picked up some of the fixes detailed in this thread. I am running
on a YARN cluster whose 100 nodes have kernel 2.6.32, so in a few of these attempts I used
{{spark.file.transferTo=false}} and still saw these errors.
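
For reference, the flag can also be set on {{SparkConf}} directly; a minimal sketch (the app name is a placeholder, the config key is the real one):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround for transferTo-related shuffle corruption on old kernels:
// tell Spark not to use NIO transferTo when writing shuffle files.
val conf = new SparkConf()
  .setAppName("snappy-repro") // placeholder app name
  .set("spark.file.transferTo", "false")
val sc = new SparkContext(conf)
{code}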

Here are some notes about some of my runs, along with the stdout I got:
* 1000 partitions, {{spark.file.transferTo=false}}: [stdout|https://www.dropbox.com/s/141keqpojucfbai/logs.1000?dl=0].
This was my latest run; it took a while to get to my reduceByKeyLocally stage, and immediately
upon finishing the preceding stage it emitted ~190K {{FetchFailure}}s over ~200 attempts of
the stage in about one minute, followed by some Snappy errors and the job shutting down.
* 2000 partitions, {{spark.file.transferTo=false}}: [stdout|https://www.dropbox.com/s/jr1dsldodq4rvbz/logs.2000?dl=0].
This one had ~150 FetchFailures out of the gate, seemingly ran fine for ~8mins, then had a
futures timeout, seemingly ran fine for another ~17mins, then got to my {{reduceByKeyLocally}}
stage and died from Snappy errors.
* 2000 partitions, {{spark.file.transferTo=true}}: [stdout|https://www.dropbox.com/s/9n24ffcdq0j43ue/logs.2000.tt?dl=0].
Before running the above two, I was hoping that {{spark.file.transferTo=false}} was going
to fix my problems, so I ran this one to see whether >2000 partitions was the determining factor
in the Snappy errors, as [~joshrosen] suggested in this thread. No such luck! ~15
FetchFailures right away, then it ran fine for ~24mins, got to the {{reduceByKeyLocally}} phase,
Snappy-failed, and died.
* these and other stdout logs can be found [here|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]

In all of these I was running on a dataset (~170GB) that should be easily handled by my cluster
(5TB RAM total), and in fact I successfully ran this job against this dataset last night using
a Spark 1.1 build. That job was dying of FetchFailures when I tried to run against a larger
dataset (~300GB), and I thought maybe I needed sort-based shuffle or the external shuffle service,
or other 1.2.0 goodies, so I've been trying to run with 1.2.0 but can't get anything to finish.
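
For concreteness, the 1.2.0 features I had in mind are just configuration settings; a minimal
sketch (sort-based shuffle is already the default in 1.2.0, set here only for explicitness):

{code:scala}
import org.apache.spark.SparkConf

// 1.2.0-era shuffle settings referred to above.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")         // sort-based shuffle (the 1.2.0 default)
  .set("spark.shuffle.service.enabled", "true") // requires the external shuffle service on each YARN node
{code}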

This job reads a file in from Hadoop, coalesces to the number of partitions I've asked for,
and does a {{flatMap}}, a {{reduceByKey}}, a {{map}}, and a {{reduceByKeyLocally}}. I am pretty
confident that the {{Map}} I'm materializing onto the driver in the {{reduceByKeyLocally}}
is a reasonable size; it's a {{Map[Long, Long]}} with about 40K entries, and I've actually
successfully run this job on this data to materialize that exact map at different points this
week, as I mentioned before. Something causes this job to die almost immediately upon starting
the {{reduceByKeyLocally}} phase, however: usually just with Snappy errors, but in my last attempt
with a preponderance of FetchFailures preceding them.
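
To make the shape of the job concrete, here is a schematic sketch (the input path and the
parsing/re-keying logic are hypothetical stand-ins; only the operator chain matches the real job):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)

object JobSketch {
  // Placeholder stand-ins for the real parsing and re-keying logic:
  def parseRecords(line: String): Seq[(Long, Long)] =
    line.split("\\s+").toSeq.map(tok => (tok.hashCode.toLong, 1L))
  def bucket(k: Long): Long = k % 40000L

  def run(sc: SparkContext, numPartitions: Int): scala.collection.Map[Long, Long] =
    sc.textFile("hdfs:///path/to/input")     // ~170GB input; path is a placeholder
      .coalesce(numPartitions)               // 1000 or 2000 in the runs above
      .flatMap(parseRecords)
      .reduceByKey(_ + _)
      .map { case (k, v) => (bucket(k), v) }
      .reduceByKeyLocally(_ + _)             // materializes the ~40K-entry Map on the driver
}
{code}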

Let me know what other information I can provide that might be useful. Thanks!

> Identify cause of Kryo+Snappy PARSING_ERROR
> -------------------------------------------
>
>                 Key: SPARK-3630
>                 URL: https://issues.apache.org/jira/browse/SPARK-3630
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
> com.esotericsoftware.kryo.io.Input.require(Input.java:169)
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325)
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master.
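
(For context: an "application-specific Kryo registrator" as mentioned above is just a class like
the following sketch, wired in via {{spark.kryo.registrator}}; the class name and registered
types here are hypothetical.)

{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator of the kind referenced above.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Float]])                            // example registration
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]]) // example registration
  }
}

// Enabled with:
//   spark.serializer       org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator MyRegistrator
{code}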


