spark-issues mailing list archives

From "Ryan Williams (JIRA)" <>
Subject [jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
Date Wed, 12 Nov 2014 02:27:33 GMT


Ryan Williams commented on SPARK-3630:

I'm seeing many Snappy {{FAILED_TO_UNCOMPRESS(5)}} and {{PARSING_ERROR(2)}} errors. I just
built Spark yesterday off of [227488d|], so
I expected it to have picked up some of the fixes detailed in this thread. I am running
on a YARN cluster whose 100 nodes have kernel 2.6.32, so in a few of these attempts I used
{{spark.file.transferTo=false}}, and still saw these errors.
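For reference, a conf key like this can be supplied via {{--conf}} on {{spark-submit}} or set programmatically; a minimal fragment (the app name is a placeholder, not from my actual job):

```scala
// Config fragment only (requires the Spark dependency; not runnable standalone).
// One way to apply spark.file.transferTo in code; the app name is a placeholder.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("placeholder-app")
  .set("spark.file.transferTo", "false")
```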

Here are some notes about some of my runs, along with the stdout I got:
* 1000 partitions, {{spark.file.transferTo=false}}: [stdout|].
This was my latest run; it took a while to get to my reduceByKeyLocally stage, and immediately
upon finishing the preceding stage it emitted ~190K {{FetchFailure}}s over ~200 attempts of
the stage in about one minute, followed by some Snappy errors and the job shutting down.
* 2000 partitions, {{spark.file.transferTo=false}}: [stdout|].
This one had ~150 FetchFailures out of the gate, 
seemingly ran fine for ~8mins, then hit a futures timeout, ran fine for another
~17m, then got to my reduceByKeyLocally stage and died from Snappy errors.
* 2000 partitions, {{spark.file.transferTo=true}}: [stdout|].
Before running the above two, I was hoping that {{spark.file.transferTo=false}} was going
to fix my problems, so I ran this to see whether >2000 partitions was the determining factor
in the Snappy errors happening, as [~joshrosen] suggested in this thread. No such luck! ~15
FetchFailures right away, ran fine for 24mins, got to reduceByKeyLocally phase, Snappy-failed
and died.
* these and other stdout logs can be found [here|]

In all of these I was running on a dataset (~170GB) that should be easily handled by my cluster
(5TB RAM total), and in fact I successfully ran this job against this dataset last night using
a Spark 1.1 build. That job was dying of FetchFailures when I tried to run against a larger
dataset (~300GB), and I thought maybe I needed shuffle sorting or external shuffle service,
or other 1.2.0 goodies, so I've been trying to run with 1.2.0 but can't get anything to finish.

This job reads a file in from Hadoop, coalesces to the number of partitions I've asked for,
and does a {{flatMap}}, a {{reduceByKey}}, a map, and a {{reduceByKeyLocally}}. I am pretty
confident that the {{Map}} I'm materializing onto the driver in the {{reduceByKeyLocally}}
is a reasonable size; it's a {{Map[Long, Long]}} with about 40K entries, and I've actually
successfully run this job on this data to materialize that exact map at different points this
week, as I mentioned before. Something causes this job to die almost immediately upon starting
the {{reduceByKeyLocally}} phase, however, usually just with Snappy errors, but with a preponderance
of FetchFailures preceding them in my last attempt.
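In case it clarifies what that final step computes: here is a minimal sketch in plain Scala (deliberately without the Spark dependency; the object name and sample data are mine, not from the actual job) of {{reduceByKeyLocally}} semantics, i.e. merging values per key into a local {{Map}} on the driver:

```scala
// A sketch of reduceByKeyLocally semantics: fold (key, value) pairs into a
// Map on the driver, merging values per key with the given function.
// Sample data below is illustrative, not from the original job.
object ReduceByKeyLocallySketch {
  def reduceByKeyLocally[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(1L, 2L, 3L, 4L, 5L, 6L)
    // Key each record by parity and sum per key, yielding a small Map[Long, Long].
    val result = reduceByKeyLocally(records.map(x => (x % 2, x)))(_ + _)
    println(result)
  }
}
```

The point being: the driver-side result is tiny ({{Map[Long, Long]}}, ~40K entries), so the failure shouldn't be about materializing it.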

Let me know what other information I can provide that might be useful. Thanks!

> Identify cause of Kryo+Snappy PARSING_ERROR
> -------------------------------------------
>                 Key: SPARK-3630
>                 URL:
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
> A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted
(see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an application-specific
Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: failed to uncompress the ...
>     at com.esotericsoftware.kryo.serializers.DefaultSerializers$...
>     at com.esotericsoftware.kryo.serializers.DefaultSerializers$...
>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(...)
>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(...)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so the faulty
commit can be fixed and merged back into master.

This message was sent by Atlassian JIRA
