spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Finding bad data
Date Fri, 25 Apr 2014 02:26:03 GMT
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it.
Look at the stderr file of the executor on that machine, and you’ll see lines like this:

14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000

This says which file it was reading, as well as the byte range (that's the 0+2000 part:
start offset 0, length 2000). Unfortunately, because the executor runs multiple tasks at the
same time, this message is hard to associate with a particular task unless you configure
only one core per executor. But it may help you spot the file.
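
As an aside, the format there is just path:start+length, so if you have a lot of stderr to
sift through, a few lines of Python can pull out the file and byte range. This is just a
sketch, not anything built into Spark:

    import re

    # Parse a HadoopRDD "Input split" log line of the form path:start+length.
    # Quick helper for sifting executor stderr; not part of Spark itself.
    def parse_input_split(log_line):
        m = re.search(r"Input split: (.+):(\d+)\+(\d+)\s*$", log_line)
        if m is None:
            return None
        path, start, length = m.group(1), int(m.group(2)), int(m.group(3))
        return path, start, start + length  # file, first byte, end byte (exclusive)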

The other way you might do it is to run a map() over the data, before you process it, that
checks for error conditions; there you could print out the original input line.
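
For example, something like this in PySpark (a rough sketch; json.loads stands in for
whatever parser is actually failing, and the path is made up):

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="find-bad-data")

    def parse_or_flag(line):
        # json.loads stands in for whatever parser is raising the error.
        try:
            return ("ok", json.loads(line))
        except Exception as e:
            return ("bad", (line, str(e)))

    records = sc.textFile("hdfs:///path/to/data").map(parse_or_flag)

    # Pull back a sample of the offending lines with their error messages.
    for line, err in records.filter(lambda kv: kv[0] == "bad") \
                            .map(lambda kv: kv[1]).take(10):
        print("Bad input: %r (%s)" % (line, err))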

I realize that neither of these is ideal. I've opened https://issues.apache.org/jira/browse/SPARK-1622
to try to expose this information somewhere else, ideally in the UI. The reason it hasn't
been done so far is that some tasks in Spark can read from multiple Hadoop InputSplits
(e.g. if you use coalesce(), zip(), or similar), so it's tough to do in a fully general
way.
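
To make that concrete: after something like the following, a single task can be reading
from several splits at once, so there's no one file and offset to report for it (path
made up again):

    rdd = sc.textFile("hdfs:///path/to/data")   # roughly one partition per input split
    merged = rdd.coalesce(4)                    # each partition may now cover many splits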

Matei

On Apr 24, 2014, at 6:15 PM, Jim Blomo <jim.blomo@gmail.com> wrote:

> I'm using PySpark to load some data and getting an error while
> parsing it.  Is it possible to find the source file and line of the bad
> data?  I imagine that this would be extremely tricky when dealing with
> multiple derived RDDs, so an answer with the caveat of "this only
> works when running .map() on a textFile() RDD" is totally fine.
> Perhaps if the line number and file were available in PySpark I could
> catch the exception and output it with the context?
> 
> Any way to narrow down the problem input would be great. Thanks!

