spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
Date Tue, 05 Apr 2016 00:46:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263 ]

Hyukjin Kwon edited comment on SPARK-14103 at 4/5/16 12:45 AM:
---------------------------------------------------------------

Oh, sorry, I should have mentioned that once the parser meets a quote character that is never
closed, it reads all of the remaining data (line separators included) as a single value.

In {{BulkCsvReader}}, the input is a {{Reader}} converted from an {{Iterator}}, so from the point
of view of the Univocity parser the data is not processed line by line.

If each line were passed to the parser as its own input through the {{Iterator}}, it would behave
just as you said; but the data is passed as a {{Reader}} over the whole content. So after an
unclosed quote the parser ignores line separators as well as delimiters, and ends up reading all
of the remaining data as a single value.
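
As a concrete example, here is a minimal standalone sketch (the sample data, delimiter and names
below are my own, not taken from this report) of how Univocity keeps reading past line separators
once it hits a quote that is never closed:

{code}
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import scala.collection.JavaConverters._

object UnclosedQuoteDemo {
  def main(args: Array[String]): Unit = {
    val settings = new CsvParserSettings()
    settings.getFormat.setDelimiter('\t')
    settings.getFormat.setLineSeparator("\n")

    // The second field of the first row opens a quote that is never closed,
    // so the parser keeps consuming the following lines (including the
    // '\n' characters) as part of that single value.
    val data = "id1\t\"unterminated value\nid2\tnormal\nid3\tnormal\n"

    val parser = new CsvParser(settings)
    val rows = parser.parseAll(new StringReader(data)).asScala
    rows.foreach(row => println(row.mkString("[", " | ", "]")))
  }
}
{code}

On a large file the swallowed value eventually exceeds the parser's maximum column length, which
is what produces the {{TextParsingException}} (and the dump of parsed content) reported below.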

-Actually, this is one of the reasons why I am thinking of changing this library to Apache's. It
seems Univocity only takes its input as a {{Reader}}, whereas Apache's takes a {{String}}, which
can easily be produced from an {{Iterator}} (as far as I remember).-


> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark built from the master branch, and when I run the following command on a
> large tab-separated file, the contents of the file get written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>                                                          (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
>         Privacy-shake",: a haptic interface for managing privacy settings in mobile location sharing applications       privacy shake a haptic interface for managing privacy settings in mobile location sharing applications  2010    2010/09/07              international conference on human computer interaction  interact                43331058        19371[\n]        3D4F6CA1        Between the Profiles: Another such Bias. Technology Acceptance Studies on Social Network Services       between the profiles another such bias technology acceptance studies on social network services 2015    2015/08/02      10.1007/978-3-319-21383-5_12    international conference on human-computer interaction  interact                43331058        19502[\n]
> .......
> .........
> web snippets    2008    2008/05/04      10.1007/978-3-642-01344-7_13    international conference on web information systems and technologies    webist          44F29802        19489
> 06FA3FFA        Interactive 3D User Interfaces for Neuroanatomy Exploration     interactive 3d user interfaces for neuroanatomy exploration     2009                    internationa]
>         at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
>         at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
>         at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>         at org.apache.spark.scheduler.Task.run(Task.scala:82)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
> ^M[Stage 1:>                                                          (0 + 1) / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error, but as soon as I go
> above 100,000 samples, I start getting the error.
> I don't think the Spark platform should ever output the actual data to stderr, as it decreases
> readability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
