spark-issues mailing list archives

From "Harish (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
Date Fri, 17 Mar 2017 01:01:50 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929259#comment-15929259 ]

Harish commented on SPARK-18789:
--------------------------------

When you create the DataFrame dynamically, without knowing the types of its columns, you cannot define the schema up front. In my case I do not know the type of a column in advance. When the column type is not defined and the entire column is None, I get this error. I hope that is clear.
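
Here is a minimal PySpark sketch of how I read the failure, assuming Spark 2.0.2 with the Hive-based ORC writer; the DataFrame contents and output path are made-up illustrations, not taken from the actual job. A column built from a bare None literal gets NullType, and Hive's type-string parser has no mapping for it, which is the 'null' token in the quoted error.

    # Reproduction sketch (hypothetical data and output path).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # col3 is a bare None literal, so Spark types it as NullType.
    df = (spark.createDataFrame([("a", 1), ("b", 1), ("c", 1), ("d", 1)],
                                ["col1", "col2"])
               .withColumn("col3", lit(None)))

    df.printSchema()  # col3 is reported as "null" (NullType)

    # Expected to fail with the IllegalArgumentException quoted below,
    # because NullType cannot be expressed in Hive's type string.
    df.write.format("orc").save("/tmp/spark18789_repro", mode="overwrite")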

> Save Data frame with Null column-- exception
> --------------------------------------------
>
>                 Key: SPARK-18789
>                 URL: https://issues.apache.org/jira/browse/SPARK-18789
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>            Reporter: Harish
>
> I am trying to save a DataFrame to HDFS in which one column is entirely NULL (no data).
> col1  col2  col3
> a     1     null
> b     1     null
> c     1     null
> d     1     null
> Code: df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> 	at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> 	at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> 	at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> 	at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> 	at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> 	at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> 	at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> 	at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> 	at scala.Option.foreach(Option.scala:257)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
> 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1923)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> 	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 49 of 'string:string:string:double:string:double:string:null' but 'null' is found.
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
> 	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
> 	at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:182)
> 	at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:225)
> 	at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
> 	at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	... 1 more
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 12.3 in stage 512.0 (TID 37297, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 11.3 in stage 512.0 (TID 37295, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 16.3 in stage 512.0 (TID 37296, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 8.3 in stage 512.0 (TID 37299, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 WARN TaskSetManager: Lost task 0.3 in stage 512.0 (TID 37298, 10.63.136.108): TaskKilled (killed intentionally)
> 16/12/08 19:41:49 ERROR DefaultWriterContainer: Job job_201612081941_0000 aborted.
> Traceback (most recent call last):
>   File "scripts/abc.py", line 134, in <module>
>    <<code>>
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 547, in save
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
>   File "/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o1857.save. 
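
A workaround that appears consistent with the quoted traceback (a sketch, not a confirmed fix): cast any NullType columns to a concrete type before writing, so the type string handed to the ORC serde never contains 'null'. StringType is an arbitrary stand-in; df and path stand for the objects from the quoted code.

    # Workaround sketch: replace NullType columns before the ORC write.
    from pyspark.sql.functions import col
    from pyspark.sql.types import NullType

    for field in df.schema.fields:
        if isinstance(field.dataType, NullType):
            # Cast to string here; use whichever type the column should
            # eventually hold once real data arrives.
            df = df.withColumn(field.name, col(field.name).cast("string"))

    df.write.format("orc").save(path, mode='overwrite')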



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


