spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Brad Willard (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
Date Fri, 12 Jun 2015 14:53:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Brad Willard updated SPARK-8128:
--------------------------------
    Description: 
I'm loading a folder of parquet files with about 600 parquet files and loading it into one
dataframe so schema merging is involved. There is some bug with the schema merging that you
print the schema and it shows all the attributes. However when you run a query and filter
on that attribute is errors saying it's not in the schema. The query is incorrectly going
to one of the parquet files that does not have that attribute.

sdf = sql_context.parquet('/parquet/big_data_folder')
sdf.printSchema()
root
 \|-- _id: string (nullable = true)
 \|-- addedOn: string (nullable = true)
 \|-- attachment: string (nullable = true)
 .......
\|-- items: array (nullable = true)
 \|    |-- element: struct (containsNull = true)
 \|    |    |-- _id: string (nullable = true)
 \|    |    |-- addedOn: string (nullable = true)
 \|    |    |-- authorId: string (nullable = true)
 \|    |    |-- mediaProcessingState: long (nullable = true)
 \|-- mediaProcessingState: long (nullable = true)
 \|-- title: string (nullable = true)
 \|-- key: string (nullable = true)

sdf.filter(sdf.mediaProcessingState == 3).count()

causes this exception

Py4JJavaError: An error occurred while calling o67.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage 4.0
failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXXXXXXXXXXXXXX):
java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
    at parquet.Preconditions.checkArgument(Preconditions.java:47)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
    at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
    at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
    at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


You also get the same error if you register it as a temp table and try to execute the same
sql query.

  was:
I'm loading a folder of parquet files with about 600 parquet files and loading it into one
dataframe so schema merging is involved. There is some bug with the schema merging that you
print the schema and it shows and attributes. However when you run a query and filter on that
attribute is errors saying it's not in the schema. The query is incorrectly going to one of
the parquet files that does not have that attribute.

sdf = sql_context.parquet('/parquet/big_data_folder')
sdf.printSchema()
root
 \|-- _id: string (nullable = true)
 \|-- addedOn: string (nullable = true)
 \|-- attachment: string (nullable = true)
 .......
\|-- items: array (nullable = true)
 \|    |-- element: struct (containsNull = true)
 \|    |    |-- _id: string (nullable = true)
 \|    |    |-- addedOn: string (nullable = true)
 \|    |    |-- authorId: string (nullable = true)
 \|    |    |-- mediaProcessingState: long (nullable = true)
 \|-- mediaProcessingState: long (nullable = true)
 \|-- title: string (nullable = true)
 \|-- key: string (nullable = true)

sdf.filter(sdf.mediaProcessingState == 3).count()

causes this exception

Py4JJavaError: An error occurred while calling o67.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage 4.0
failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXXXXXXXXXXXXXX):
java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
    at parquet.Preconditions.checkArgument(Preconditions.java:47)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
    at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
    at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
    at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
    at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
    at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


You also get the same error if you register it as a temp table and try to execute the same
sql query.


> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> --------------------------------------------------------------------
>
>                 Key: SPARK-8128
>                 URL: https://issues.apache.org/jira/browse/SPARK-8128
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0
>            Reporter: Brad Willard
>
> I'm loading a folder of parquet files with about 600 parquet files and loading it into
one dataframe so schema merging is involved. There is some bug with the schema merging that
you print the schema and it shows all the attributes. However when you run a query and filter
on that attribute is errors saying it's not in the schema. The query is incorrectly going
to one of the parquet files that does not have that attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
>  \|-- _id: string (nullable = true)
>  \|-- addedOn: string (nullable = true)
>  \|-- attachment: string (nullable = true)
>  .......
> \|-- items: array (nullable = true)
>  \|    |-- element: struct (containsNull = true)
>  \|    |    |-- _id: string (nullable = true)
>  \|    |    |-- addedOn: string (nullable = true)
>  \|    |    |-- authorId: string (nullable = true)
>  \|    |    |-- mediaProcessingState: long (nullable = true)
>  \|-- mediaProcessingState: long (nullable = true)
>  \|-- title: string (nullable = true)
>  \|-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage
4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXXXXXXXXXXXXXX):
java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
>     at parquet.Preconditions.checkArgument(Preconditions.java:47)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>     at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>     at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>     at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> You also get the same error if you register it as a temp table and try to execute the
same sql query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message