spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maciej Bryński (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SPARK-16996) Hive ACID delta files not seen
Date Thu, 16 Nov 2017 09:19:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254997#comment-16254997
] 

Maciej Bryński commented on SPARK-16996:
----------------------------------------

I changed hive libraries to version provided by Hortonworks.

Result:
{code}
 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$5.apply(TableReader.scala:395)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$5.apply(TableReader.scala:395)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:438)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:429)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

CC: [~stevel@apache.org]

> Hive ACID delta files not seen
> ------------------------------
>
>                 Key: SPARK-16996
>                 URL: https://issues.apache.org/jira/browse/SPARK-16996
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
>         Environment: Hive 1.2.1, Spark 1.5.2
>            Reporter: Benjamin BONNET
>            Priority: Critical
>
> spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as describe here : http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 BUCKETS
>     ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>     STORED AS 
>       INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>       OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>     TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs          0 2016-08-10 14:03 /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta will be compute into a base file :
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a       a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a       a
> b       b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs          0 2016-08-10 15:31 /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message