spark-issues mailing list archives

From "Zhan Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-2883) Spark Support for ORCFile format
Date Tue, 14 Oct 2014 23:04:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171654#comment-14171654 ]

Zhan Zhang edited comment on SPARK-2883 at 10/14/14 11:04 PM:
--------------------------------------------------------------

I have almost finished the prototype; the following is the draft spec for this JIRA. I will
wrap up my patch and upload it soon. It is a small patch, around 1,000 lines of code including
the test suite. Since another PR is already open under SPARK-3720, which duplicates this JIRA,
I may work with the author of that JIRA to consolidate our work. But if anybody thinks opening
another PR is better (not necessarily to commit), please let me know.

1. Basic operators: saveAsOrcFile and orcFile. The former saves a table to an ORC-format
file; the latter imports an ORC-format file into a Spark SQL table.
2. Column pruning.
3. Self-contained schema support: the ORC support is fully functional independently of the
Hive metastore; the table schema is maintained by the ORC file itself.
4. To use ORC files, the user needs to import org.apache.spark.sql.hive.orc._ to bring ORC
support into the context.
5. ORC files are operated through HiveContext, the only reason being a packaging issue: we
don't want to bring a Hive dependency into Spark SQL core. Note that ORC operations do not
rely on the Hive metastore.
6. It supports the full range of complex data types in Spark SQL, for example list, seq, and
nested data types.
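To make the self-contained schema point concrete, here is a minimal plain-Scala sketch (no Spark or ORC dependency; parseStructSchema and Field are invented names for illustration) of what it means for the schema to travel inside the file: the reader recovers the column layout from a type description stored in the file itself, in the style of ORC's struct<...> type strings, instead of consulting the Hive metastore. The sketch handles only flat schemas; real ORC type strings can nest.

```scala
// Sketch only: a flat "struct<name:type,...>" description, as an ORC file
// carries in its footer, is enough to rebuild the table schema on read.
case class Field(name: String, typ: String)

def parseStructSchema(s: String): Seq[Field] = {
  val inner = s.stripPrefix("struct<").stripSuffix(">")
  inner.split(",").toSeq.map { part =>
    val Array(n, t) = part.split(":")
    Field(n, t)
  }
}

val fields = parseStructSchema("struct<name:string,age:int>")
```

A real reader does the same recovery from the file footer, which is why no metastore lookup is needed.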

Hive 0.13.1 support.
With a minor change, after Spark's Hive dependency is upgraded to 0.13.1:
1. ORC can support different compression codecs, e.g., SNAPPY, LZO, ZLIB, and NONE.
2. Predicate pushdown.
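As a conceptual sketch of what predicate pushdown buys (plain Scala, no ORC dependency; StripeStats and mayContain are invented names), ORC keeps min/max statistics per stripe, so a filter such as age BETWEEN 13 AND 19 can skip entire stripes whose value range cannot overlap the predicate:

```scala
// Sketch only: per-stripe min/max statistics let the reader prove that a
// stripe cannot contain matching rows, and skip reading it entirely.
case class StripeStats(min: Int, max: Int)

// A stripe may hold matching rows only if its [min, max] range overlaps
// the predicate's [lower, upper] range.
def mayContain(stats: StripeStats, lower: Int, upper: Int): Boolean =
  stats.max >= lower && stats.min <= upper

val stripes = Seq(StripeStats(0, 10), StripeStats(11, 25), StripeStats(30, 50))
val surviving = stripes.filter(s => mayContain(s, 13, 19))
// only the 11..25 stripe overlaps [13, 19]; the other two are skipped
```

The real mechanism pushes the SQL predicate down into the ORC reader; the overlap test above is the essence of the stripe-elimination step.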

The following is an example of using ORC files; from the user's perspective it is almost
identical to the Parquet support.

// Spark shell (Spark 1.1-era API).
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Build a SchemaRDD from a plain text file, attaching an explicit schema.
val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Save the table as an ORC file, then read it back and query it.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")
val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)



> Spark Support for ORCFile format
> --------------------------------
>
>                 Key: SPARK-2883
>                 URL: https://issues.apache.org/jira/browse/SPARK-2883
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>            Reporter: Zhan Zhang
>            Priority: Blocker
>         Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in Spark, fix issues if any exist, and add documentation
of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

