From: cloud-fan
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Message-Id: <20171104163402.5C199DFC25@git1-us-west.apache.org>
Date: Sat, 4 Nov 2017 16:34:01 +0000 (UTC)

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r148935336

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
@@ -39,3 +58,134 @@ private[sql] object OrcFileFormat {
     names.foreach(checkFieldName)
   }
 }
+
+class DefaultSource extends OrcFileFormat
+
+/**
+ * New ORC File Format based on Apache ORC 1.4.1 and above.
+ */
+class OrcFileFormat
+  extends FileFormat
+  with DataSourceRegister
+  with Serializable {
+
+  override def shortName(): String = "orc"
+
+  override def toString: String = "ORC_1.4"
+
+  override def hashCode(): Int = getClass.hashCode()
+
+  override def equals(other: Any): Boolean = other.isInstanceOf[OrcFileFormat]
+
+  override def inferSchema(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      files: Seq[FileStatus]): Option[StructType] = {
+    OrcUtils.readSchema(sparkSession, files)
+  }
+
+  override def prepareWrite(
+      sparkSession: SparkSession,
+      job: Job,
+      options: Map[String, String],
+      dataSchema: StructType): OutputWriterFactory = {
+    val orcOptions = new OrcOptions(options, sparkSession.sessionState.conf)
+
+    val conf = job.getConfiguration
+
+    conf.set(MAPRED_OUTPUT_SCHEMA.getAttribute, OrcUtils.getSchemaString(dataSchema))
+
+    conf.set(COMPRESS.getAttribute, orcOptions.compressionCodec)
+
+    conf.asInstanceOf[JobConf]
+      .setOutputFormat(classOf[org.apache.orc.mapred.OrcOutputFormat[OrcStruct]])
+
+    new OutputWriterFactory {
+      override def newInstance(
+          path: String,
+          dataSchema: StructType,
+          context: TaskAttemptContext): OutputWriter = {
+        new OrcOutputWriter(path, dataSchema, context)
+      }
+
+      override def getFileExtension(context: TaskAttemptContext): String = {
+        val compressionExtension: String = {
+          val name = context.getConfiguration.get(COMPRESS.getAttribute)
+          OrcOptions.extensionsForCompressionCodecNames.getOrElse(name, "")
+        }
+
+        compressionExtension + ".orc"
+      }
+    }
+  }
+
+  override def isSplitable(
+      sparkSession: SparkSession,
+      options: Map[String, String],
+      path: Path): Boolean = {
+    true
+  }
+
+  override def buildReaderWithPartitionValues(
--- End diff --

We should override `buildReader` and return `GenericInternalRow` here. The parent class will then merge the partition values and output `UnsafeRow`. This is what the current `OrcFileFormat` does, so let's keep that behavior for now.
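
To illustrate the suggestion in the comment above, here is a minimal sketch of the pattern it describes: a subclass implements only `buildReader`, and the parent's default `buildReaderWithPartitionValues` wraps it and appends the partition values to each data row. Everything in this sketch (`SketchFileFormat`, `SketchFile`, `SketchOrcFormat`, the `Row` alias, the fake file contents) is a hypothetical stand-in written in plain Scala, not Spark's actual classes or signatures, and the conversion to `UnsafeRow` that the comment mentions is left out.

```scala
// Simplified stand-ins for illustration only; none of these are Spark classes.
object BuildReaderSketch {
  // A row is modelled as a plain vector of values (assumption, not InternalRow).
  type Row = Vector[Any]

  // Hypothetical stand-in for one file split plus its partition values.
  final case class SketchFile(path: String, partitionValues: Row)

  trait SketchFileFormat {
    // Subclasses implement only this: decode the data columns of one file.
    def buildReader(requiredColumns: Seq[String]): SketchFile => Iterator[Row]

    // Default supplied by the parent: wrap buildReader and append the
    // partition values to every data row it produces.
    def buildReaderWithPartitionValues(
        requiredColumns: Seq[String]): SketchFile => Iterator[Row] = {
      val dataReader = buildReader(requiredColumns)
      file => dataReader(file).map(dataRow => dataRow ++ file.partitionValues)
    }
  }

  // Hypothetical format that overrides buildReader and nothing else.
  object SketchOrcFormat extends SketchFileFormat {
    // Pretend file contents, so the sketch runs without any I/O.
    private val fakeFileContents: Seq[Map[String, Any]] = Seq(
      Map("id" -> 1, "name" -> "a"),
      Map("id" -> 2, "name" -> "b"))

    override def buildReader(
        requiredColumns: Seq[String]): SketchFile => Iterator[Row] =
      _ => fakeFileContents.iterator.map(rec => requiredColumns.map(rec).toVector)
  }

  // Read one fake file through the parent's wrapper.
  def readAll(): List[Row] = {
    val reader = SketchOrcFormat.buildReaderWithPartitionValues(Seq("id", "name"))
    reader(SketchFile("part-00000.orc", Vector("2017-11-04"))).toList
  }
}
```

Calling `BuildReaderSketch.readAll()` yields each data row with the partition value appended, even though `SketchOrcFormat` never handles partition values itself. That is the division of labor the comment asks for: the format stays responsible for decoding data columns, and the shared parent logic handles the merge.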