drill-issues mailing list archives

From "Venki Korukanti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3209) [Umbrella] Plan reads of Hive tables as native Drill reads when a native reader for the underlying table format exists
Date Tue, 22 Sep 2015 18:57:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903215#comment-14903215

Venki Korukanti commented on DRILL-3209:

I want to discuss a couple of approaches for supporting native reads of Hive tables. Please
let me know your thoughts.

# Add a {{StoragePluginOptimizerRule}} which converts {{HiveScan}} to {{ParquetGroupScan}}
or {{EasyGroupScan}} (a rough sketch of such a rule is included after this list).
#* Logical approach which aligns with existing rules. 
#* Can accommodate a column-renaming project. For example, the text scan returns columns as
{{column\[0\]}}, which need to be renamed to the column names given in the Hive table DDL.
#* Can accommodate conversion of data formats. For example, Hive Parquet stores timestamps
as INT96, which is interpreted as VARBINARY by Drill's native Parquet reader. This needs to
be converted to Drill's TIMESTAMP type.
#* The problem is that {{ParquetGroupScan}} and {{EasyGroupScan}} hold storage plugin references
which are not available in {{StoragePluginOptimizerRule}}. We could update {{StoragePluginOptimizerRule}}
to expose storage plugin information, but even if we add a reference to {{FileSystemStoragePlugin}},
that plugin may point at one filesystem while the Hive storage path references another
filesystem. In that case we can't directly use the existing {{FileSystemStoragePlugin}}.
#* Scanning of files under the Hive table location. Currently this code is in {{WorkspaceSchema.createTable}},
which is not directly accessible unless we pass a reference to the {{SchemaPlus}} instance of
the {{dfs.default}} schema into {{StoragePluginOptimizerRule}}.
#* Currently {{FileSystemPlugin}} can only read partitioned data if all the partition directories
are under a common directory. This is not *always* true for Hive tables, which can have partitions
anywhere in the filesystem. In order to support this, we need to refactor {{FileSelection}}.
#* Currently {{FileSystemPlugin}} derives partition values from the file path, which is not *always*
possible for Hive tables, which store the partition values in the metastore and may store the data
in arbitrary locations.
# Hybrid of rule-based conversion during logical optimization plus changes at execution time
#* Instead of directly converting the {{HiveScan}} to {{ParquetGroupScan}}, convert {{HiveScan}}
to {{HiveNativeDrillScan}}. {{HiveNativeDrillScan}} is basically an extension of {{HiveScan}}
which overrides {{getCost()}} and returns a different implementation of the subscan.
#* For {{HiveSubScan}}, add a new method {{getRecordReader()}} that returns one or more RecordReaders
during fragment setup. The default implementation returns Hive's native RecordReader; {{HiveNativeDrillSubScan}}
(the subscan counterpart of {{HiveNativeDrillScan}}) returns Drill's native Parquet or text
record reader (see the second sketch after this list).
#* Once {{HiveScan}} is converted to {{HiveNativeDrillScan}}, any projects such as renames or data
format conversions can be added as part of the rule.
#* One point I need to mention about this approach: {{HiveSubScan}} contains {{InputSplits}} which,
in the Parquet case, need to be converted into the appropriate row group number using the same
logic that Hive uses ([here|https://github.com/apache/hive/blob/branch-1.0/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java#L196]).
Drill's native {{ParquetRecordReader}} expects a row group number (see the third sketch after this list).
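
To make the first approach more concrete, here is a rough sketch of what such a rule could look
like. This is only an illustration: the class and constructor signatures are approximate, and
{{nativeScanFor()}} is a hypothetical placeholder for the logic that would have to build a
{{ParquetGroupScan}}/{{EasyGroupScan}} from the Hive table's storage descriptor, which is exactly
where the storage plugin and {{FileSelection}} issues listed above come in.

{code:java}
// Rough sketch only -- not working code. Signatures are approximate and
// nativeScanFor() is a hypothetical placeholder; building the native group scan
// is where the storage-plugin / FileSelection problems listed above appear.
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.drill.exec.physical.base.GroupScan;
import org.apache.drill.exec.planner.logical.DrillScanRel;
import org.apache.drill.exec.planner.logical.RelOptHelper;
import org.apache.drill.exec.store.StoragePluginOptimizerRule;
import org.apache.drill.exec.store.hive.HiveScan;

public class ConvertHiveScanToNativeScan extends StoragePluginOptimizerRule {

  public static final ConvertHiveScanToNativeScan INSTANCE = new ConvertHiveScanToNativeScan();

  private ConvertHiveScanToNativeScan() {
    super(RelOptHelper.any(DrillScanRel.class), "ConvertHiveScanToNativeScan");
  }

  @Override
  public boolean matches(RelOptRuleCall call) {
    final DrillScanRel scan = call.rel(0);
    return scan.getGroupScan() instanceof HiveScan;   // fire only on Hive scans
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final DrillScanRel hiveScanRel = call.rel(0);
    final HiveScan hiveScan = (HiveScan) hiveScanRel.getGroupScan();

    // Hypothetical helper: build a ParquetGroupScan or EasyGroupScan that points
    // at the Hive table's storage location(s).
    final GroupScan nativeScan = nativeScanFor(hiveScan);

    final DrillScanRel nativeScanRel = new DrillScanRel(hiveScanRel.getCluster(),
        hiveScanRel.getTraitSet(), hiveScanRel.getTable(), nativeScan,
        hiveScanRel.getRowType(), hiveScanRel.getColumns());

    // A project would be layered on top here to rename columns to the DDL names
    // and to convert INT96 timestamps, as described in the bullets above.
    call.transformTo(nativeScanRel);
  }

  private GroupScan nativeScanFor(HiveScan hiveScan) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
{code}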
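
For the second approach, the shape of the proposed hook would look roughly like the following
(written here as {{getRecordReaders()}} returning a list, since it may produce one reader per split).
None of this exists today; the Drill and Hadoop types are stubbed out as empty interfaces just so
the sketch stands on its own, and the real patch would of course use the actual classes.

{code:java}
// Sketch of the proposed getRecordReaders() hook. Drill/Hadoop types are stubbed
// as empty interfaces so the example is self-contained; the actual patch would use
// Drill's RecordReader/FragmentContext and org.apache.hadoop.mapred.InputSplit.
import java.util.ArrayList;
import java.util.List;

interface RecordReader {}        // stand-in for Drill's RecordReader
interface FragmentContext {}     // stand-in for Drill's FragmentContext
interface InputSplit {}          // stand-in for org.apache.hadoop.mapred.InputSplit

class HiveSubScan {
  protected final List<InputSplit> splits = new ArrayList<>();

  /** Proposed hook: the default implementation returns Hive's SerDe-based readers. */
  public List<RecordReader> getRecordReaders(FragmentContext context) {
    List<RecordReader> readers = new ArrayList<>();
    for (InputSplit split : splits) {
      readers.add(createHiveSerDeReader(split, context));   // today's SerDe path
    }
    return readers;
  }

  protected RecordReader createHiveSerDeReader(InputSplit split, FragmentContext ctx) {
    return new RecordReader() {};   // placeholder for HiveRecordReader
  }
}

class HiveNativeDrillSubScan extends HiveSubScan {
  /** Override: hand the same splits to Drill's native Parquet/text readers instead. */
  @Override
  public List<RecordReader> getRecordReaders(FragmentContext context) {
    List<RecordReader> readers = new ArrayList<>();
    for (InputSplit split : splits) {
      // For Parquet, map the split to a row group first (see the next sketch),
      // then create Drill's native reader for that row group.
      readers.add(createNativeReader(split, context));
    }
    return readers;
  }

  protected RecordReader createNativeReader(InputSplit split, FragmentContext ctx) {
    return new RecordReader() {};   // placeholder for ParquetRecordReader / text reader
  }
}
{code}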
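
For the last point, the split-to-row-group mapping could follow the same idea as the Hive code
linked above: read the Parquet footer and pick the row groups whose starting offset falls inside
the split's byte range. A sketch is below; the Parquet package names are from recent parquet-mr,
and the exact overlap condition Hive uses may differ slightly from the one shown here.

{code:java}
// Sketch: map a FileSplit's byte range to Parquet row-group numbers by reading the
// footer. The check below (row group starts inside the split range) mirrors the
// general idea of Hive's ParquetRecordReaderWrapper, not necessarily its exact logic.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class SplitToRowGroups {

  /** Returns the indexes of the row groups whose start offset lies in [start, start + length). */
  public static List<Integer> rowGroupsForSplit(Configuration conf, Path file,
                                                long start, long length) throws IOException {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    List<BlockMetaData> blocks = footer.getBlocks();

    List<Integer> rowGroups = new ArrayList<>();
    long end = start + length;
    for (int i = 0; i < blocks.size(); i++) {
      long blockStart = blocks.get(i).getStartingPos();
      if (blockStart >= start && blockStart < end) {
        rowGroups.add(i);   // this row group is handled by the given split
      }
    }
    return rowGroups;
  }
}
{code}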

> [Umbrella] Plan reads of Hive tables as native Drill reads when a native reader for the
underlying table format exists
> ----------------------------------------------------------------------------------------------------------------------
>                 Key: DRILL-3209
>                 URL: https://issues.apache.org/jira/browse/DRILL-3209
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization, Storage - Hive
>            Reporter: Jason Altekruse
>            Assignee: Jason Altekruse
>             Fix For: 1.2.0
> All reads against Hive are currently done through the Hive SerDe interface. While this
provides the most flexibility, the API is not optimized for maximum performance while reading
the data into Drill's native data structures. For Parquet and text file backed tables, we
can plan these reads as Drill native reads. Currently, reads of these file types provide untyped
data. While Parquet has type metadata in the file, we currently do not make use of that information
while planning. For text files we read all of the files as lists of varchars. In both of these
cases, casts will need to be injected to provide the same datatypes provided by the reads
through the SerDe interface.
