drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Westin <chriswesti...@gmail.com>
Subject Re: [DISCUSS] Insert into Table support
Date Wed, 05 Aug 2015 19:02:55 GMT
Re #7 in the original post "Select table syntax can specify constant values
for one or more columns:"
I would have assumed the select list can have any expressions that can be
evaluated on a row from the source; that includes columns, expressions on
columns, or constants. It's probably not your intent, but the stated form
implies that all I get are column values and constants. Which is it?

On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <baid.mehant@gmail.com> wrote:

> I wanted to start a conversation around supporting the "Insert into Table"
> feature. As of 1.2 we initially want to support inserting into a table with
> Parquet files. Support for Json, CSV and other sources will follow as
> future enhancements.
>
> Aman, Jinfeng, Neeraja and I had an initial discussion about this and
> Neeraja provided a good summary of our discussion (pasted below) also
> stating some of the requirements for this feature.
>
>  A ) Support Insert into a non-partitioned table
> ---------------------
>
> Ex: INSERT INTO T1 [col1, col2, col3]  SELECT col4, col5, col6 from T2
> (Source table: T2, Target table T1)
> Requirements:
>
> 1. Target table column list specification is optional for Insert statement
> 2. When specified, the column list in the Insert statement should
>    contain all the columns present in the target table (i.e No support
>    for partial insert)
> 3. The column names specified for the source table do not need to match
>    to the target table column names. Match is performed based on ordinal.
> 4.   # of Source table columns specified must be same as # of target
>    table columns
> 5. Types of specified source table columns must match to the types of
>    target table columns
> 6. Specification of * is not allowed in the Select table syntax
> 7. Select table syntax can specify constant values for one or more columns
>
>
>  B ) Support insert into a partitioned table
> ----------------------
>
> Ex: INSERT INTO T1 col1, col2,col3  partition by col1,col2 SELECT
> col4,col,col6 from T2
>
>  * Target column specification is required when inserting data into an
>    already partitioned table
>  * Requirements A.3-A.7 above apply for insert into partitioned tables
>    as well
>  * A partition by clause along with one or more columns is required
>  * All the columns specified in partition by clause must exist in the
>    target column list
>  * Partition by columns specified do not need to match to the list of
>    columns that the original table partitioned with (i.e if the
>    original table is partitioned with col1, col2,  new data during
>    insert can be partitioned by col3 or just with col1 or col2..)
>
>
> Couple of open questions from the design perspective are
>
> 1. How do we perform validation. Validation of data types, number of
> columns being inserted etc. In addition to validation we need to make sure
> that when we insert into an existing tables we insert data with the
> existing column names (select column list can have different names). This
> poses problems around needing to know the metadata at planning time, two
> approaches that have been floating around are
>         * DotDrill files: We can store metadata, partitioning columns and
> other useful information here and we can perform validation during planning
> time. However the challenges with introducing DotDrill files include
>              - consistency between metadata and the actual data (Nothing
> preventing users to copy files directly).
>              - security around DotDrill files (can be dealt in the same
> way we perform security checks for drill tables in hdfs)
>              - interface to change the DotDrill file, in the case we need
> to add a column to the table or add a new partition etc.
>
>         * Explicit Syntax/ No metadata approach: Another approach is to
> avoid DotDrill files and use explicit syntax to glean as much information
> as possible from the SQL statement itself. Some of the challenges with this
> approach are
>              - Gathering metadata information: Since we have no idea what
> the existing schema is we would need to perform a "mini scan" to learn the
> schema at planning time to be able to perform some validation. The problem
> with this approach is how do we determine how many files we need to read in
> order to learn the schema? If we use a sample set and not all the files
> have the same schema,
>                 we could have non-deterministic results based on the
> sample of files read. Also reading all the files and merging the schema
> seems like an expensive cost to pay.
>              - From the user's perspective, while inserting into a
> partitioned table, user will have to specify the partitioning columns again
> in the Insert statement, despite having specified the partition columns in
> the CTAS.
>
> 2. What is a reasonable assumption for a Drill table in terms of changing
> schema. Having the same exact schema for all files in a table is too rigid
> an assumption at this point?
>
> One thing to remember with DotDrill file is to also the repercussions on
> Drop table, Show tables, Describe table etc. i.e. it might make it easier
> to be able to support these operations.
>
> Thanks
> Mehant
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message