drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aman Sinha <asi...@maprtech.com>
Subject Re: [DISCUSS] Insert into Table support
Date Wed, 05 Aug 2015 22:01:18 GMT
@Jacques..
Since the topic of  metadata caching is closely related, the main issue is
how is the metadata file maintained in the presence of either concurrent
INSERTs or concurrent REFRESH METADATA operations ?  One could maintain
multiple versions of the metadata file or keep the version information
inside a single metadata file.  Is that what you were thinking when you
mention Lucene's index versioning ? (I am not familiar with lucene's
versioning).

Aman

On Wed, Aug 5, 2015 at 12:02 PM, Chris Westin <chriswestin42@gmail.com>
wrote:

> Re #7 in the original post "Select table syntax can specify constant values
> for one or more columns:"
> I would have assumed the select list can have any expressions that can be
> evaluated on a row from the source; that includes columns, expressions on
> columns, or constants. It's probably not your intent, but the stated form
> implies that all I get are column values and constants. Which is it?
>
> On Mon, Jul 27, 2015 at 5:40 PM, Mehant Baid <baid.mehant@gmail.com>
> wrote:
>
> > I wanted to start a conversation around supporting the "Insert into
> Table"
> > feature. As of 1.2 we initially want to support inserting into a table
> with
> > Parquet files. Support for Json, CSV and other sources will follow as
> > future enhancements.
> >
> > Aman, Jinfeng, Neeraja and I had an initial discussion about this and
> > Neeraja provided a good summary of our discussion (pasted below) also
> > stating some of the requirements for this feature.
> >
> >  A ) Support Insert into a non-partitioned table
> > ---------------------
> >
> > Ex: INSERT INTO T1 [col1, col2, col3]  SELECT col4, col5, col6 from T2
> > (Source table: T2, Target table T1)
> > Requirements:
> >
> > 1. Target table column list specification is optional for Insert
> statement
> > 2. When specified, the column list in the Insert statement should
> >    contain all the columns present in the target table (i.e No support
> >    for partial insert)
> > 3. The column names specified for the source table do not need to match
> >    to the target table column names. Match is performed based on ordinal.
> > 4.   # of Source table columns specified must be same as # of target
> >    table columns
> > 5. Types of specified source table columns must match to the types of
> >    target table columns
> > 6. Specification of * is not allowed in the Select table syntax
> > 7. Select table syntax can specify constant values for one or more
> columns
> >
> >
> >  B ) Support insert into a partitioned table
> > ----------------------
> >
> > Ex: INSERT INTO T1 col1, col2,col3  partition by col1,col2 SELECT
> > col4,col,col6 from T2
> >
> >  * Target column specification is required when inserting data into an
> >    already partitioned table
> >  * Requirements A.3-A.7 above apply for insert into partitioned tables
> >    as well
> >  * A partition by clause along with one or more columns is required
> >  * All the columns specified in partition by clause must exist in the
> >    target column list
> >  * Partition by columns specified do not need to match to the list of
> >    columns that the original table partitioned with (i.e if the
> >    original table is partitioned with col1, col2,  new data during
> >    insert can be partitioned by col3 or just with col1 or col2..)
> >
> >
> > Couple of open questions from the design perspective are
> >
> > 1. How do we perform validation. Validation of data types, number of
> > columns being inserted etc. In addition to validation we need to make
> sure
> > that when we insert into an existing tables we insert data with the
> > existing column names (select column list can have different names). This
> > poses problems around needing to know the metadata at planning time, two
> > approaches that have been floating around are
> >         * DotDrill files: We can store metadata, partitioning columns and
> > other useful information here and we can perform validation during
> planning
> > time. However the challenges with introducing DotDrill files include
> >              - consistency between metadata and the actual data (Nothing
> > preventing users to copy files directly).
> >              - security around DotDrill files (can be dealt in the same
> > way we perform security checks for drill tables in hdfs)
> >              - interface to change the DotDrill file, in the case we need
> > to add a column to the table or add a new partition etc.
> >
> >         * Explicit Syntax/ No metadata approach: Another approach is to
> > avoid DotDrill files and use explicit syntax to glean as much information
> > as possible from the SQL statement itself. Some of the challenges with
> this
> > approach are
> >              - Gathering metadata information: Since we have no idea what
> > the existing schema is we would need to perform a "mini scan" to learn
> the
> > schema at planning time to be able to perform some validation. The
> problem
> > with this approach is how do we determine how many files we need to read
> in
> > order to learn the schema? If we use a sample set and not all the files
> > have the same schema,
> >                 we could have non-deterministic results based on the
> > sample of files read. Also reading all the files and merging the schema
> > seems like an expensive cost to pay.
> >              - From the user's perspective, while inserting into a
> > partitioned table, user will have to specify the partitioning columns
> again
> > in the Insert statement, despite having specified the partition columns
> in
> > the CTAS.
> >
> > 2. What is a reasonable assumption for a Drill table in terms of changing
> > schema. Having the same exact schema for all files in a table is too
> rigid
> > an assumption at this point?
> >
> > One thing to remember with DotDrill file is to also the repercussions on
> > Drop table, Show tables, Describe table etc. i.e. it might make it easier
> > to be able to support these operations.
> >
> > Thanks
> > Mehant
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message