drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-3534) Insert into table support
Date Tue, 22 Aug 2017 18:18:00 GMT


Paul Rogers edited comment on DRILL-3534 at 8/22/17 6:17 PM:
-------------------------------------------------------------

As John noted, the easiest way to envision {{INSERT INTO}} for big data would be to add new
files to an existing directory.

Fortunately for Drill, we can learn from existing solutions such as [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL]
and [Impala|https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_insert.html].
We should certainly learn from the strengths of those solutions, while trying to avoid any
pitfalls.

At the data level, two variations are possible:

* For file formats that allow it, Drill might append to the file (CSV, TSV, and JSON, for example).
(But note that HDFS does not allow append operations, though MFS and Linux do.)
* For file formats that do not allow appending (e.g. Parquet), adding files to partition
directories is a solution.
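The second variation above can be sketched in a few lines. This is an illustrative sketch only, not Drill code: the function name, the partition layout, and the use of CSV are all assumptions made for the example. An "insert" simply becomes writing a brand-new, uniquely named file into an existing partition directory.

```python
# Illustrative sketch (not Drill code): INSERT INTO as adding a new,
# uniquely named file to a partition directory -- the approach described
# above for formats that do not support appending, such as Parquet.
import os
import uuid

def insert_as_new_file(table_dir: str, partition: str, rows: list[str]) -> str:
    """Write rows as a brand-new file under table_dir/partition."""
    part_dir = os.path.join(table_dir, partition)
    os.makedirs(part_dir, exist_ok=True)
    # A unique name avoids clobbering files written by concurrent inserts.
    path = os.path.join(part_dir, f"part-{uuid.uuid4().hex}.csv")
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")
    return path
```

Note that the unique file name sidesteps name collisions between concurrent writers, but by itself says nothing about when readers see the new file; that is exactly the physical-level question discussed below.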

Drill is schema-on-read, which means that Drill, in general, does not know the schema of data
until it "looks." For a CSV file, this means we don't know the file schema up front. (Although,
for CSV files with headers, Drill could read the header to approximate the schema.) So, it may
fall to the user to append only data that fits the existing schema. Since Drill (but not its
clients) handles schema changes, Drill can deal with (some) files whose schema changes in the
middle of the file, though Tableau and similar xDBC tools cannot.
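The header-based approximation mentioned above can be illustrated with a toy sketch. This is not Drill's actual type-inference logic (which is richer); the function names and the column-count check are assumptions made for the example. It just shows the idea: derive column names from the header, then let the user (or a tool) verify that appended rows at least match that shape.

```python
# Illustrative sketch (not Drill code): approximate a "schema" from a
# CSV header line, then check that a row to be appended fits it.
import csv
import io

def header_schema(csv_text: str) -> list[str]:
    """Return the column names from the first (header) row."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)

def fits_schema(row: list[str], schema: list[str]) -> bool:
    # A user-level sanity check: appended rows must match the column count.
    # (Real type checking would need to inspect the values as well.)
    return len(row) == len(schema)
```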

The difficulties emerge, however, at the physical level: dealing with concurrency in a distributed,
multi-user system. A key contribution of databases is that they have kernels that deal with
this physical layer (and do such a great job that most of us never have to think about the
issues.)

* When appending to a file, what do readers see? Do readers see a possibly incomplete last
record? Do they see an atomic update? (All the inserts or none?)
* What is the guarantee of consistency? What is the expected behavior if a Drillbit crashes
during an INSERT? Is it OK to leave a file half written, or must inserts be atomic?
* What happens if two queries try to insert into the same file? How are inserts synchronized?
(Or, does the last update win? Or, can files simply be overwritten and it is up to the user
to coordinate writes?)
* If the insert involves metadata (Parquet metadata, say), how are the metadata updates synchronized
with the INSERTs to provide a consistent view to readers?
* If an INSERT creates multiple files (as part of a partition directory), what are the semantics
for readers?
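One common answer to the first question above ("do readers see a possibly incomplete last record?") is write-then-rename: write the data to a temporary name, then rename it into place. A rename within one file system is atomic on POSIX systems, so readers see either no file or the complete file, never a partial one. The sketch below is a local-filesystem illustration, not Drill code, and the names are hypothetical.

```python
# Illustrative sketch (not Drill code): publish a data file atomically.
# Readers that look only for final file names never observe a half-written
# file, because os.replace() (POSIX rename) is atomic within a file system.
import os
import tempfile

def atomic_publish(dir_path: str, final_name: str, data: bytes) -> str:
    # mkstemp gives a temporary name (e.g. "tmpXXXXXX") that readers
    # scanning for data files would not match.
    fd, tmp = tempfile.mkstemp(dir=dir_path)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make the bytes durable before publishing
    final = os.path.join(dir_path, final_name)
    os.replace(tmp, final)    # atomic: old listing or complete new file
    return final
```

This addresses single-file visibility only; multi-file inserts, metadata updates, and crash recovery still need a coordination mechanism such as those listed below.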

These issues are not at all at the logical SQL level; they are at the physical level: dealing
with the reality of a concurrent system.

The problems are made more complex because file systems do not provide the semantics needed
for DB-level synchronization. This is why the world needs Oracle (and MySQL and Postgres):
one cannot easily build DB semantics just using file system mechanisms. (Though, many have
tried.)

Possible approaches:

* The traditional solution is to use locking, transactions and the like to synchronize work.
These are (likely) not available to a query engine such as Drill.
* Leverage a meta-store (such as Hive) to handle the synchronization.
* Leverage ZK to handle synchronization of just the insert/read operations (e.g. place a marker
in ZK to "lock" a file for updates.)
* Design a log-based approach, based on multi-version concurrency control (MVCC), that allows
writers to create new versions of files while readers read old versions. The log file could
be the coordination mechanism. An update is not visible until the required entries appear
in the log. There is a large literature on using logs for synchronization.
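The log-based idea in the last bullet can be sketched in miniature. This is a toy in-memory model, not a real implementation (a real log would itself need durable, atomic appends): writers add data files freely, but a file becomes visible only once a commit record for it appears in the log, and readers work from a snapshot of the log, so they keep seeing the old version while a writer prepares a new one.

```python
# Illustrative toy model (not Drill code) of log-based visibility:
# a file "exists" for readers only after its commit record is in the log.
class CommitLog:
    def __init__(self):
        self._entries: list[str] = []  # ordered commit records (file names)

    def commit(self, file_name: str) -> None:
        # Appending the commit record is the atomic "publish" step.
        self._entries.append(file_name)

    def snapshot(self) -> list[str]:
        # Readers resolve the table to exactly the files committed so far.
        return list(self._entries)

log = CommitLog()
log.commit("part-0001.parquet")
snap = log.snapshot()              # a reader's consistent view
log.commit("part-0002.parquet")    # in-flight insert: not in `snap`
```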

All this said, it is worth considering another proposition: that Drill is a query engine,
not a DB. In this view, Drill should do nothing that requires DB-like concurrency semantics;
Drill should be purely read-only. Accordingly, Drill should not attempt to add {{INSERT}}
or {{UPDATE}} semantics: that is the proper domain of other tools. (With the logical consequence
that Drill should not be used for ETL.) It is worth debating whether that is a viable approach,
or whether users pretty much need the {{INSERT INTO}} semantics (with all the resulting complexity.)

Let the brainstorming begin!



> Insert into table support
> -------------------------
>
>                 Key: DRILL-3534
>                 URL: https://issues.apache.org/jira/browse/DRILL-3534
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Mehant Baid
>             Fix For: Future
>
>
> Umbrella JIRA to track the "Insert into table" feature. More details regarding the scope,
design etc will follow as things start to materialize. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
