impala-dev mailing list archives

From Tim Wood <>
Subject Re: Data Transfer Between Different Databases
Date Tue, 26 Sep 2017 23:18:21 GMT

Hello Sky,
First, I'm sorry I missed your note the week it came in.

Since your questions can be read from several different perspectives, I'll
just share a few general ideas and suggestions.

There are a few ways to get large volumes of data into Impala.  Several of
them trade preparation time and effort up front for better load
performance, typically with reduced error checking.  A series of single-row
INSERT statements is inefficient, as you point out, because it does not
amortize the per-query overhead over the volume of data, and it checks
every value of every incoming row.
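
As a rough sketch of the amortization point, the statements below contrast
row-at-a-time inserts with forms that pay the per-query overhead once.  The
table and column names (orders, orders_staging, order_id, item) are
hypothetical and only there for illustration.

```sql
-- Hypothetical table: orders (order_id BIGINT, item STRING).

-- One statement per row: each one pays the planning overhead separately.
INSERT INTO orders VALUES (1, 'widget');
INSERT INTO orders VALUES (2, 'gadget');

-- Several rows in one statement: the overhead is paid once for all rows.
INSERT INTO orders VALUES
  (1, 'widget'),
  (2, 'gadget'),
  (3, 'gizmo');

-- Bulk copy from a (hypothetical) staging table that already holds the data.
INSERT INTO orders
SELECT order_id, item
FROM orders_staging;
```

Even the multi-row VALUES form writes a small data file per statement, so
for real volume the INSERT ... SELECT form or one of the file-based
approaches is usually the better fit.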

It's not clear which imperfections of Sqoop you are referring to.  However,
you can load data into HDFS with Sqoop, then define an Impala schema on top
of it after the fact.  If you know your complete schema and have high
confidence it fits the data you loaded, you can use CREATE TABLE ...
LOCATION ... to make the new definition point to the newly-loaded files.
If you load partitioned data, you can follow that statement with ALTER
TABLE ... RECOVER PARTITIONS, and Impala will find the new partition
directories and bind them to the table.
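
A minimal sketch of that sequence might look like the following.  The HDFS
path, table name, columns, and partition key are all hypothetical, and it
assumes Sqoop wrote comma-delimited text files into directories named like
/user/etl/orders/order_year=2017/.  The EXTERNAL keyword is my own choice
here, so that dropping the table leaves the loaded files in place.

```sql
-- Hypothetical schema over files that Sqoop has already written to HDFS.
CREATE EXTERNAL TABLE orders (
  order_id BIGINT,
  item     STRING
)
PARTITIONED BY (order_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/etl/orders';

-- Scan the table's directory for partition subdirectories
-- (order_year=...) and register them with the table.
ALTER TABLE orders RECOVER PARTITIONS;

-- Optional check that the partitions were picked up.
SHOW PARTITIONS orders;
```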

Impala has a limited ability to discover a schema for loaded data, if the
destination format contains enough metadata.  For example, you could load
data into HDFS in Parquet format, then issue CREATE TABLE ... LIKE PARQUET
..., referencing the new files, and Impala will build that table's metadata
from the files.  Column types would be limited to those representable in
Parquet, and Parquet is the only format for which Impala implements this.
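
For illustration, a statement of that shape could look like this.  The
paths, file name, and table name are hypothetical.  Note that LIKE PARQUET
only supplies the column definitions, so the sketch also includes STORED AS
PARQUET and a LOCATION to make the table actually read the loaded files.

```sql
-- Derive column names and types from one of the loaded Parquet files
-- (hypothetical path), and point the table at the directory holding them.
CREATE EXTERNAL TABLE orders_parquet
  LIKE PARQUET '/user/etl/orders_parquet/part-00000.parquet'
  STORED AS PARQUET
  LOCATION '/user/etl/orders_parquet';

-- Inspect the schema Impala inferred.
DESCRIBE orders_parquet;
```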

Finally, the LOAD DATA command allows you to populate already-created
tables in Impala with data from files *already stored in HDFS*. LOAD
DATA does not populate tables from arbitrary files in the OS filesystem.
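
As a sketch, with hypothetical paths and table names, and assuming the
files match the format the destination table was created with:

```sql
-- Move a file that is already in HDFS into an existing unpartitioned table.
LOAD DATA INPATH '/user/etl/staging/orders_2017.csv'
INTO TABLE orders_text;

-- For a partitioned table, name the destination partition explicitly.
LOAD DATA INPATH '/user/etl/staging/orders_2017'
INTO TABLE orders PARTITION (order_year=2017);
```

LOAD DATA moves the files rather than copying them, so they no longer
appear at the original HDFS path afterwards.  A file on a local disk would
first have to be copied into HDFS by some other means.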

Hope this helps!

---------- Forwarded message ----------
> From: sky <>
> Date: Wed, Sep 13, 2017 at 7:08 PM
> Subject: Data Transfer Between Different Databases
> To: "" <>
> Hi all,
>     How does impala interact data with other relational databases?
>  Sqoop's functionality is not perfect, and in impala, each insert has 100ms
> query plan overhead. Are there any other easy ways to interact ?
