hive-user mailing list archives

From Dmitry Goldenberg <dgoldenb...@hexastax.com>
Subject Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?
Date Tue, 04 Apr 2017 17:53:45 GMT
Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a few tables to
achieve the effect of actual insertion of records. Is it not the case that the LOAD causes
the data to get inserted into Hive?

Based on that I'd like to understand whether we can get away with using LOAD INPATH instead
of INSERT/SELECT FROM.
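
For reference, the LOAD form under discussion might look like the sketch below. It assumes the staged files are already Parquet files whose columns match mytable; the HDFS path and the partition values are hypothetical. As the reply quoted below points out, the statement only moves files into the partition directory and does no format conversion, so delimited files loaded this way would not be readable through the Parquet table.

```sql
-- Assumption: /staging/2017-04-04/article is a hypothetical HDFS directory
-- that already holds Parquet files matching db.mytable's column layout.
-- Every partition column must be given an explicit value (static partition).
LOAD DATA INPATH '/staging/2017-04-04/article'
INTO TABLE db.mytable
PARTITION (`date` = '2017-04-04', content_type = 'article');

-- Check that the partition was registered.
SHOW PARTITIONS db.mytable;
```

Each LOAD statement targets exactly one partition, so a day with several content types would need one statement per (date, content_type) pair.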

> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkovitz@paypal.com> wrote:
> 
> I just want to verify that you understand the following:
>  
> - LOAD DATA INPATH is just an HDFS file movement operation.
>   You can achieve the same results by using hdfs dfs -mv …
>  
> - LOAD DATA LOCAL INPATH is just a file copying operation from the local file system to HDFS.
>   You can achieve the same results by using hdfs dfs -put …
>  
>  
> From: Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com] 
> Sent: Tuesday, April 04, 2017 7:48 PM
> To: user@hive.apache.org
> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?
>  
> Dudu,
>  
> This is still in design stages, so we have a way to get the data from its source. The data is *not* in the Parquet format. It's up to us to format it in the best and most efficient way. We can roll with CSV or Parquet; ultimately the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.
>  
> Thanks,
> - Dmitry
>  
> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkovitz@paypal.com> wrote:
> Are your files already in Parquet format?
>  
> From: Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com] 
> Sent: Tuesday, April 04, 2017 7:03 PM
> To: user@hive.apache.org
> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?
>  
> Thanks, Dudu.
>  
> Just to reiterate: the way I'm reading your response is that yes, we can use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the delimited file is properly formatted. Then we can LOAD it into the table (mytable in my example) directly and avoid the creation of the temp table (origtable in my example). Correct so far?
>  
> I did not quite follow the latter part of your response:
> >> You should only create an external table which is an interface to read the files and use it in an INSERT operation.
>  
> My assumption was that we would LOAD INPATH and not have to use INSERT at all. Am I missing something in grokking this latter part of your response?
>  
> Thanks,
> - Dmitry
>  
> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkovitz@paypal.com> wrote:
> Since LOAD DATA INPATH only moves files, the answer is very simple.
> If your files are already in a format that matches the destination table (storage type, number and types of columns, etc.), then yes; if not, then no.
>  
> But –
> You don’t need to load the files into an intermediary table.
> You should only create an external table, which is an interface to read the files, and use it in an INSERT operation.
>  
> Dudu
>  
> From: Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com] 
> Sent: Tuesday, April 04, 2017 4:52 PM
> To: user@hive.apache.org
> Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?
>  
> We have a table such as the following defined:
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)  -- partition column types assumed; the original post omitted them
> STORED AS PARQUET;
> 
> Currently we insert data into this PARQUET, PARTITIONED table as follows, using an intermediary table:
> 
> INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
> SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
> FROM db.origtable
> WHERE `date` = "${SELECTED_DATE}"
> GROUP BY item_id, `date`, content_type;
> 
> Our question is, would it be possible to use the LOAD DATA INPATH ... INTO TABLE syntax to load the data from delimited data files into 'mytable' rather than populating mytable from the intermediary table?
>  
> I see in the Hive documentation that:
> * Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
> * If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
>  
> This seems to indicate that using LOAD is possible; however, looking at this discussion: http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables, perhaps not?
>  
> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED tables is possible and if so, then how does one go about using LOAD in that case?
>  
> Thanks,
> - Dmitry
>  
>  
>  
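
The suggestion made earlier in the thread (an external table that only reads the delimited files, used as the source of an INSERT) could be sketched roughly as follows. This is an illustration, not the poster's actual setup: the staging table name, its column layout, the comma delimiter, and the HDFS location are all assumptions.

```sql
-- Hypothetical external table over the raw delimited files.
-- Name, columns, delimiter, and location are assumptions for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS db.mytable_staging (
  `item_id` string,
  `timestamp` string,
  `item_comments` string,
  `date` string,
  `content_type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/staging/mytable_csv';

-- Allow Hive to derive the partition values from the selected rows.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The INSERT rewrites the delimited rows as Parquet in the target table.
-- The partition columns must come last in the SELECT list.
INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT item_id, `timestamp`, item_comments, `date`, content_type
FROM db.mytable_staging;
```

Because the external table is only metadata over files already sitting in HDFS, no data is copied into an intermediary table; the INSERT is the single step that converts the rows to Parquet and places them in partitions.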
