hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Goldenberg <dgoldenb...@hexastax.com>
Subject Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?
Date Thu, 06 Apr 2017 21:19:05 GMT
I'm assuming, given this:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date`, `content_type`)
STORED AS PARQUET;

we'd have to organize the input Parquet files into subdirectories where
each subdirectory contains data just for the given 'date' (YYMMDD), then
within that subdirectory, content would be organized by content_type, one
file per content_type value.  How does Hive make the association of a
partition with a subdirectory naming or know to look for files for
content_type, and how would it match content_type='Presentation' -- would
the file just need to be named "Presentation"?


On Thu, Apr 6, 2017 at 5:05 PM, Dmitry Goldenberg <dgoldenberg@hexastax.com>
wrote:

> >> properly split and partition your data before using LOAD if you want
> hive to be able to find it again.
>
> If the destination table is defined as
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date`, `content_type`)
> STORED AS PARQUET;
>
> and supposing that we have the data "in hand" (in memory or as CSV files)
> how does one go about the 'proper split and partition' so it adheres to:
> PARTITIONED BY (`date`, `content_type`)  ?
>
> Thanks
>
>
> On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris <Ryan.Harris@zionsbancorp.com
> > wrote:
>
>> “If we represent our data as delimited files” ….the question is how you
>> plan on getting your data into these parquet files since it doesn’t sound
>> like your data is already in that format….
>>
>>
>>
>> If your data is not already in parquet format, you are going to need to
>> run **some** process to get it into that format…why not just use hive
>> (running a query on an external table) to perform the conversion?
>>
>>
>>
>> “and Hive represents it as Parquet internally” That entirely depends on
>> the declared STORED AS format when you define the table.  The files backing
>> the hive table **could** be TEXT, sequence, RC, ORC, Parquet…  If you
>> declared the table to be backed by delimited text, you could format your
>> data into standard text files (not parquet) and then add the data to the
>> hive table using LOAD DATA.
>>
>>
>>
>> So, why NOT use text data for the table storage?  There is no way to
>> optimize future queries against that data.
>>
>>
>>
>> One hypothetical workflow assuming that your data is currently delimited….
>>
>>
>>
>> You could either have a hive managed table, with the table data stored as
>> TEXTFILE using some delimiter based SerDe, and you could then use LOAD DATA
>> to put your original raw files into this table.   OR, you could use an
>> external table (not managed by hive) to point to the data wherever it
>> currently resides.  (The only difference between the two choices here is
>> whether the original raw files end up in ‘/user/hive/warehouse/tablename’
>> or the current HDFS path where they reside.
>>
>>
>>
>> From there, you could query FROM that temp table, INSERT into your final
>> destination table, and the data will be formatted according to the data
>> definition of your destination table.
>>
>>
>>
>>
>>
>> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
>> original data directly into your final destination table you must
>>
>> 1)      Ensure that the data is formatted into parquet files that are
>> compatible with the version of hive that you are running.   The parquet
>> format has been used by a number of different projects, unfortunately there
>> are different versions of parquet and it cannot be taken for granted that
>> any parquet file will be compatible with the version of hive you are
>> using.  Testing and validation is required…see
>> https://github.com/Parquet/parquet-compatibility
>>
>> 2)      Parquet files have internal partitioning to them, but from
>> hive’s perspective, hive partitions will still be separated into individual
>> directories.  You’ll need to ensure that you properly split and partition
>> your data before using LOAD if you want hive to be able to find it again.
>>
>>
>>
>> It doesn’t sound like your source data is currently formatted to match
>> your hive table formatting.  If you are already processing the data with a
>> spark pipeline and you just happened to set the output of that processing
>> to be delimited text and you can just as easily change it to something that
>> is compatible with your hive table….then that may make sense to do.
>> However, if you are going to require a separate processing step to convert
>> the data from delimited text to hive-compatible parquet, I don’t see a
>> reason to use any tool OTHER than hive to perform that conversion.
>>
>>
>>
>> LOAD DATA is generally used in situations where you **know** that the
>> data format is already 100% exactly compatible with your destination
>> table….which most often occurs when the source of the data is the raw data
>> backing an existing hive managed table (possibly copied/moved from a
>> different cluster).
>>
>>
>>
>>
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com]
>> *Sent:* Thursday, April 06, 2017 6:48 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> [External Email]
>> ------------------------------
>>
>> Thanks, Ryan.
>>
>>
>>
>> I was actually more curious about scenario B. If we represent our data as
>> delimited files, why don't we just use LOAD DATA INPATH and load it right
>> into the final, parquet, partitioned table in one step, bypassing dealing
>> with the temp table?
>>
>>
>>
>> Are there any advantages to having a temp table besides the validation?
>> One advantage could possibly be making it a transactional table and being
>> able to run direct INSERT's into the temp table, avoiding having to deal
>> with delimited files and LOAD DATA INPATH.
>>
>>
>>
>> If we go with route B, LOAD DATA INPATH directly into the parquet,
>> partitioned table, would we have to:
>>
>>
>>
>> 1) represent the input files as Parquet? - it looks like the data is
>> still delimited, and Hive represents it as Parquet internally
>>
>> 2) do anything specific in the input files / with the input files in
>> order to make partitioning work, or does Hive just take the data and take
>> full care of partitioning it?
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <Ryan.Harris@zionsbancorp.com>
>> wrote:
>>
>> For A) I’d recommend mapping an EXTERNAL table to the raw/original source
>> files…then you can just run a SELECT query from the EXTERNAL source and
>> INSERT into your destination.
>>
>>
>>
>> LOAD DATA can be very useful when you are trying to move data between two
>> tables that share the same schema but 1 table is partitioned and the other
>> table is NOT partitioned…once the files have been inserted into the
>> unpartitioned table the source files from the hive warehouse can be added
>> to the partitioned table using LOAD DATA.  Another place I’ve frequently
>> used LOAD DATA is when synchronizing hive table data between two clusters,
>> the hive warehouse data files can be copied from one cluster to the other
>> with distcp and then loading the data flies to the duplicate cluster using
>> LOAD DATA to ensure the metadata is recorded in hive metastore.
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 3:31 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> [External Email]
>> ------------------------------
>>
>> Right, that makes sense, Dudu.
>>
>>
>>
>> So basically, if we have our data in "some form", and a goal of loading
>> it into a parquet, partitioned table in Hive, we have two choices:
>>
>>
>>
>> A. Load this data into a temporary table first. Presumably, for this we
>> should be able to do a LOAD INPATH, from delimited data files. Perhaps we
>> could designate the temp table as transactional and then simply do direct
>> INSERT's into this temp table - ? Then, as the second step, we'd do an
>> INSERT... SELECT, to move the data into the destination table, and then
>> DROP the temp table.
>>
>>
>>
>> B. Represent the data as a delimited format and do a LOAD INPATH directly
>> into the destination table. Understandably, we lose the 'data verification'
>> this way. If we go this route, must the data in the input files be in the
>> PARQUET format or in a delimited format?  I would guess, the former.  And,
>> how does partitioning play into it?  How would the input data need to be
>> organized and inserted so as to adhere to the partitions (the 'date' and
>> 'content-type' columns, in my example)?
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> “LOAD” is very misleading here. it is all in done the metadata level.
>>
>> The data is not being touched. The data in not being verified. The
>> “system” does not have any clue if the flies format match the table
>> definition and they can be actually used.
>>
>> The data files are being “moved” (again,  a metadata operation) from
>> their current HDFS location to the location defined for the table.
>>
>> Later on when you  query the table the files will be scanned. If there
>> are in the right format you’ll get results. If not, then no.
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 8:54 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
>> on a few tables to achieve the effect of actual insertion of records. Is it
>> not the case that the LOAD causes the data to get inserted into Hive?
>>
>> Based on that I'd like to understand whether we can get away with using
>> LOAD INPATH instead of INSERT/SELECT FROM.
>>
>>
>> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> I just want to verify that you understand the following:
>>
>>
>>
>> ·         LOAD DATA INPATH is just a HDFS file movement operation.
>>
>> You can achieve the same results by using *hdfs dfs -mv …*
>>
>>
>>
>> ·         LOAD DATA LOCAL  INPATH is just a file copying operation from
>> the shell to the HDFS.
>>
>> You can achieve the same results by using *hdfs dfs -put …*
>>
>>
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com
>> <dgoldenberg@hexastax.com>]
>> *Sent:* Tuesday, April 04, 2017 7:48 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> Dudu,
>>
>>
>>
>> This is still in design stages, so we have a way to get the data from its
>> source. The data is *not* in the Parquet format.  It's up to us to format
>> it the best and most efficient way.  We can roll with CSV or Parquet;
>> ultimately the data must make it into a pre-defined PARQUET, PARTITIONED
>> table in Hive.
>>
>>
>>
>> Thanks,
>>
>> - Dmitry
>>
>>
>>
>> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> Are your files already in Parquet format?
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:03 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> Thanks, Dudu.
>>
>>
>>
>> Just to re-iterate; the way I'm reading your response is that yes, we can
>> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
>> the delimited file is properly formatted.  Then we can LOAD it into the
>> table (mytable in my example) directly and avoid the creation of the temp
>> table (origtable in my example).  Correct so far?
>>
>>
>>
>> I did not quite follow the latter part of your response:
>>
>> >> You should only create an external table which is an interface to
>> read the files and use it in an INSERT operation.
>>
>>
>>
>> My assumption was that we would LOAD INPATH and not have to use INSERT
>> altogether.  Am I missing something in groking this latter part of your
>> response?
>>
>>
>>
>> Thanks,
>>
>> - Dmitry
>>
>>
>>
>> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkovitz@paypal.com>
>> wrote:
>>
>> Since LOAD DATA INPATH  only moves files the answer is very simple.
>>
>> If you’re files are already in a format that matches the destination
>> table (storage type, number and types of columns etc.) then – yes and if
>> not, then – no.
>>
>>
>>
>> But –
>>
>> You don’t need to load the files into intermediary table.
>>
>> You should only create an external table which is an interface to read
>> the files and use it in an INSERT operation.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 4:52 PM
>> *To:* user@hive.apache.org
>> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
>> STORED AS PARQUET table?
>>
>>
>>
>> We have a table such as the following defined:
>>
>> CREATE TABLE IF NOT EXISTS db.mytable (
>>   `item_id` string,
>>   `timestamp` string,
>>   `item_comments` string)
>> PARTITIONED BY (`date`, `content_type`)
>> STORED AS PARQUET;
>>
>> Currently we insert data into this PARQUET, PARTITIONED table as follows,
>> using an intermediary table:
>>
>> INSERT INTO TABLE db.mytable PARTITION(date, content_type)
>> SELECT itemid as item_id, itemts as timestamp, date, content_type
>> FROM db.origtable
>> WHERE date = “${SELECTED_DATE}”
>> GROUP BY item_id, date, content_type;
>>
>> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
>> TABLE syntax to load the data from delimited data files into 'mytable'
>> rather than populating mytable from the intermediary table?
>>
>>
>>
>> I see in the Hive documentation that:
>>
>> * Load operations are currently pure copy/move operations that move
>> datafiles into locations corresponding to Hive tables.
>>
>> * If the table is partitioned, then one must specify a specific partition
>> of the table by specifying values for all of the partitioning columns.
>>
>>
>>
>> This seems to indicate that using LOAD is possible; however looking at
>> this discussion: http://grokbase.com/t/hive/user/114frbfg0y/can-i
>> -use-hive-dynamic-partition-while-loading-data-into-tables
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__grokbase.com_t_hive_user_114frbfg0y_can-2Di-2Duse-2Dhive-2Ddynamic-2Dpartition-2Dwhile-2Dloading-2Ddata-2Dinto-2Dtables&d=DwMFaQ&c=9WYoWBgz3TbmQlstBqb6LDRA8PY_DPmoAS0YWoTLU-g&r=_W3sXrqd7teXL8R6ey10dgFH1GT5KbehFX_EaUG41XM&m=w2-Xt3zXd67KWRPyy83l4Kn5EWquC767DmMpcE5RpgI&s=01kme5ZDH2EBjzLWRz6kJ5jQ9vxr-IzFeNepynsQ7-M&e=>,
>> perhaps not?
>>
>>
>>
>> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
>> tables is possible and if so, then how does one go about using LOAD in that
>> case?
>>
>>
>>
>> Thanks,
>>
>> - Dmitry
>>
>>
>>
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
>> CONFIDENTIAL and may contain information that is privileged and exempt from
>> disclosure under applicable law. If you are neither the intended recipient
>> nor responsible for delivering the message to the intended recipient,
>> please note that any dissemination, distribution, copying or the taking of
>> any action in reliance upon the message is strictly prohibited. If you have
>> received this communication in error, please notify the sender immediately.
>> Thank you.
>>
>>
>> ------------------------------
>> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
>> CONFIDENTIAL and may contain information that is privileged and exempt from
>> disclosure under applicable law. If you are neither the intended recipient
>> nor responsible for delivering the message to the intended recipient,
>> please note that any dissemination, distribution, copying or the taking of
>> any action in reliance upon the message is strictly prohibited. If you have
>> received this communication in error, please notify the sender immediately.
>> Thank you.
>>
>
>

Mime
View raw message