hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasad Nagaraj Subramanya <prasadn...@gmail.com>
Subject Re: HIVE Parquet column names issue
Date Fri, 26 Jan 2018 16:29:13 GMT
Hi Brandon,

Have you tried creating an external table with the required names for
parquet -

CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
    `release` STRING,
    `customer` STRING,
    `cookie` STRING,
    `category` STRING,
    `end_time` STRING,
    `start_time` STRING,
    `first_name` STRING,
    `email` STRING,
    `phone` STRING,
    `last_name` STRING,
    `site` STRING,
    `source` STRING,
    `subject` STRING,
    `raw` STRING
)
STORED AS PARQUET
LOCATION '${OUTPUT}';


And then inserting data into this table from your csv table -

INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;


This will create a parquet file at the specified location (${OUTPUT})

Thanks,
Prasad

On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke <brandon.cooke@engage.cx>
wrote:

> Hello,
>
> I posted the following on a Cloudera forum but haven’t had much luck.
> I’m hoping someone here can tell me what step I have probably missed:
>
> Hello,
>
>
> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet
> for use in AWS Athena.
> However, no mater what I try the resulting Parquet always has columns
> titles *[_col0, _col1, ..., _colN]*
>
>
> After researching, I read that the line *SET
> parquet.column.index.access=false *was supposed to allow for Parquet to
> use the column titles of my HIVE table; however, it has been unsuccessful
> so far.
>
>
> Below is an example script I use to create the Parquet from data
>
>
>
> SET parquet.column.index.access=false;
>
> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>     `release` STRING,
>     `customer` STRING,
>     `cookie` STRING,
>     `category` STRING,
>     `end_time` STRING,
>     `start_time` STRING,
>     `first_name` STRING,
>     `email` STRING,
>     `phone` STRING,
>     `last_name` STRING,
>     `site` STRING,
>     `source` STRING,
>     `subject` STRING,
>     `raw` STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> LOCATION '${INPUT}';
>
> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
> STORED AS PARQUET
> SELECT *
> FROM EVENTS;
>
>
> Using parquet-tools, I read the resulting file and below is an example
> output:
>
>
> _col0 = 0.1.2
> _col1 = customer1
> _col2 = NULL
> _col3 = api
> _col4 = 2018-01-21T06:57:57Z
> _col5 = 2018-01-21T06:57:56Z
> _col6 = Brandon
> _col7 = brandon@fakesite.com
> _col8 = 999-999-9999
> _col9 = Pompei
> _col10 = Boston
> _col11 = Wifi
> _col12 = NULL
> _col13 = eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>
>
> This is problematic because it is impossible to transfer it to an Athena
> table (or even back to HIVE) without using these index-based column
> titles. I need HIVE's column titles to transfer over to the Parquet file.
>
>
> I've search for a very long time and have come up short. Am I doing
> something wrong?
> Please let me know if I can provide more information. Thank you!
>
> I appreciate your time.
> Sincerely,
>
> Brandon Cooke
> Software Engineer
> engage.cx
> 5500 Interstate N Parkway Suite 130
> <https://maps.google.com/?q=5500+Interstate+N+Parkway+Suite+130&entry=gmail&source=g>
>

Mime
View raw message