hive-user mailing list archives

From Furcy Pin <furcy....@flaminem.com>
Subject Re: Best way to load CSV file into Hive
Date Sun, 01 Nov 2015 09:22:17 GMT
Hi Vijaya,

If you need some nice ETL capabilities, you may want to try
https://github.com/databricks/spark-csv

Among other things, spark-csv lets you read the CSV as is and then create and
insert a copy of the
data into a Hive table in any format you like (Parquet, ORC, etc.).

If you have a header row, it can strip it and use it to get the column names
directly, and it can also perform automatic type detection.
You can specify the delimiter and quote characters, but I did not see an
escape character in the doc.

In the end, it's as easy as:

import org.apache.spark.sql.DataFrame

val df: DataFrame =
  sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")      // Use first line of all files as header
    .option("delimiter", ",")      // Specify delimiter
    .option("quote", "\"")         // Specify quote character
    .option("inferSchema", "true") // Automatically infer data types
    .load("path/to/data.csv")

df.write
  .format("orc")
  .saveAsTable("db_name.table_name")
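
As a quick sanity check after the write (a minimal sketch, assuming the same
sqlContext and the placeholder db_name.table_name from above), you can read the
table back and verify the schema and row count:

val loaded = sqlContext.table("db_name.table_name")
loaded.printSchema()    // prints the column names and types inferred from the CSV header
println(loaded.count()) // row count should match the number of data rows in the CSV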


I believe HDP now supports Spark.




On Sat, Oct 31, 2015 at 10:30 PM, Jörn Franke <jornfranke@gmail.com> wrote:

> You clearly need to escape those characters, as for any other tool. You may
> want to use Avro instead of CSV, XML, JSON, etc.
>
> On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomireddy@whishworks.com> wrote:
>
> Hi,
>
> I have a CSV file which contains a hundred thousand rows and about 200+
> columns. Some of the columns have free-text information, which means they
> might contain characters like commas, colons, quotes, etc. within the column
> content.
>
> What is the best way to load such a CSV file into Hive?
>
> Another serious issue: I have stored the file in a location in HDFS and
> then created an external Hive table on it. However, upon running CREATE
> EXTERNAL TABLE using the HDP Hive View, the original CSV is no longer present
> in the folder where it is meant to be. I am not sure how HDP processes it and
> where it is stored. My understanding was that EXTERNAL tables wouldn't be
> moved from their original HDFS location?
>
> Request someone to help out!
>
>
> Thanks & Regards
> Vijay
>
