hive-user mailing list archives

From Furcy Pin <furcy....@flaminem.com>
Subject Re: CSV file reading in hive
Date Fri, 13 Feb 2015 10:08:54 GMT
Hi Sreeman,

Unfortunately, I don't think Hive's built-in formats can currently read
CSV files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy CSV files myself,
I would recommend writing a MapReduce (or Spark) job
to clean your CSV before giving it to Hive. This is what I did.
Other kinds of issues I have run into include:

   - File not encoded in utf-8, making special characters unreadable for
   Hive
   - Some lines with missing or too many columns, which could shift your
   columns and ruin your stats.
   - Some lines with unreadable characters (probably data corruption)
   - I even got some lines with Java stack traces in them
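
To make the checks above concrete, here is a minimal Python sketch of a
per-line validator. It is only an illustration, not the cleaning job I
actually run: the function name check_line and the expected_columns
parameter are made up for this example, and it assumes one record per
line (no quoted fields spanning several lines).

```python
import csv

def check_line(raw: bytes, expected_columns: int) -> str:
    """Return 'ok', or a short reason why the line should be rejected.

    Hypothetical helper for illustration; assumes one record per line.
    """
    # Issue 1: file not encoded in UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"

    text = text.rstrip("\r\n")

    # Issue 3: unreadable characters (here: control characters except tab).
    if any(ord(ch) < 32 and ch != "\t" for ch in text):
        return "control characters"

    # Issue 2: missing or extra columns. csv.reader understands
    # double-quoted fields, so a comma inside quotes is not a separator.
    fields = next(csv.reader([text]))
    if len(fields) != expected_columns:
        return "wrong column count"

    return "ok"
```

Rejected lines can be written to a separate file with their reason, so
you can see how much data is lost and why. Stray stack traces usually
end up in the "wrong column count" bucket.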

I hope your CSV is cleaner than that. If you have control over how it is
generated, I would recommend replacing your current separator with a tab
(and replacing inline tabs with \t) or something like that.
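
If you cannot change how the file is generated, the same conversion can
be done afterwards. Below is a small Python sketch under a few
assumptions: the input is well-formed CSV with double-quoted fields, the
function name csv_to_tsv is invented for this example, and escaping
embedded newlines as \n is my own addition on top of the tab
replacement.

```python
import csv
import io

def csv_to_tsv(src, dst):
    """Read quoted CSV from src and write tab-separated lines to dst.

    Hypothetical helper for illustration only.
    """
    for row in csv.reader(src):
        cleaned = [
            # Replace inline tabs with the two characters \t so they
            # cannot be mistaken for a field separator.
            # Assumption: embedded newlines are escaped the same way.
            field.replace("\t", "\\t").replace("\n", "\\n")
            for field in row
        ]
        dst.write("\t".join(cleaned) + "\n")

# Usage with the line from the original question:
src = io.StringIO('sree,12345,"payment made,but it is not successful"\n')
dst = io.StringIO()
csv_to_tsv(src, dst)
```

The output can then be loaded into a table declared with FIELDS
TERMINATED BY '\t', and the comma inside the third field no longer
matters.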

There might be some open source tools for data cleaning already out there.
I plan to release mine one day, once I've migrated it to Spark maybe, and
if my company agrees.

If you're lazy, I heard that Dataiku Studio (which has a free version) can
do this sort of thing, though I have never used it myself.

Hope this helps,

Furcy



2015-02-13 7:30 GMT+01:00 Slava Markeyev <slava.markeyev@upsight.com>:

> You can use lazy simple serde with ROW FORMAT DELIMITED FIELDS TERMINATED
> BY ',' ESCAPED BY '\'. Check the DDL for details
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
>
>
>
> On Thu, Feb 12, 2015 at 8:19 PM, Sreeman <sreebalineni@gmail.com> wrote:
>
>>  Hi All,
>>
>> How all of you are creating hive/Impala table when the CSV file has some
>> values with COMMA in between. it is like
>>
>> sree,12345,"payment made,but it is not successful"
>>
>>
>>
>>
>>
>> I know opencsv serde is there but it is not available in lower versions
>> of Hive 14.0
>>
>>
>>
>
>
>
> --
>
> Slava Markeyev | Engineering | Upsight
> Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
>
