hive-user mailing list archives

From "Gerber, Bryan W" <Bryan.Ger...@pnnl.gov>
Subject RE: Loading data containing newlines
Date Tue, 12 Jan 2016 17:58:37 GMT
From that wiki:
"This SerDe works for most CSV data, but does not handle embedded newlines."

The Hive SerDe interface sits entirely downstream of TextInputFormat, which has already split
the input into records at newlines, so the SerDe never sees a whole multi-line field.  In theory
you can give it a different line delimiter, but Hive 1.2.1 does not support it: "FAILED:
SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right now."
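
To illustrate the ordering problem outside Hive, here is a minimal Python sketch (not Hive
code). It assumes a made-up three-line sample, and the helper names split_then_parse and
parse_whole_stream are hypothetical. The first helper cuts on newlines before parsing, roughly
as a line-oriented input format would; the second lets the CSV parser see the whole stream.

```python
import csv
import io

# Hypothetical sample: a header plus two records, the first of which has a
# newline inside a quoted field.
RAW = 'id,comment\n1,"first line\nsecond line"\n2,"plain"\n'


def split_then_parse(text):
    """Cut on '\n' first, then parse each piece in isolation.

    This mimics a line-oriented reader: the parser never sees both halves
    of a quoted field that spans a newline.
    """
    rows = []
    for line in text.split("\n"):
        if line:  # skip the empty piece after the trailing newline
            rows.extend(csv.reader([line]))
    return rows


def parse_whole_stream(text):
    """Hand the parser the whole stream so a quoted field may span newlines."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(len(split_then_parse(RAW)), "rows when split on newlines first")
    print(len(parse_whole_stream(RAW)), "rows when parsed as one stream")
```

Splitting first turns the multi-line record into two fragments, which is the situation a SerDe
is in once the input format has already produced one record per line.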

From: Alexander Pivovarov [mailto:apivovarov@gmail.com]
Sent: Tuesday, January 12, 2016 9:52 AM
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

Try the CSV SerDe. It should correctly parse a quoted field value that contains a newline.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Hadoop should read bz2 files automatically.


On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W <Bryan.Gerber@pnnl.gov> wrote:
We are attempting to load CSV text files (compressed to bz2) containing newlines in fields
using EXTERNAL tables and INSERT/SELECT into ORC format tables.  Data volume is ~1TB/day,
so we are really trying to avoid unpacking the files to condition the data.

A few days of research has us ready to implement custom input/output formats to handle the
ingest.  Are there any other suggestions that would take less effort and have low impact on load times?

Thanks,
Bryan G.
