hive-user mailing list archives

From Залеский Александр Андреевич <aazal...@mts.ru>
Subject RE: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA
Date Fri, 03 Nov 2017 14:22:40 GMT
Yes, we are storing the data in the ORC files correctly; the problem appears when we read it
via JDBC. We generate the ORC files with the org.apache.orc library and load them into Hive
with the LOAD DATA INPATH command. But when we read them, JDBC does that awful split.

From: Owen O'Malley [mailto:owen.omalley@gmail.com]
Sent: Thursday, November 02, 2017 6:21 PM
To: user@hive.apache.org
Subject: Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

ORC stores the data in UTF-8 with the length of the value stored explicitly. Therefore, it
doesn't do any parsing of newlines.
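
To illustrate why explicit lengths make embedded newlines harmless, here is a minimal Python
sketch. It is a toy encoding, not ORC's actual on-disk layout: the 4-byte length prefix and
the function names encode_length_prefixed and decode_length_prefixed are assumptions made
for this example only.

```python
import struct

def encode_length_prefixed(values):
    """Toy format (NOT ORC's real layout): 4-byte big-endian length, then UTF-8 bytes."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">I", len(data))  # length is stored explicitly
        out += data                          # payload is copied verbatim, never parsed
    return bytes(out)

def decode_length_prefixed(buf):
    """Read values back using only the stored lengths; no delimiter scanning."""
    values = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values

rows = ["line one\r\nline two", "plain"]

# Length-prefixed round trip: both values come back unchanged.
roundtrip = decode_length_prefixed(encode_length_prefixed(rows))

# Newline-delimited round trip: the embedded \r\n is mistaken for a row boundary.
naive = "\n".join(rows).split("\n")
```

With these two rows, the length-prefixed round trip returns the original two values, while the
newline-delimited one yields three fragments. A split like the one described in this thread is
therefore more likely to come from a text-oriented step on the read path than from the ORC
file itself.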

You can see the contents of an ORC file by using:

% hive --orcfiledump -d <path_to_file>

from https://orc.apache.org/docs/hive-ddl.html . How did you load the data into Hive?

... Owen

On Thu, Nov 2, 2017 at 5:29 AM, Залеский Александр Андреевич <aazalesk@mts.ru> wrote:
My problem is reading data that contains a “newline” character from ORC via JDBC. The standard
behavior when reading a string is to split the row at every newline symbol, and that seems
like a bug. Why can't I store arbitrary symbols in my data? Why does JDBC read them as control
symbols? I have opened an issue with Teradata
(https://tays.teradata.com/home/?language=en_US&aidIncidentId=RECHDBRVV), and they advised me
to write my own SerDe. Perhaps this is not a unique task and you have already written such a
SerDe; may I ask for it?
