hive-issues mailing list archives

From "Daniel Doubrovkine (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (HIVE-16351) Hive confused by CR/LFs
Date Mon, 03 Apr 2017 20:51:41 GMT

     [ https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Doubrovkine updated HIVE-16351:
--------------------------------------
    Comment: was deleted

(was: Those NULLs are actually from an empty record. Is that right? Should the last empty
line be ignored, since these are text files, and if so, which component's responsibility is it?


{code}
~$ cat /tmp/test.json 
{"text":"foo\nbar","number":123}
{"text":"bar\nfoo","number":345}
~$ perl -pe 'chomp if eof' /tmp/test.json > /tmp/test2.json
~$ cat /tmp/test2.json 
{"text":"foo\nbar","number":123}
{"text":"bar\nfoo","number":345}~$ 
$ hadoop fs -put -f /tmp/test2.json /user/data/test.json
$ hive
hive> SELECT * FROM test;
OK
foo
bar	123
bar
foo	345
{code})
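
The two-line rows in the output above are consistent with the escaped \n in each JSON string decoding to a real newline inside the value, so a tab-separated printout wraps without the record itself being split. A minimal sketch of that reading follows. It assumes a plain JSON decoder and a tab-joined display; render_row is a hypothetical helper for illustration, not Hive or SerDe code.

```python
import json


def render_row(json_line, columns):
    """Decode one JSON record and join the chosen columns with tabs."""
    record = json.loads(json_line)
    return "\t".join(str(record[name]) for name in columns)


# Same two records as /tmp/test.json; "\\n" is the two-character JSON escape.
rows = [
    '{"text":"foo\\nbar","number":123}',
    '{"text":"bar\\nfoo","number":345}',
]

rendered = [render_row(row, ["text", "number"]) for row in rows]
for line in rendered:
    # Each value holds a real newline after decoding, so one row prints on two lines.
    print(line)
```

Under these assumptions each record is still a single row; only its display spans two lines.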

> Hive confused by CR/LFs
> -----------------------
>
>                 Key: HIVE-16351
>                 URL: https://issues.apache.org/jira/browse/HIVE-16351
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Serializers/Deserializers
>    Affects Versions: 1.2.1
>         Environment: Hadoop 2.7.3
>            Reporter: Daniel Doubrovkine
>
> Hive returns broken rows when the data contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
>   `_id` struct<oid:string>,
>   `channel_id` struct<oid:string>,
>   `exclude_google_news` boolean,
>   `fair_ids` array<map<string,string>>,
>   `hero_section` map<string,string>,
>   `partner_ids` array<map<string,string>>,
>   `description` string,
>   `partner_channel_id` struct<oid:string>,
>   `published` boolean,
>   `published_at` map<string,string>,
>   `slugs` array<string>,
>   `sections` array<map<string,string>>,
>   `thumbnail_image` string,
>   `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
>   a.slugs[SIZE(a.slugs) - 1],
>   a.title,
>   a.thumbnail_image
> FROM
>   positron.articles as a
> WHERE a.published = true
> AND a.hero_section["type"] = "video"
> AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
> AND a.thumbnail_image IS NOT NULL;
> {code}
> Note the NULLs below.
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek "Realities" by Marc Gumpinger   https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art Market, Explained: The Rise of the Art Fair   https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek  Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek   https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg NULL
> kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty -r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> What's very interesting is that I am not able to reproduce this locally with the same
> data on any version of Hive. Any help appreciated.
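
Since the problem does not reproduce locally, one way to compare the local copy with the cluster copy of the data is to look for raw line breaks inside records. The sketch below assumes the files are newline-delimited JSON, one record per line. It flags blank lines, lines containing a carriage return, and lines that do not parse as JSON on their own, which is what a record split by a raw LF would look like. audit_json_lines is a hypothetical helper, not part of Hive or the SerDe.

```python
import json


def audit_json_lines(raw):
    """Classify each LF-delimited line of raw JSON-lines bytes."""
    report = {"ok": 0, "blank": [], "has_cr": [], "invalid": []}
    lines = raw.split(b"\n")
    # A single trailing LF terminates the last record; it is not a record itself.
    if lines and lines[-1] == b"":
        lines.pop()
    for number, line in enumerate(lines, start=1):
        if line.strip() == b"":
            report["blank"].append(number)
            continue
        if b"\r" in line:
            report["has_cr"].append(number)
        try:
            json.loads(line.decode("utf-8"))
        except ValueError:
            # Covers both undecodable bytes and malformed JSON.
            report["invalid"].append(number)
        else:
            report["ok"] += 1
    return report


# Record 1 uses the escaped form; record 2 contains a raw LF; a blank line follows.
sample = (
    b'{"text":"foo\\nbar","number":123}\n'
    b'{"text":"bar\nfoo","number":345}\n'
    b"\n"
)
report = audit_json_lines(sample)
print(report)
```

Running this over the bytes of both copies (for example, a file fetched from HDFS and the local one) and comparing the reports would show whether the two really contain the same line breaks.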



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
