hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Doubrovkine (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-16351) Hive confused by CR/LFs
Date Mon, 03 Apr 2017 20:53:41 GMT

     [ https://issues.apache.org/jira/browse/HIVE-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Daniel Doubrovkine updated HIVE-16351:
--------------------------------------
    Attachment: Screen Shot 2017-04-03 at 4.33.23 PM.png

> Hive confused by CR/LFs
> -----------------------
>
>                 Key: HIVE-16351
>                 URL: https://issues.apache.org/jira/browse/HIVE-16351
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Serializers/Deserializers
>    Affects Versions: 1.2.1
>         Environment: Hadoop 2.7.3
>            Reporter: Daniel Doubrovkine
>         Attachments: Screen Shot 2017-04-03 at 4.33.23 PM.png
>
>
> Hive is returning broken data that contains CR/LF.
> {code}
> CREATE DATABASE positron;
> CREATE EXTERNAL TABLE positron.articles (
>   `_id` struct<oid:string>,
>   `channel_id` struct<oid:string>,
>   `exclude_google_news` boolean,
>   `fair_ids` array<map<string,string>>,
>   `hero_section` map<string,string>,
>   `partner_ids` array<map<string,string>>,
>   `description` string,
>   `partner_channel_id` struct<oid:string>,
>   `published` boolean,
>   `published_at` map<string,string>,
>   `slugs` array<string>,
>   `sections` array<map<string,string>>,
>   `thumbnail_image` string,
>   `title` string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> WITH serdeproperties ('mapping.oid' = '$oid')
> LOCATION '/user/data/positron/articles';
> SELECT
>   a.slugs[SIZE(a.slugs) - 1],
>   a.title,
>   a.thumbnail_image
> FROM
>   positron.articles as a
> WHERE a.published = true
> AND a.hero_section["type"] = "video"
> AND (a.channel_id IS NOT NULL OR a.partner_channel_id IS NOT NULL)
> AND a.thumbnail_image IS NOT NULL;
> {code}
> Note the NULLs below. The `title` field has a CR/LF in it.
> {code}
> astrid-caroline-cole-sneak-peek-realities-by-marc-gumpinger   Sneak peek "Realities"
by Marc Gumpinger   https://artsy-media-uploads.s3.amazonaws.com/bUb1l_4g6cMhcDxEaPYDxw%2Facc_signature.png
> artsy-editorial-how-art-fairs-expanded-the-contemporary-art-market   The Art Market,
Explained: The Rise of the Art Fair   https://artsy-media-uploads.s3.amazonaws.com/j8GIeamyufubMBgJFNHbFA%2Fartfairsex.jpg
> nolongercreek  Alexandra Kehayoglou x Artsy: NULL
> No Longer Creek   https://d32dm0rphc51dk.cloudfront.net/5oRwy7ysKHohtahIYUTE9Q/larger.jpg
NULL
> kukje-gallery-teaser-trailer-kim-yong-ik  Teaser Trailer | Kim Yong-Ik  https://artsy-media-uploads.s3.amazonaws.com/mmoZcz0imuUzCavkObKgVQ%2Fkyi+thumbnail.PNG
> {code}
> {code}
> $ hive --version
> Hive 1.1.0-cdh5.6.0
> Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.6.0-Packaging-Hive-2016-01-28_21-19-00/hive-1.1.0+cdh5.6.0+377-1.cdh5.6.0.p0.110~trusty
-r Unknown
> Compiled by jenkins on Thu Jan 28 21:35:50 PST 2016
> From source with checksum b4a8fadbcf1ca36d11d91805d3ec2743
> {code}
> What's very interesting is that I am not able to reproduce this locally with the same
data with any version of hive. It only happens in our Cloudera cluster. Any help appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message