hive-dev mailing list archives

From "ronan stokes (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8763) Support for use of enclosed quotes in LazySimpleSerde
Date Mon, 10 Nov 2014 10:54:34 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ronan stokes updated HIVE-8763:
-------------------------------
    Attachment: HIVE-8763.1.patch

Initial patch for changes

Adds 4 new properties to LazySimpleSerde

"field.doublequotes.as.quote" (true/false - default false)
"field.rtrim" - (true/false - default false)
"field.ltrim" - (true/false - default false)

and uses "quote.delim" to specify the quote character

Currently, support is for read only and applies to CHAR, VARCHAR and STRING fields in CSV
style data

The ltrim and rtrim options allow both string and non-string data to be parsed correctly
when leading or trailing spaces or tabs are present. For strings, ltrim/rtrim will skip
whitespace on either side of the quotes if quote enclosures are being used.
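
To illustrate the trimming behaviour described above, here is a minimal Python sketch. It is not the code in the patch: the function name unwrap_field and its parameters are hypothetical, and it assumes that only spaces and tabs are trimmed and that quotes count as an enclosure only when they are the first and last characters left after trimming.

```python
def unwrap_field(raw, quote='"', ltrim=False, rtrim=False):
    """Trim spaces/tabs around one raw field, then drop enclosing quotes.

    Hypothetical helper for illustration; not part of Hive.
    """
    start, end = 0, len(raw)
    if ltrim:
        # Skip leading spaces and tabs.
        while start < end and raw[start] in " \t":
            start += 1
    if rtrim:
        # Skip trailing spaces and tabs.
        while end > start and raw[end - 1] in " \t":
            end -= 1
    body = raw[start:end]
    # Assumption: quotes enclose the field only if they are the first and
    # last characters remaining after trimming.
    if len(body) >= 2 and body[0] == quote and body[-1] == quote:
        return body[1:-1]
    return body


if __name__ == "__main__":
    print(repr(unwrap_field('  "3.5 inch"\t', ltrim=True, rtrim=True)))
    print(repr(unwrap_field("  2650.30", ltrim=True)))
```

Whitespace inside the quotes is kept, and with both options off a field such as `  "3.5 inch"` is returned unchanged because the quote is not its first character.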





> Support for use of enclosed quotes in LazySimpleSerde
> -----------------------------------------------------
>
>                 Key: HIVE-8763
>                 URL: https://issues.apache.org/jira/browse/HIVE-8763
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
>         Environment: many - verified on Centos / Redhat with CDH
>            Reporter: ronan stokes
>         Attachments: HIVE-8763.1.patch
>
>
> Currently the LazySimpleSerde does not support the use of quotes for delimited fields
> to allow use of separators within a quoted field - this means having to use alternatives for
> many common use cases for CSV style data.
> Key scenarios that do not work include:
> (3 column row for int, string, float delimited by ',')
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 100,  "3.5 "" hard drive, quantity 10",  2650.30
> 100,"3.5 "" hard drive, quantity 10",2650.30
> To address this, I have implemented a number of fixes in the deserialization stage
> of a copy of the LazySimpleSerde.
> For serialization, the code is unchanged, with the relevant embedded characters being
> escaped.
> Assuming a row with 3 fields - SKU ID, description, price, delimited by ','
> 1) allow use of enclosed quotes around a string field 
> For example 
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 2) support escaping of quotes within field to allow use of embedded quote
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 3) support for old style CSV embedded quotes 
> for example 
> 100,"3.5 "" hard drive, quantity 10",2650.30
> 4) support for skipping of leading spaces in field
> For example (note space between first ',' and opening quote)
> 100,  "3.5 "" hard drive, quantity 10",  2650.30
> In each case, with the changes these are evaluated as though the delimiters and embedded
> quotes were escaped:
> e.g
> 100, 3.5 \" hard drive\, quantity 10,  2650.30
> All of these are enabled or disabled using serde properties for the quote char, whether
> enclosed quotes are supported, and whether doubled embedded quotes are treated as a single
> quote (of the same char type)
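
The four scenarios in the description can be sketched with a small Python row splitter. This is an illustration under stated assumptions, not the LazySimpleSerde implementation: split_row and its keyword arguments are hypothetical names that loosely mirror the serde properties, the escape character is assumed to be a backslash, an escaped character is taken literally, and any text between a closing quote and the next delimiter is dropped.

```python
def split_row(line, delim=",", quote='"', escape="\\",
              doublequotes_as_quote=False, ltrim=False, rtrim=False):
    """Split one delimited row into unescaped field strings.

    Hypothetical helper for illustration; not part of Hive.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        if ltrim:
            # Skip spaces/tabs before the field (and before an opening quote).
            while i < n and line[i] in " \t":
                i += 1
        buf = []
        quoted = i < n and line[i] == quote
        if quoted:
            i += 1  # consume the opening quote
            while i < n:
                ch = line[i]
                if ch == escape and i + 1 < n:
                    buf.append(line[i + 1])  # escaped char taken literally
                    i += 2
                elif ch == quote:
                    if (doublequotes_as_quote and i + 1 < n
                            and line[i + 1] == quote):
                        buf.append(quote)  # doubled quote -> one quote
                        i += 2
                    else:
                        i += 1  # closing quote
                        break
                else:
                    buf.append(ch)
                    i += 1
            # Simplification: drop anything up to the next delimiter.
            while i < n and line[i] != delim:
                i += 1
        else:
            while i < n and line[i] != delim:
                ch = line[i]
                if ch == escape and i + 1 < n:
                    buf.append(line[i + 1])
                    i += 2
                else:
                    buf.append(ch)
                    i += 1
        value = "".join(buf)
        if rtrim and not quoted:
            value = value.rstrip(" \t")
        fields.append(value)
        if i >= n:
            break
        i += 1  # skip the delimiter
    return fields


if __name__ == "__main__":
    row = '100,  "3.5 "" hard drive, quantity 10",  2650.30'
    print(split_row(row, doublequotes_as_quote=True, ltrim=True))
```

With these assumptions, the escaped-quote and doubled-quote rows both produce the same description field as each other, with a single embedded quote and the embedded comma preserved.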



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
