hive-dev mailing list archives

From "Lefty Leverenz (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-5871) Use multiple-characters as field delimiter
Date Wed, 10 Sep 2014 04:47:30 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829976#comment-13829976
] 

Lefty Leverenz edited comment on HIVE-5871 at 9/10/14 4:47 AM:
---------------------------------------------------------------

This implementation mainly relies on LazySimpleSerDe for serialization and deserialization.
I added some methods to LazyStruct to parse a row delimited by a multiple-character string.
Another difference from LazySimpleSerDe is that MultiDelimitSerDe doesn't use Base64 to encode
binary fields in serialization, because the encoded string may interfere with the delimiter.
I also modified LazyBinary so that when it deserializes a binary field and is unable to
Base64 decode the field, it just keeps the data unchanged. A simple use case is as follows:

create table test (id string,hivearray array<binary>,hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES
("field.delimited"="[,]","collection.delimited"=":","mapkey.delimited"="@");

where field.delimited is the multiple-char field delimiter, collection.delimited is the delimiter
for collection items, and mapkey.delimited is the delimiter between keys and values in maps. We currently
don't support multiple-char values for the latter two delimiters.
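
To make the parsing and the binary fallback described above concrete, here is a minimal
Python sketch. It is not the LazyStruct or LazyBinary code from the patch: parse_row and
decode_binary are hypothetical names, and the sketch only assumes the three delimiter values
from the example ("[,]", ":" and "@") and the three-column table shown there.

```python
import base64
import binascii

# Delimiter values taken from the example table definition above.
FIELD_DELIM = "[,]"       # multiple-character field delimiter
COLLECTION_DELIM = ":"    # single-character collection item delimiter
MAPKEY_DELIM = "@"        # single-character map key/value delimiter


def decode_binary(raw: bytes) -> bytes:
    """Try to Base64-decode a binary field; keep the data unchanged on failure.

    Hypothetical helper illustrating the described fallback, not Hive's code.
    """
    try:
        return base64.b64decode(raw, validate=True)
    except binascii.Error:
        return raw


def parse_row(line: str):
    """Split one text row into (id, array<binary>, map<string,int>)."""
    # str.split treats the separator literally, so "[,]" is not a regex here.
    id_field, array_field, map_field = line.split(FIELD_DELIM)
    array = [decode_binary(item.encode()) for item in array_field.split(COLLECTION_DELIM)]
    mapping = {}
    for entry in map_field.split(COLLECTION_DELIM):
        key, value = entry.split(MAPKEY_DELIM, 1)
        mapping[key] = int(value)
    return id_field, array, mapping


# One array item is valid Base64, the other is not and is kept as-is.
print(parse_row("7[,]aGk=:r#w[,]a@1:b@2"))
```

Note that a raw value which happens to be valid Base64 would still be decoded by this kind
of fallback; the sketch does not try to resolve that ambiguity.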

<Edited 10/Sep/14 on behalf of Rui Li>  This comment's example differs from the final
version of the patch.  See the description above for an accurate example, and note that the
SERDEPROPERTIES are *.delim rather than *.delimited.


> Use multiple-characters as field delimiter
> ------------------------------------------
>
>                 Key: HIVE-5871
>                 URL: https://issues.apache.org/jira/browse/HIVE-5871
>             Project: Hive
>          Issue Type: Improvement
>          Components: Contrib
>    Affects Versions: 0.12.0
>            Reporter: Rui Li
>            Assignee: Rui Li
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-5871.2.patch, HIVE-5871.3.patch, HIVE-5871.4.patch, HIVE-5871.5.patch,
HIVE-5871.6.patch, HIVE-5871.patch
>
>
> By default, Hive allows only a single character as the field delimiter. Although
RegexSerDe can be used to specify a multiple-character delimiter, it can be daunting to use, especially
for amateurs.
> The patch adds a new SerDe named MultiDelimitSerDe. With MultiDelimitSerDe, users can
specify a multiple-character field delimiter when creating tables, in a way that closely resembles
typical table creation. For example:
> {code}
> create table test (id string,hivearray array<binary>,hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES
("field.delim"="[,]","collection.delim"=":","mapkey.delim"="@");
> {code}
> where {{field.delim}} is the field delimiter, and {{collection.delim}} and {{mapkey.delim}}
are the delimiters for collection items and key-value pairs, respectively. Among these delimiters,
{{field.delim}} is mandatory and can consist of multiple characters, while {{collection.delim}}
and {{mapkey.delim}} are optional and support only a single character.
> To use MultiDelimitSerDe, you have to add the hive-contrib jar to the class path, e.g.
with the {{add jar}} command.
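
The constraints on the three properties in the description above can be summarized with a
small Python sketch. This is only an illustration of the stated rules, not the validation
performed by MultiDelimitSerDe; validate_delims is a hypothetical name, and returning None
for an omitted optional delimiter is an assumption, since the description does not say which
defaults apply.

```python
def validate_delims(props: dict) -> dict:
    """Check SERDEPROPERTIES-style settings against the rules in the description.

    Hypothetical helper: field.delim is mandatory and may be several characters;
    collection.delim and mapkey.delim are optional and must be a single character.
    """
    field = props.get("field.delim")
    if not field:
        raise ValueError("field.delim is mandatory")

    resolved = {"field.delim": field}
    for key in ("collection.delim", "mapkey.delim"):
        value = props.get(key)
        if value is not None and len(value) != 1:
            raise ValueError(f"{key} supports only a single character, got {value!r}")
        # Assumption: an omitted optional delimiter is reported as None here.
        resolved[key] = value
    return resolved


# The settings from the example in the description pass the check.
print(validate_delims({"field.delim": "[,]", "collection.delim": ":", "mapkey.delim": "@"}))
```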



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
