hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-51) Generate and accept JSON as the input-output format from mappers and reducers
Date Fri, 14 Nov 2008 01:58:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647496#action_12647496
] 

Zheng Shao commented on HIVE-51:
--------------------------------

An alternative approach is to specify that right in the query:

MAP table.col1, table.col2
ROW FORMAT SERDE 'JSONSerDe'
USING 'python filter.py'
AS x1, x2
ROW FORMAT SERDE 'JSONSerDe'

This makes the syntax for specifying row format the same in map/reduce scripts and in create
table statement.
At the same time we will be able to support ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
etc.


> Generate and accept JSON as the input-output format from mappers and reducers
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-51
>                 URL: https://issues.apache.org/jira/browse/HIVE-51
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Venky Iyer
>
> set mapred.data.format=JSON;
> ....
> MAP USING 'python filter.py'
> ....;
> would mean that filter.py would receive a JSON formatted dictionary of the columns instead
of a tab-delimited string.
> { column1: value1, column2: [1,2,3] } etc
> It would in turn produce JSON.
> This should be done so that the JSON is not transmitted back and forth over the network;
it would be generated on the fly on the mapper node, and converted back to the standard format
used (tab-delimited, I assume).
> This seems like the simplest way for encoding type information in the input to mappers;
it would also remove the need for silly boilerplate code that took a list of expected input
column names, took the input stream, split it up, and made a dictionary of {column name: value}
on every record.
> Output schemas (USING '' AS ...) might also be redundant with this in place, but I'm
not sure if that is doable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message