hadoop-pig-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1104) [zebra] Provide streaming support in Zebra.
Date Sat, 05 Dec 2009 23:19:20 GMT

    [ https://issues.apache.org/jira/browse/PIG-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786535#action_12786535 ]

Hadoop QA commented on PIG-1104:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426998/PIG-1104.patch
  against trunk revision 887401.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 20 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/98/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/98/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/98/console

This message is automatically generated.

> [zebra] Provide streaming support in Zebra.
> -------------------------------------------
>
>                 Key: PIG-1104
>                 URL: https://issues.apache.org/jira/browse/PIG-1104
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Chao Wang
>            Assignee: Chao Wang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: PIG-1104.patch
>
>
> Hadoop streaming is very popular among Hadoop users. The main attraction is its simplicity
> of use. A user can write the application logic in any language and process large amounts of
> data using the Hadoop framework. As more people start to use Zebra to store their data, we
> expect that users will want to run Hadoop streaming scripts to process Zebra tables easily.
> The following is a simple example of using Hadoop streaming to access Zebra data.
> It loads data from the table foo using Zebra's TableInputFormat and then writes the data to
> output using the default TextOutputFormat.
> $ hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 -input foo -output output -mapper 'cat' -inputformat org.apache.hadoop.zebra.mapred.TableInputFormat
> In more detail, Zebra uses Pig's DefaultTuple implementation of Tuple for its records. Currently,
> when Zebra's TableInputFormat is used for input, the user script sees each line as
> " key_if_any\tTuple.toString() ". We plan to generate a CSV representation of our Pig
> tuples instead. To this end, we plan to do the following:
> 1) Derive a subclass ZebraTuple from Pig's DefaultTuple class and override its toString()
> method to present the data in CSV format.
> 2) On the Zebra side, the tuple factory should be changed to create ZebraTuple objects instead
> of DefaultTuple objects.
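
For illustration, the proposed line format can be sketched from the streaming script's point of view. This is only a sketch under stated assumptions, not the patch's behavior: the issue does not specify quoting rules, so Python's default csv dialect is assumed, and a single tab is assumed to separate the optional key from the CSV tuple. The names format_record and parse_record are hypothetical helpers, not part of Zebra or Pig.

```python
import csv
import io


def format_record(key, fields):
    """Build one "key\\tCSV" line (assumed layout; key may be None)."""
    buf = io.StringIO()
    # lineterminator="" keeps the CSV text on a single line without "\r\n".
    csv.writer(buf, lineterminator="").writerow(fields)
    body = buf.getvalue()
    return body if key is None else key + "\t" + body


def parse_record(line):
    """Split one streaming input line into (key, list_of_field_strings)."""
    line = line.rstrip("\n")
    key, tab, rest = line.partition("\t")
    if not tab:
        # No tab found: assume the line carries a tuple without a key.
        key, rest = None, line
    fields = next(csv.reader([rest]), [])
    return key, fields
```

A streaming mapper written this way would call parse_record on each line read from standard input. All fields arrive as strings, so any type conversion is left to the script. A keyless tuple whose fields contain a tab would be misparsed under these assumptions.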
> Note that we can only support streaming on the input side, that is, the ability to use
> streaming to read data from Zebra tables. On the output side, streaming support is not feasible:
> the streaming mapper or reducer only emits "Text\tText", so the output collector has no
> way of knowing how to convert this to (BytesWritable,Tuple).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

