pig-dev mailing list archives

From "Vivek Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1842) Improve Scalability of the XMLLoader for large datasets such as wikipedia
Date Tue, 01 Mar 2011 09:26:36 GMT

    [ https://issues.apache.org/jira/browse/PIG-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000794#comment-13000794 ]

Vivek Padmanabhan commented on PIG-1842:
----------------------------------------

The errors occur because PIG-1839 (XMLLoader will always add an extra empty tuple even if no
tags are matched), which corrects these test cases, was not applied to the 0.8 branch.

> Improve Scalability of the XMLLoader for large datasets such as wikipedia
> -------------------------------------------------------------------------
>
>                 Key: PIG-1842
>                 URL: https://issues.apache.org/jira/browse/PIG-1842
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0, 0.9.0
>            Reporter: Viraj Bhat
>            Assignee: Vivek Padmanabhan
>             Fix For: 0.7.0, 0.8.0, 0.9.0
>
>         Attachments: PIG-1842_1.patch, PIG-1842_2.patch, TEST-org.apache.pig.piggybank.test.storage.TestXMLLoader.txt
>
>
> The current XMLLoader for Pig does not work well for large datasets such as the Wikipedia
> dataset. Each mapper reads in the entire XML file, resulting in extremely slow run times.
> Viraj
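
To illustrate the scalability problem described above, here is a minimal sketch of split-aware
record extraction. It is not the PIG-1842 patch and not XMLLoader's code. It assumes the common
Hadoop-style convention that a split owns every record whose opening tag begins inside its byte
range, and that the reader may run past the end of the split to finish the last record. The
function name records_in_split and the sample document are hypothetical. Tags with attributes
and nested tags of the same name are deliberately not handled.

```python
def records_in_split(data: bytes, start: int, end: int, tag: str):
    """Yield each <tag>...</tag> element whose opening tag begins in [start, end)."""
    open_tag = b"<" + tag.encode() + b">"
    close_tag = b"</" + tag.encode() + b">"
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        # Stop when no opening tag is left or the next one belongs to a later split.
        if begin == -1 or begin >= end:
            return
        finish = data.find(close_tag, begin)
        if finish == -1:
            return  # unterminated element: nothing more to emit
        finish += len(close_tag)
        # The element may extend past `end`; the split that owns it reads it fully.
        yield data[begin:finish]
        pos = finish


# Hypothetical sample input, cut into two byte ranges as two mappers would see it.
doc = b"<pages><page>a</page><page>bb</page><page>ccc</page></pages>"
mid = len(doc) // 2
first = list(records_in_split(doc, 0, mid, "page"))
second = list(records_in_split(doc, mid, len(doc), "page"))
```

Under these assumptions each element is emitted by exactly one split, so no mapper needs to scan
the whole file. A real reader would seek to the split offset in the input stream instead of
holding the document in memory.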

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
