hadoop-pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
Date Thu, 15 Jul 2010 21:25:50 GMT

     [ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-1473.

    Resolution: Won't Fix

Implementing lazy de-serialization using this approach would introduce a non-backward-compatible
change in PigStorage, so I am closing this jira as Won't Fix.

In PigStorage, if de-serialization fails, the value is treated as null, i.e. tuple.get(i)
returns null.
But if de-serialization is delayed by returning a subclass of map or bag that holds the
serialized data, the tuple.get(i) call will return a non-null value even if the serialized
format has a problem.
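
The difference in behavior can be illustrated with a minimal, self-contained Java sketch. It does not
use any Pig classes: LazySedesSketch, eagerParse and LazyMap are hypothetical names, and the
"[key#value,key#value]" text form is a simplified stand-in for the real PigStorage map syntax. The
eager path returns null for malformed input, while the lazy wrapper is already non-null when handed
to the caller and only reports the problem once a Map method is called.

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LazySedesSketch {

    // Hypothetical eager parser for a simplified "[key#value,key#value]" form.
    // Returns null on malformed input, mirroring the null-on-failure behavior
    // described above. This is not Pig's actual parsing code.
    static Map<String, Object> eagerParse(String text) {
        if (text == null || text.length() < 2
                || text.charAt(0) != '[' || text.charAt(text.length() - 1) != ']') {
            return null;
        }
        Map<String, Object> result = new LinkedHashMap<>();
        String body = text.substring(1, text.length() - 1);
        if (body.isEmpty()) {
            return result;
        }
        for (String entry : body.split(",")) {
            int sep = entry.indexOf('#');
            if (sep < 0) {
                return null; // entry without a key/value separator
            }
            result.put(entry.substring(0, sep), entry.substring(sep + 1));
        }
        return result;
    }

    // Hypothetical lazy map: keeps the serialized text and parses it only
    // when a java.util.Map method is first called.
    static final class LazyMap extends AbstractMap<String, Object> {
        private final String serialized;
        private Map<String, Object> parsed;

        LazyMap(String serialized) {
            this.serialized = serialized;
        }

        @Override
        public Set<Map.Entry<String, Object>> entrySet() {
            if (parsed == null) {
                parsed = eagerParse(serialized);
                if (parsed == null) {
                    // Too late to hand back null: the caller already holds this object.
                    throw new IllegalStateException("malformed map: " + serialized);
                }
            }
            return parsed.entrySet();
        }
    }

    public static void main(String[] args) {
        String bad = "[k1#v1,oops"; // missing closing bracket

        Map<String, Object> eager = eagerParse(bad);
        Map<String, Object> lazy = new LazyMap(bad);

        System.out.println("eager value is null: " + (eager == null));
        System.out.println("lazy value is null:  " + (lazy == null));

        // Well-formed data is parsed on first access.
        Map<String, Object> good = new LazyMap("[k1#v1,k2#v2]");
        System.out.println("k1 = " + good.get("k1"));
    }
}
```

In this sketch a null check on the lazy value passes even though the data is bad, and the failure
surfaces later as an exception. That is the kind of change in observable behavior that makes the
approach incompatible with existing PigStorage scripts.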

Though this approach is not being implemented in PigStorage() for this reason, other load/store
functions can potentially adopt this method.

> Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
> -----------------------------------------------------------------------------------------------------
>                 Key: PIG-1473
>                 URL: https://issues.apache.org/jira/browse/PIG-1473
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
> Cost of serialization/deserialization (sedes) can be very high and avoiding it will improve performance.
> Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes
> The load function uses a subclass of Map and DataBag which holds the serialized copy.
> LoadFunction delays deserialization of map and bag types until a member function of java.util.Map
> or DataBag is called.
> Example of query where this will help -
> {CODE}
> l = LOAD 'file1' AS (a : int, b : map [ ]);
> f = FOREACH l GENERATE udf1(a), b;      
> fil = FILTER f BY $0 > 5;
> dump fil; -- Deserialization of column b can be delayed until here using this approach
> {CODE}

