pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Date Mon, 28 Jun 2010 23:37:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes

--------------------------------------------------

New page:
= Avoiding Serialization/De-serialization in Pig
Serialization/De-serialization is expensive and avoiding it will improve performance.


= Delaying/Avoiding deserialization at runtime
These approaches do not involve any changes to core Pig code. Load functions, or the serialization
between map and reduce, can be changed separately to improve performance.
 1. '''!LoadFunctions make use of the public interface !LoadPushDown.pushProjection.''' Don't
deserialize columns that are not in the required field list. This should always improve performance.
!PigStorage indirectly works this way: if a column is not used, the optimizer removes the
casting (i.e. deserialization) of the column from the type-casting foreach statement that comes
after the load.
 1. '''!LoadFunction returns a custom tuple, which deserializes fields only when tuple.get(i)
is called.''' This can be useful if the first operator after load is a filter operator:
the whole filter expression might not have to be evaluated, so not all columns might have to
be deserialized. Assuming the first approach is already implemented, this approach is likely
to add some overhead for queries where tuple.get(i) is called on all columns/rows.
 1. '''!LoadFunction delays deserialization of map and bag types until a member function of
java.util.Map or !DataBag is called.''' The load function uses subclasses of Map and !DataBag
that hold the serialized copy. This will help in delaying the deserialization further. This
can't be done for scalar types because the classes Pig uses for them are final; even if that
were not the case, we might not see much of a performance gain because the cost of creating
a copy of the serialized data might be high compared to the cost of deserialization. This
will only delay deserialization up to the MR boundaries.
{{{
-- Example of a query where this will help
l = LOAD 'file1' AS (a : int, b : map[]);
-- Approach 2 will not help in delaying deserialization beyond this point.
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
-- Deserialization of column b can be delayed until here using this approach.
DUMP fil;
}}}
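The lazy map of approach 3 can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Pig code: the class names `LazyMapSketch` and `LazyMap`, the `key#value,key#value` text encoding, and the String-valued map are all hypothetical, and a real loader would plug in its own format. The point is that the serialized bytes are kept as-is and parsed only when a member function of java.util.Map is first called.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LazyMapSketch {

    /** Hypothetical read-only map that holds serialized bytes until first use. */
    static final class LazyMap extends AbstractMap<String, String> {
        private final byte[] serialized;     // raw bytes as read by the loader
        private Map<String, String> parsed;  // stays null until a Map method is called

        LazyMap(byte[] serialized) {
            this.serialized = serialized;
        }

        /** True once the bytes have been parsed. */
        boolean isDeserialized() {
            return parsed != null;
        }

        /** Raw bytes, so a writer that knows the format can copy them without parsing. */
        byte[] rawBytes() {
            return serialized;
        }

        // AbstractMap builds get(), size(), containsKey(), ... on top of entrySet(),
        // so any member function triggers parsing through this single method.
        @Override
        public Set<Map.Entry<String, String>> entrySet() {
            if (parsed == null) {
                parsed = parse(serialized);
            }
            return parsed.entrySet();
        }

        // Assumed encoding for this sketch only: "key#value,key#value" in UTF-8.
        private static Map<String, String> parse(byte[] bytes) {
            Map<String, String> result = new LinkedHashMap<>();
            String text = new String(bytes, StandardCharsets.UTF_8);
            if (text.isEmpty()) {
                return result;
            }
            for (String pair : text.split(",")) {
                String[] kv = pair.split("#", 2);
                result.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
            return result;
        }
    }

    public static void main(String[] args) {
        LazyMap b = new LazyMap("x#1,y#2".getBytes(StandardCharsets.UTF_8));
        System.out.println(b.isDeserialized()); // false: nothing parsed yet
        System.out.println(b.get("y"));         // first member call parses the bytes
        System.out.println(b.isDeserialized()); // true
    }
}
```

In the query above, column b would travel through the FOREACH and the FILTER as an unparsed `LazyMap` and would be parsed only when DUMP reads its entries.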
 1.#4 '''Set the property "pig.data.tuple.factory.name" to use a tuple that understands the serialization
format used for bags and maps in approach 3, so that serialized data can be passed from the
loader across MR boundaries in the serialization format of the load function.''' The write()
and readFields() functions of the tuple returned by TupleFactory are used to serialize data between
Map and Reduce. To use a new custom tuple, you need to use a custom TupleFactory that returns
tuples of this type. But this approach will work only for a set of load functions in the query
that share the same serialization format for maps and bags.
 1. '''Expose the load function's sedes functionality in a new interface and track the lineage of columns.'''
This will be the elegant and extensible way of doing what is proposed in approach 4. For each
serialized column, if we know the deserialization function, we can delay deserialization across
MR boundaries.
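
Approaches 1 and 2 above can be illustrated together with a small standalone Java sketch. It does not use Pig's Tuple or !LoadFunc classes; `LazyTupleSketch`, `LazyTuple`, the tab-delimited record, and the all-int schema are assumptions made only for this example. The tuple keeps just the required columns in serialized form (approach 1) and converts a field to its Java type only when get(i) is called (approach 2).

```java
public class LazyTupleSketch {

    /** Hypothetical tuple over one tab-delimited record whose fields are all ints. */
    static final class LazyTuple {
        private final String[] rawFields; // serialized form of each required field
        private final Integer[] cache;    // deserialized values, filled on demand
        private int deserializeCount = 0; // how many fields were actually converted

        LazyTuple(String line, int[] requiredColumns) {
            String[] all = line.split("\t", -1);
            // Approach 1: keep only the columns the query asked for.
            rawFields = new String[requiredColumns.length];
            for (int i = 0; i < requiredColumns.length; i++) {
                rawFields[i] = all[requiredColumns[i]];
            }
            cache = new Integer[rawFields.length];
        }

        // Approach 2: the type conversion happens here, not at load time.
        Integer get(int i) {
            if (cache[i] == null) {
                cache[i] = Integer.valueOf(rawFields[i]);
                deserializeCount++;
            }
            return cache[i];
        }

        int size() {
            return rawFields.length;
        }

        int deserializeCount() {
            return deserializeCount;
        }
    }

    public static void main(String[] args) {
        // The record has four columns; the query needs only columns 0, 2 and 3.
        LazyTuple t = new LazyTuple("3\t10\t20\t30", new int[] {0, 2, 3});

        // A filter such as ($0 > 5 AND $1 > 0) short-circuits on the first field,
        // so the second and third required fields are never deserialized.
        boolean keep = t.get(0) > 5 && t.get(1) > 0;
        System.out.println(keep);                 // false
        System.out.println(t.deserializeCount()); // 1
    }
}
```

The per-field null check and cache are the overhead mentioned in approach 2: when every field of every row is read anyway, this bookkeeping is pure cost compared with deserializing eagerly.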
