pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Date Mon, 28 Jun 2010 23:37:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "AvoidingSedes" page has been changed by ThejasNair.


- = Avoiding Serialization/De-serialization in pig
+ = Avoiding Serialization/De-serialization in pig =
  Serialization/De-serialization is expensive and avoiding it will improve performance.
- = Delaying/Avoiding deserialization at runtime
+ == Delaying/Avoiding deserialization at runtime ==
  These approaches do not involve any changes to core pig code. Load functions, or serialization
between map and reduce, can be changed separately to improve performance.
   1. '''!LoadFunctions make use of the public interface !LoadPushDown.pushDownProjection.'''
Don't deserialize columns that are not required. This should always improve performance.
!PigStorage indirectly works this way: if a column is not used, the optimizer removes the
casting (i.e. deserialization) of the column from the type-casting foreach statement which comes
after the load.
   1. '''!LoadFunction returns a custom tuple, which deserializes fields only when tuple.get(i)
is called.''' This can be useful if the first operator after load is a filter operator: the
whole filter expression might not have to be evaluated, so deserialization of all columns
might not have to be done. Assuming the first approach is already implemented, this approach
is likely to add some overhead for queries where tuple.get(i) is called on all columns/rows.
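
The two approaches above can be sketched in plain Java. This is only an illustration under stated assumptions, not Pig code: `LazyTuple` and `ProjectingLoader` are hypothetical names, they do not implement Pig's actual tuple or load-function interfaces, and parsing a decimal string into a `Long` stands in for whatever deserialization a real loader would do.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical lazy tuple (not Pig's Tuple interface): it keeps the raw bytes
 * of each field and deserializes a field only the first time get(i) is called.
 */
final class LazyTuple {
    private final byte[][] raw;      // serialized form of each field
    private final Object[] decoded;  // cache of already deserialized values
    private final boolean[] done;    // which fields have been deserialized
    private int decodeCount = 0;     // illustration only: counts deserializations

    LazyTuple(byte[][] raw) {
        this.raw = raw;
        this.decoded = new Object[raw.length];
        this.done = new boolean[raw.length];
    }

    Object get(int i) {
        if (!done[i]) {
            // Stand-in for an expensive deserializer: decimal text to Long.
            decoded[i] = Long.valueOf(new String(raw[i], StandardCharsets.UTF_8));
            done[i] = true;
            decodeCount++;
        }
        return decoded[i];
    }

    int size() {
        return raw.length;
    }

    int decodeCount() {
        return decodeCount;
    }
}

/**
 * Hypothetical loader illustrating the first approach: only the columns the
 * query needs are carried into the tuple; the others are never deserialized.
 */
final class ProjectingLoader {
    static LazyTuple load(String line, int[] requiredColumns) {
        String[] parts = line.split("\t", -1);
        byte[][] raw = new byte[requiredColumns.length][];
        for (int k = 0; k < requiredColumns.length; k++) {
            raw[k] = parts[requiredColumns[k]].getBytes(StandardCharsets.UTF_8);
        }
        return new LazyTuple(raw);
    }
}
```

For example, loading the line `"3\t40\t500"` with required columns `{0, 2}` yields a two-field tuple with nothing deserialized yet. Evaluating a filter such as `(Long) t.get(0) > 10 && (Long) t.get(1) < 1000` then deserializes only the first field, because the first comparison fails and `&&` short-circuits. The `done` flags and the cache are the bookkeeping that becomes pure overhead when every field of every row is read anyway.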
