pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath
Date Tue, 23 Feb 2010 23:20:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=30&rev2=31

--------------------------------------------------

  
  == Table mapping old API calls to new API calls in rough order of call sequence ==
  ||'''Old Method in !LoadFunc''' ||'''Equivalent New Method''' ||'''New Class/Interface in
which method is present''' ||'''Explanation''' ||
- ||No equivalent method ||setUDFContextSignature() ||!LoadFunc ||This method will be called
by Pig both in the front end and back end to pass a unique signature to the Loader. The signature
can be used to store into the !UDFContext any information which the Loader needs to store
between various method invocations in the front end and back end. A use case is to store !RequiredFieldList
passed to it in !LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before
returning tuples in getNext() ||
+ ||No equivalent method ||setUDFContextSignature() ||!LoadFunc ||This method will be called
by Pig both in the front end and back end to pass a unique signature to the Loader. The signature
can be used to store into the !UDFContext any information which the Loader needs to store
between various method invocations in the front end and back end. A use case is to store !RequiredFieldList
passed to it in !LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before
returning tuples in getNext(). The default implementation in !LoadFunc has an empty body.
||
- ||No equivalent method ||relativeToAbsolutePath() ||!LoadFunc ||Pig runtime will call this
method to allow the Loader to convert a relative load location to an absolute location. The
default implementation provided in !LoadFunc handles this for hdfs files and directories.
If the load source is something else, loader implementation may choose to override this. ||
+ ||No equivalent method ||relativeToAbsolutePath() ||!LoadFunc ||Pig runtime will call this
method to allow the Loader to convert a relative load location to an absolute location. The
default implementation provided in !LoadFunc handles this for !FileSystem locations. If the
load source is something else, loader implementation may choose to override this. ||
  ||determineSchema() ||getSchema() ||!LoadMetadata ||determineSchema() was used by old code
to ask the loader to provide a schema for the data returned by it - the same semantics are
now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader cannot return a schema for the data ||
  ||fieldsToRead() ||pushProjection() ||!LoadPushDown ||fieldsToRead() was used by old code to
convey to the loader the exact fields required by the pig script - the same semantics are now
achieved through pushProjection() of the !LoadPushDown interface. !LoadPushDown is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader is not capable of returning just the required fields and
will return all fields in the data. If a loader implementation is able to efficiently return
only required fields, it should implement !LoadPushDown to improve query performance ||
  ||No equivalent method ||getInputFormat() ||!LoadFunc ||This method will be called by Pig
to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying
!RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop
in a map-reduce java program. '''If the !InputFormat is a hadoop packaged one, the implementation
should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat,
it should be implemented using the new API in org.apache.hadoop.mapreduce'''||
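
The signature pattern described in the setUDFContextSignature() row can be sketched as follows. This is a minimal, self-contained illustration and not Pig code: SignatureStore is a hypothetical stand-in for the !UDFContext, SketchLoader is a hypothetical loader, the required fields are passed as a plain String instead of a !RequiredFieldList, and a static map stands in for whatever mechanism carries the stored information from the front end to the back end. The real classes and method signatures should be taken from the Pig javadoc.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for the UDFContext: one Properties bag per signature.
class SignatureStore {
    private static final Map<String, Properties> BAGS = new HashMap<>();

    static Properties forSignature(String signature) {
        return BAGS.computeIfAbsent(signature, k -> new Properties());
    }
}

// Hypothetical loader that keeps its state under the signature it was given.
class SketchLoader {
    private String signature;

    // Called in both the front end and the back end with the same value.
    public void setUDFContextSignature(String signature) {
        this.signature = signature;
    }

    // Front end: remember which fields the script needs.
    public void pushProjection(String requiredFields) {
        SignatureStore.forSignature(signature)
                .setProperty("required.fields", requiredFields);
    }

    // Back end: look the fields up again; "*" means nothing was stored.
    public String requiredFields() {
        return SignatureStore.forSignature(signature)
                .getProperty("required.fields", "*");
    }
}

class SignatureDemo {
    public static void main(String[] args) {
        SketchLoader frontEnd = new SketchLoader();
        frontEnd.setUDFContextSignature("load-1");
        frontEnd.pushProjection("0,2");

        // A different instance with the same signature sees the same state.
        SketchLoader backEnd = new SketchLoader();
        backEnd.setUDFContextSignature("load-1");
        System.out.println(backEnd.requiredFields());
    }
}
```

Keying the stored state by signature rather than by class is what keeps two uses of the same loader class in one script from overwriting each other's information.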
@@ -31, +31 @@

  An example of how a simple !LoadFunc implementation based on old interface can be converted
to the new interfaces is shown in the Examples section below.
  
  = StoreFunc Migration =
- The main change is that the new !StoreFunc API is based on an !OutputFormat to write the data.
Implementations can choose to use an existing !OutputFormat like !TextOutputFormat or implement
a new one.
+ !StoreFunc is now an abstract class providing default implementations for some of the methods.
The main change is that the new !StoreFunc API is based on an !OutputFormat to write the data.
Implementations can choose to use an existing !OutputFormat like !TextOutputFormat or implement
a new one.
  
  == Table mapping old API calls to new API calls in rough order of call sequence ==
  ||'''Old Method in !StoreFunc''' ||'''Equivalent New Method''' ||'''New Class/Interface
in which method is present''' ||'''Explanation''' ||
- ||No equivalent method ||setStoreFuncUDFContextSignature() ||!StoreFunc ||This method will
be called by Pig both in the front end and back end to pass a unique signature to the Storer.
The signature can be used to store into the UDFContext any information which the Storer needs
to store between various method invocations in the front end and back end. ||
+ ||No equivalent method ||setStoreFuncUDFContextSignature() ||!StoreFunc ||This method will
be called by Pig both in the front end and back end to pass a unique signature to the Storer.
The signature can be used to store into the UDFContext any information which the Storer needs
to store between various method invocations in the front end and back end. The default implementation
in !StoreFunc has an empty body. ||
- ||No equivalent method ||relToAbsPathForStoreLocation() ||!StoreFunc ||Pig runtime will
call this method to allow the Storer to convert a relative store location to an absolute location.
An implementation is provided in !LoadFunc (as a static method) which handles this for hdfs
files and directories. ||
+ ||No equivalent method ||relToAbsPathForStoreLocation() ||!StoreFunc ||Pig runtime will
call this method to allow the Storer to convert a relative store location to an absolute location.
An implementation is provided in !StoreFunc which handles this for !FileSystem based locations.
||
- ||No equivalent method ||checkSchema() ||!StoreFunc ||A Store function should implement
this function to check that a given schema describing the data to be written is acceptable
to it ||
+ ||No equivalent method ||checkSchema() ||!StoreFunc ||A Store function should implement
this function to check that a given schema describing the data to be written is acceptable
to it. The default implementation in !StoreFunc has an empty body.||
  ||No equivalent method ||setStoreLocation() ||!StoreFunc ||This method is called by Pig
to communicate the store location to the storer. The storer should use this method to communicate
the same information to the underlying !OutputFormat. This method is called multiple times
by Pig, so implementations should ensure that repeated calls have no inconsistent side
effects. ||
  ||getStorePreparationClass() ||getOutputFormat() ||!StoreFunc ||In the old API, getStorePreparationClass()
was the means by which the implementation could communicate to Pig the !OutputFormat to use
for writing - this is now achieved through getOutputFormat(). getOutputFormat() is NOT an
optional method and implementation SHOULD provide an !OutputFormat to use. The methods in
the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by pig
in the same manner (and in the same context) as by Hadoop in a map-reduce java program. '''If
the !OutputFormat is a hadoop packaged one, the implementation should use the new API based
one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented
using the new API under org.apache.hadoop.mapreduce'''. The checkOutputSpecs() method of the
!OutputFormat will be called by pig to check the output location up-front. This method will
also be called as part of the Hadoop call sequence when the job is launched. So implementations
should ensure that this method can be called multiple times without inconsistent side effects.
||
  ||bindTo() ||prepareToWrite() ||!StoreFunc ||bindTo() was the old method which would provide
an !OutputStream among other things to the !StoreFunc. The !StoreFunc implementation would
then write to the !OutputStream in putNext(). In the new API, writing of the data is through
the !OutputFormat provided by the !StoreFunc. So the equivalent call is prepareToWrite() wherein
the !RecordWriter associated with the !OutputFormat provided by the !StoreFunc is passed to
the !StoreFunc. The !RecordWriter can then be used by the implementation in putNext() to write
a tuple representing a record of data in a manner expected by the !RecordWriter. ||
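
The requirement that setStoreLocation() tolerate repeated calls can be met by overwriting state instead of accumulating it. The sketch below is an illustration under stated assumptions, not the Pig API: SketchStorer is a hypothetical class, java.util.Properties stands in for the job configuration, and the key name sketch.output.dir is made up. The parameters of the real method should be checked against the Pig javadoc.

```java
import java.util.Properties;

// Hypothetical storer; Properties stands in for the job configuration.
class SketchStorer {
    // Made-up key name, used only for this illustration.
    static final String OUTPUT_DIR_KEY = "sketch.output.dir";

    public void setStoreLocation(String location, Properties jobConf) {
        // Overwrite rather than append: a second call with the same
        // location leaves the configuration exactly as the first call did.
        jobConf.setProperty(OUTPUT_DIR_KEY, location);
    }
}

class StoreLocationDemo {
    public static void main(String[] args) {
        Properties conf = new Properties();
        SketchStorer storer = new SketchStorer();

        // Simulate the runtime calling the method more than once.
        storer.setStoreLocation("/tmp/out", conf);
        storer.setStoreLocation("/tmp/out", conf);

        System.out.println(conf.getProperty(SketchStorer.OUTPUT_DIR_KEY));
        System.out.println(conf.size());
    }
}
```

An implementation that appended to a list, opened a file, or incremented a counter inside this method would behave differently on the second call, which is the kind of side effect the table entry warns about.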
@@ -468, +468 @@

  }}}
  === New Implementation ===
  {{{
- public class SimpleTextStorer implements StoreFunc {
+ public class SimpleTextStorer extends StoreFunc {
      protected RecordWriter writer = null;
  
      private byte fieldDel = '\t';
@@ -662, +662 @@

          }
      }
  
-     @Override
-     public void checkSchema(ResourceSchema s) throws IOException {
-         // nothing to do
-     }
- 
-     @Override
-     public String relToAbsPathForStoreLocation(String location, Path curDir)
-             throws IOException {
-         return LoadFunc.getAbsolutePath(location, curDir);
-     }
- 
-     @Override
-     public void setStoreFuncUDFContextSignature(String signature) {
-         // nothing to do
-     }
- 
  }
  }}}
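
The three overrides removed above are no longer needed because an abstract base class can supply them. The following self-contained sketch shows the general pattern only: SketchStoreFunc and SketchTextStorer are hypothetical stand-ins, schemas and paths are simplified to String, and the path-joining rule is an assumption made for illustration rather than the behaviour of the real !StoreFunc.

```java
// Hypothetical stand-in for an abstract base class with default methods.
abstract class SketchStoreFunc {
    // Empty default: subclasses override only if they validate schemas.
    public void checkSchema(String schema) { }

    // Empty default: subclasses override only if they keep per-signature state.
    public void setStoreFuncUDFContextSignature(String signature) { }

    // Assumed default: absolute paths pass through, relative ones are
    // resolved against the current directory.
    public String relToAbsPathForStoreLocation(String location, String curDir) {
        return location.startsWith("/") ? location : curDir + "/" + location;
    }

    // Still abstract: every storer has to say how a record is written.
    public abstract void putNext(String record);
}

// The subclass implements the one abstract method and inherits the rest.
class SketchTextStorer extends SketchStoreFunc {
    final StringBuilder out = new StringBuilder();

    @Override
    public void putNext(String record) {
        out.append(record).append('\n');
    }
}

class AbstractDefaultsDemo {
    public static void main(String[] args) {
        SketchTextStorer storer = new SketchTextStorer();
        storer.checkSchema("a:int");  // inherited no-op
        storer.putNext("1\t2");
        System.out.println(storer.relToAbsPathForStoreLocation("out", "/user/pig"));
    }
}
```

With an interface, every implementation had to repeat these bodies; with an abstract class, a storer overrides a default only when its behaviour actually differs.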
  == Notes: ==
