pig-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath
Date: Wed, 17 Feb 2010 20:04:49 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=20&rev2=21

--------------------------------------------------

  This page describes how to migrate from the old !LoadFunc and !StoreFunc interfaces (Pig
0.1.0 through Pig 0.6.0) to the new interfaces proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal
and planned to be released in Pig 0.7.0. Besides the example on this page, users can also
look at the !LoadFunc and !StoreFunc implementations in the piggybank codebase (contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage)
for examples of migration. For example, !MultiStorage implements a custom !OutputFormat.
  
- A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation
should use the new Hadoop 20 API based on org.apache.hadoop.mapreduce package instead of the
old org.apache.hadoop.mapred package.
+ *A general note applicable to both !LoadFunc and !StoreFunc implementations is that the
implementation should use the new Hadoop 0.20 API based on the org.apache.hadoop.mapreduce
package instead of the old org.apache.hadoop.mapred package.*
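
For instance (an illustrative reminder, not part of the original note), the classes a new
loader or storer should build against all live under org.apache.hadoop.mapreduce:

{{{
// New Hadoop 0.20 ("mapreduce") API -- build new LoadFunc/StoreFunc code against these:
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.Job;

// Avoid the old "mapred" equivalents (e.g. org.apache.hadoop.mapred.InputFormat,
// org.apache.hadoop.mapred.OutputFormat) in new implementations.
}}}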
  
- The main motivation for these changes is to move closer to using !Hadoop's !InputFormat
and !OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc
implementation based on existing !Hadoop !InputFormat and !OutputFormat classes with minimal
code. The complexity of reading the data and creating a record will now lie in the !InputFormat
and likewise on the writing end, the complexity of writing will lie in the !OutputFormat.
This enables !Pig to easily read/write data in new storage formats as and when an !Hadoop
!InputFormat and !OutputFormat is available for them.
+ The main motivation for these changes is to move closer to using Hadoop's !InputFormat and
!OutputFormat classes. This way pig users/developers can create new !LoadFunc and !StoreFunc
implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal
code. The complexity of reading the data and creating a record will now lie in the !InputFormat
and, likewise, on the writing end the complexity of writing will lie in the !OutputFormat.
This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat
and !OutputFormat are available for them.
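
As a rough illustration of this pattern (a sketch only, not the page's worked example: the
class name SimpleTextLoader, the tab delimiter, and the choice of TextInputFormat are
assumptions), a loader written against the new API can delegate all record reading to a
Hadoop !InputFormat and only convert each record into a Pig Tuple:

{{{
// Sketch of a loader built on the new (Pig 0.7 / Hadoop 0.20) API.
// Record reading is delegated entirely to Hadoop's TextInputFormat; the class name
// and the tab delimiter below are illustrative assumptions.
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SimpleTextLoader extends LoadFunc {
    private RecordReader reader = null;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Communicate the load location to the underlying InputFormat.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Split computation and record reading are done by Hadoop.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Pig hands us the RecordReader of the InputFormat returned above.
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // no more records
            }
            Text line = (Text) reader.getCurrentValue();
            // Split the line on tabs and wrap each field as a bytearray.
            String[] fields = line.toString().split("\t", -1);
            ArrayList<Object> protoTuple = new ArrayList<Object>(fields.length);
            for (String field : fields) {
                protoTuple.add(new DataByteArray(field.getBytes()));
            }
            return tupleFactory.newTupleNoCopy(protoTuple);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
}}}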
  
  
  = LoadFunc Migration =
@@ -17, +17 @@

   
  == Table mapping old API calls to new API calls in rough order of call sequence ==
  || '''Old Method in !LoadFunc''' || '''Equivalent New Method''' || '''New Class/Interface
in which method is present''' || '''Explanation''' ||
- || No equivalent method || setUDFContextSignature() || !LoadFunc || This method will be
called by Pig both in the front end and back end to pass a unique signature to the Loader.
The signature can be used to store into the UDFContext} any information which the Loader needs
to store between various method invocations in the front end and back end. A use case is to
store !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList)
for use in the back end before returning tuples in getNext()||
+ || No equivalent method || setUDFContextSignature() || !LoadFunc || This method will be
called by Pig both in the front end and back end to pass a unique signature to the Loader.
The signature can be used to store into the !UDFContext any information which the Loader needs
to store between various method invocations in the front end and back end. A use case is to
store the !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList)
for use in the back end before returning tuples in getNext() (see the sketch following this table)||
  || No equivalent method || relativeToAbsolutePath() || !LoadFunc || Pig runtime will call
this method to allow the Loader to convert a relative load location to an absolute location.
The default implementation provided in !LoadFunc handles this for hdfs files and directories.
If the load source is something else, the loader implementation may choose to override this.||
  || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was used by old
code to ask the loader to provide a schema for the data returned by it - the same semantics
are now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader cannot return a schema for the data ||
  || fieldsToRead() || pushProjection() || !LoadPushDown || fieldsToRead() was used by old code
to convey to the loader the exact fields required by the pig script - the same semantics are
now achieved through pushProjection() of the !LoadPushDown interface. !LoadPushDown is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader is not capable of returning just the required fields and
will return all fields in the data. If a loader implementation is able to efficiently return
only required fields, it should implement !LoadPushDown to improve query performance||
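
The setUDFContextSignature()/!UDFContext pattern mentioned above can look roughly like the
sketch below. This is a hedged illustration, not the page's example: the class name
ProjectionAwareLoader, the property key, and storing the projection as a comma-separated
list of indexes are assumptions (a real loader may serialize the !RequiredFieldList itself).

{{{
// Sketch of the setUDFContextSignature() / UDFContext pattern: the projection pushed via
// LoadPushDown.pushProjection() is stored on the front end and read back on the back end.
// Class name and property key are illustrative assumptions, not Pig API names.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.pig.LoadFunc;
import org.apache.pig.LoadPushDown;
import org.apache.pig.impl.util.UDFContext;

public abstract class ProjectionAwareLoader extends LoadFunc implements LoadPushDown {

    private static final String PROJECTION_KEY = "projection.indexes"; // illustrative key
    private String signature;

    @Override
    public void setUDFContextSignature(String signature) {
        // Called by Pig on both the front end and the back end with the same value,
        // so it can be used to key per-loader state stored in the UDFContext.
        this.signature = signature;
    }

    @Override
    public List<OperatorSet> getFeatures() {
        return Arrays.asList(OperatorSet.PROJECTION);
    }

    @Override
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList) {
        // Front end: remember which field indexes the script actually needs.
        StringBuilder indexes = new StringBuilder();
        for (RequiredField field : requiredFieldList.getFields()) {
            if (indexes.length() > 0) {
                indexes.append(',');
            }
            indexes.append(field.getIndex());
        }
        getContextProperties().setProperty(PROJECTION_KEY, indexes.toString());
        return new RequiredFieldResponse(true);
    }

    protected String getPushedProjection() {
        // Back end (e.g. from getNext()): retrieve what was stored above.
        return getContextProperties().getProperty(PROJECTION_KEY);
    }

    private Properties getContextProperties() {
        return UDFContext.getUDFContext()
                .getUDFProperties(this.getClass(), new String[] { signature });
    }
}
}}}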
@@ -35, +35 @@

  
  == Table mapping old API calls to new API calls in rough order of call sequence ==
  || '''Old Method in !StoreFunc''' || '''Equivalent New Method''' || '''New Class/Interface
in which method is present''' || '''Explanation''' ||
- || No equivalent method || setStoreFuncUDFContextSignature() || !StoreFunc || This method
will be called by Pig both in the front end and back end to pass a unique signature to the
Storer. The signature can be used to store into the UDFContext} any information which the
Storer needs to store between various method invocations in the front end and back end.||
+ || No equivalent method || setStoreFuncUDFContextSignature() || !StoreFunc || This method
will be called by Pig both in the front end and back end to pass a unique signature to the
Storer. The signature can be used to store into the !UDFContext any information which the Storer
needs to store between various method invocations in the front end and back end.||
- || No equivalent method || relToAbsPathForStoreLocation() || !StoreFunc || Pig runtime will
call this method to allow the Storer to convert a relative load location to an absolute location.
An implementation is provided in !LoadFunc (as a static method) which handles this for hdfs
files and directories.||
+ || No equivalent method || relToAbsPathForStoreLocation() || !StoreFunc || Pig runtime will
call this method to allow the Storer to convert a relative store location to an absolute location.
An implementation is provided in !LoadFunc (as a static method) which handles this for hdfs
files and directories.||
- || No equivalent method || checkSchema() || !StoreFunc || A Store function should implement
this function to check that a given schema is acceptable to it ||
+ || No equivalent method || checkSchema() || !StoreFunc || A Store function should implement
this function to check that a given schema describing the data to be written is acceptable
to it ||
  || No equivalent method || setStoreLocation() || !StoreFunc || This method is called by
Pig to communicate the store location to the storer. The storer should use this method to
communicate the same information to the underlying !OutputFormat. This method is called multiple
times by pig - implementations should bear this in mind and ensure there are no inconsistent
side effects due to the multiple calls.||
- || getStorePreparationClass() || getOutputFormat() || !StoreFunc ||In the old API, getStorePreparationClass()
was the means by which the implementation could communicate to Pig the !OutputFormat to use
for writing - this is now achieved through getOutputFormat(). getOutputFormat() is NOT an
optional method and implementation SHOULD provide an !OutputFormat to use. The methods in
the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by pig
in the same manner (and in the same context) as by Hadoop in a map-reduce java program.||
+ || getStorePreparationClass() || getOutputFormat() || !StoreFunc || In the old API, getStorePreparationClass()
was the means by which the implementation could communicate to Pig the !OutputFormat to use
for writing - this is now achieved through getOutputFormat(). getOutputFormat() is NOT an
optional method and the implementation SHOULD provide an !OutputFormat to use. The methods in
the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by pig
in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. The
checkOutputSpecs() method of the !OutputFormat will be called by pig to check the output location
up-front. This method will also be called as part of the Hadoop call sequence when the job
is launched. So implementations should ensure that this method can be called multiple times
without inconsistent side effects.||
  || bindTo() || prepareToWrite() || !StoreFunc || bindTo() was the old method which would
provide an !OutputStream among other things to the !StoreFunc. The !StoreFunc implementation
would then write to the !OutputStream in putNext(). In the new API, writing of the data is
through the !OutputFormat provided by the !StoreFunc. So the equivalent call is prepareToWrite()
wherein the !RecordWriter associated with the !OutputFormat provided by the !StoreFunc is
passed to the !StoreFunc. The !RecordWriter can then be used by the implementation in putNext()
to write a tuple representing a record of data in a manner expected by the !RecordWriter.
||
  || putNext() || putNext() || !StoreFunc || The meaning of putNext() has not changed and
is called by Pig runtime to write the next tuple of data - in the new API, this is the method
wherein the implementation will use the underlying !RecordWriter to write the Tuple out (see
the storer sketch following this table) ||
  || finish() || no equivalent method in !StoreFunc - implementations can use close() in !RecordWriter
or commitTask() in !OutputCommitter || !RecordWriter or !OutputCommitter || finish() has been
removed from !StoreFunc since the same semantics can be achieved by !RecordWriter.close()
or !OutputCommitter.commitTask() - in the latter case !OutputCommitter.needsTaskCommit() should
return true.||
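
Putting the storer-side calls together, a minimal storer in the new API might look like the
following sketch. It is a hedged illustration, not the page's worked example: the class name
SimpleTextStorer, the use of Hadoop's TextOutputFormat, and the tab-delimited rendering of
tuples are assumptions.

{{{
// Sketch of a storer built on the new API, delegating all writing to Hadoop's
// TextOutputFormat. Class name and delimiter are illustrative assumptions.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class SimpleTextStorer extends StoreFunc {
    private RecordWriter<WritableComparable, Text> writer;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Pig calls the OutputFormat (and its RecordWriter/OutputCommitter)
        // exactly as Hadoop would in a plain map-reduce job.
        return new TextOutputFormat<WritableComparable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // Pass the store location on to the underlying OutputFormat.
        // This may be called multiple times, so it must be side-effect free.
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        // Pig hands us the RecordWriter of the OutputFormat returned above.
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple tuple) throws IOException {
        try {
            // Render the tuple as a tab-delimited line; a real storer would
            // handle nested types (bags, maps) properly.
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < tuple.size(); i++) {
                if (i > 0) {
                    line.append('\t');
                }
                Object field = tuple.get(i);
                line.append(field == null ? "" : field.toString());
            }
            writer.write(null, new Text(line.toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
}}}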
