pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath
Date Wed, 10 Feb 2010 21:54:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=9&rev2=10

--------------------------------------------------

  
  The main change is that the new !LoadFunc API is based on an !InputFormat to read the data.
Implementations can choose to use existing !InputFormats like !TextInputFormat or implement
a new one.
   
- == Table mapping old API calls to new API calls ==
+ == Table mapping old API calls to new API calls in rough order of call sequence ==
  || '''Old Method in !LoadFunc''' || '''Equivalent New Method''' || '''New Class/Interface
in which method is present''' || '''Explanation''' ||
+ || No equivalent method || setUDFContextSignature() || !LoadFunc || This method will be
called by Pig both in the front end and back end to pass a unique signature to the Loader.
The signature can be used to store in the UDFContext any information which the Loader needs
to keep between various method invocations in the front end and back end. A use case is to
store the !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList)
for use in the back end before returning tuples in getNext() ||
+ || No equivalent method || relativeToAbsolutePath() || !LoadFunc || Pig runtime will call
this method to allow the Loader to convert a relative load location to an absolute location.
The default implementation provided in !LoadFunc handles this for hdfs files and directories.
If the load source is something else, loader implementation may choose to override this.||
+ || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was used by old
code to ask the loader to provide a schema for the data returned by it - the same semantics
are now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader cannot return a schema for the data ||
+ || fieldsToRead() || pushProjection() || !LoadPushDown || fieldsToRead() was used by old code
to convey to the loader the exact fields required by the pig script - the same semantics are
now achieved through pushProjection() of the !LoadPushDown interface. !LoadPushDown is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader is not capable of returning just the required fields and
will return all fields in the data. If a loader implementation is able to efficiently return
only required fields, it should implement !LoadPushDown to improve query performance ||
+ || No equivalent method || getInputFormat() ||!LoadFunc ||  This method will be called by
Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying
!RecordReader) will be called by pig in the same manner (and in the same context) as by Hadoop
in a map-reduce java program.||
+ || No equivalent method || setLocation() || !LoadFunc || This method is called by Pig to
communicate the load location to the loader. The loader should use this method to communicate
the same information to the underlying !InputFormat. This method is called multiple times
by Pig, so implementations should ensure that the repeated calls have no inconsistent side
effects. ||
  || bindTo() || prepareToRead() || !LoadFunc || bindTo() was the old method which would provide
an !InputStream among other things to the !LoadFunc. The !LoadFunc implementation would then
read from the !InputStream in getNext(). In the new API, reading of the data is through the
!InputFormat provided by the !LoadFunc. So the equivalent call is prepareToRead() wherein
the !RecordReader associated with the !InputFormat provided by the !LoadFunc is passed to
the !LoadFunc. The !RecordReader can then be used by the implementation in getNext() to return
a tuple representing a record of data back to pig. ||
  || getNext() || getNext() || !LoadFunc || The meaning of getNext() has not changed: it is
called by the Pig runtime to get the next tuple in the data ||
  || bytesToInteger(),...bytesToBag() ||  bytesToInteger(),...bytesToBag() || !LoadCaster
|| The meaning of these methods has not changed: they are called by the Pig runtime to cast
!DataByteArray fields to the right type when needed. In the new API, a !LoadFunc implementation
should give a !LoadCaster object back to pig as the return value of the getLoadCaster() method
so that it can be used for casting. If null is returned, then casting from !DataByteArray to
any other type (implicitly or explicitly) in the pig script will not be possible ||
- || fieldsToRead() || pushProject() || !LoadPushDown || fieldsToRead() was used by old code
to convey to the loader the exact fields required by the pig script -the same semantics are
now achieved through pushProject() of the !LoadPushDown interface. !LoadPushDown is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader is not capable of returning just the required fields and
will return all fields in the data. If a loader implementation is able to efficiently return
only required fields, it should implement !LoadPushDown to improve query performance||
- || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was used by old
code to ask the loader to provide a schema for the data returned by it - the same semantics
are now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional
interface for loaders to implement - if a loader does not implement it, this will indicate
to the pig runtime that the loader cannot return a schema for the data ||
- || No equivalent method || relativeToAbsolutePath() || Pig runtime will call this method
to allow the Loader to convert a relative load location to an absolute location. The default
implementation provided in !LoadFunc handles this for hdfs files and directories. If the load
source is something else, loader implementation may choose to override this.||
- || No equivalent method || getInputFormat() || This method will be called by Pig to get
the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader)
will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce
java program.||
- || No equivalent method || setLocation() || This method is called by Pig to communicate
the load location to the loader. The loader should use this method to communicate the same
information to the underlying !InputFormat. This method is called multiple times by pig -
implementations should bear in mind that this method is called multiple times and should ensure
there are no inconsistent side effects due to the multiple calls.||
- || No equivalent method || setUDFContextSignature() || This method will be called by Pig
both in the front end and back end to pass a unique signature to the Loader. The signature
can be used to store into the UDFContext} any information which the Loader needs to store
between various method invocations in the front end and back end. A use case is to store RequiredFieldList
passed to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before
returning tuples in getNext()||
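
The setUDFContextSignature() use case in the table can be illustrated with a minimal sketch. It does not use the real Pig classes: SketchUdfContext and ProjectingLoader are hypothetical stand-ins, the "required.fields" key is made up, and pushProjection() here takes a plain list of column indexes instead of a !RequiredFieldList. A shared in-memory map stands in for whatever mechanism carries UDFContext data from the front end to the back end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for Pig's UDFContext: one Properties object per signature.
class SketchUdfContext {
    private static final Map<String, Properties> STORE = new HashMap<>();

    static Properties propertiesFor(String signature) {
        return STORE.computeIfAbsent(signature, s -> new Properties());
    }
}

// Hypothetical loader fragment; only the method names follow the table above.
class ProjectingLoader {
    private static final String REQUIRED_KEY = "required.fields"; // assumed key name
    private String signature;

    // Called in both the front end and the back end with the same signature.
    void setUDFContextSignature(String signature) {
        this.signature = signature;
    }

    // Front end: remember which column indexes the script needs.
    void pushProjection(List<Integer> requiredFields) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < requiredFields.size(); i++) {
            if (i > 0) {
                joined.append(',');
            }
            joined.append(requiredFields.get(i));
        }
        SketchUdfContext.propertiesFor(signature).setProperty(REQUIRED_KEY, joined.toString());
    }

    // Back end: keep only the remembered columns before returning a record.
    List<String> project(List<String> fullRecord) {
        String stored = SketchUdfContext.propertiesFor(signature).getProperty(REQUIRED_KEY);
        if (stored == null || stored.isEmpty()) {
            return fullRecord; // nothing was pushed down, so return all fields
        }
        List<String> projected = new ArrayList<>();
        for (String index : stored.split(",")) {
            projected.add(fullRecord.get(Integer.parseInt(index)));
        }
        return projected;
    }

    public static void main(String[] args) {
        ProjectingLoader frontEnd = new ProjectingLoader();
        frontEnd.setUDFContextSignature("loader-1");
        frontEnd.pushProjection(Arrays.asList(0, 2));

        // A separate instance models the back-end copy of the same loader.
        ProjectingLoader backEnd = new ProjectingLoader();
        backEnd.setUDFContextSignature("loader-1");
        System.out.println(backEnd.project(Arrays.asList("a", "b", "c")));
    }
}
```

Keying the stored state by signature is what keeps two loaders in the same script from reading each other's field lists.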
  
-  An example of how a simple !LoadFunc implementation based on old interface can be converted
to the new interfaces will be shown below. The loader implementation in the example is a loader
for text data with line delimiter as '\n' and '\t' as default field delimiter (which can be
overridden by passing a different field delimiter in the constructor) - this is similar to
current !PigStorage loader in Pig.
+  An example of how a simple !LoadFunc implementation based on the old interface can be converted
to the new interfaces is shown below. The loader implementation in the example is a loader
for text data with '\n' as the line delimiter and '\t' as the default field delimiter (which can be
overridden by passing a different field delimiter in the constructor) - this is similar to the
current !PigStorage loader in Pig. The new implementation uses an existing Hadoop supported
!InputFormat, !TextInputFormat, as the underlying !InputFormat.
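
As a rough orientation, the call order from the table can be sketched for such a tab-delimited loader. This is only a sketch under stated assumptions, not the Pig API: SketchLoadFunc, TabLoader and CallSequenceDemo are hypothetical names, a BufferedReader stands in for the !RecordReader, a list of strings stands in for a Tuple, and getInputFormat(), relativeToAbsolutePath() and the optional interfaces are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the new LoadFunc contract (NOT the real Pig class).
// Method names follow the table; parameter and return types are simplified.
abstract class SketchLoadFunc {
    abstract void setUDFContextSignature(String signature);
    abstract void setLocation(String location);
    abstract void prepareToRead(BufferedReader reader); // stand-in for a RecordReader
    abstract List<String> getNext() throws IOException; // stand-in for a Tuple
}

class TabLoader extends SketchLoadFunc {
    private final String delimiter;
    private String signature;
    private String location;
    private BufferedReader reader;

    TabLoader() {
        this("\t"); // '\t' is the default field delimiter
    }

    TabLoader(String delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    void setUDFContextSignature(String signature) {
        this.signature = signature;
    }

    // Plain assignment, so repeated calls leave the loader in the same state.
    @Override
    void setLocation(String location) {
        this.location = location;
    }

    String getLocation() {
        return location;
    }

    @Override
    void prepareToRead(BufferedReader reader) {
        this.reader = reader;
    }

    // Returns the fields of the next line, or null when the input is exhausted.
    @Override
    List<String> getNext() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }
}

class CallSequenceDemo {
    // Drives the loader in the rough order given in the table.
    static List<List<String>> run(String data) {
        TabLoader loader = new TabLoader();
        loader.setUDFContextSignature("sig-1");
        loader.setLocation("/data/in");
        loader.setLocation("/data/in"); // called more than once, as the table warns
        loader.prepareToRead(new BufferedReader(new StringReader(data)));
        List<List<String>> rows = new ArrayList<>();
        try {
            for (List<String> row = loader.getNext(); row != null; row = loader.getNext()) {
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(run("a\tb\nc\td\n"));
    }
}
```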
  
  == Old Implementation ==
  {{{
