pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "Pig070LoadStoreHowTo" by AlanGates
Date Wed, 06 Oct 2010 23:44:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "Pig070LoadStoreHowTo" page has been changed by AlanGates.
http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=14&rev2=15

--------------------------------------------------

  '''A general note applicable to both !LoadFunc and !StoreFunc implementations: the implementation should use the new Hadoop 0.20 API based classes (!InputFormat/!OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
  
  = How to implement a Loader =
 [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup | LoadFunc]] is the abstract class with the main methods for loading data, and for most use cases extending it suffices. There are three other optional interfaces which can be implemented for extended functionality:
  * [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup | LoadMetadata]] has methods to deal with metadata. Most loader implementations don't need to implement this unless they interact with a metadata system. The getSchema() method in this interface gives a loader implementation a way to communicate the schema of the data back to Pig. If a loader implementation returns data comprised of fields of real types (rather than !DataByteArray fields), it should provide the schema describing that data through getSchema(). The other methods deal with other kinds of metadata, such as partition keys and statistics; implementations can return null from these methods if they are not applicable.
  * [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup | LoadPushDown]] has methods to push operations from the Pig runtime into loader implementations. Currently only projections are pushed, i.e., the pushProjection() method is called by Pig to tell the loader exactly which fields are required in the Pig script. The loader implementation can choose to honor the request, or respond that it will not and return all fields in the data. If a loader implementation is able to efficiently return only the required fields, it should implement !LoadPushDown to improve query performance. (Whether or not the implementation can return only the required fields, if it also implements getSchema(), the schema returned by getSchema() should describe the entire tuple of data.)
  * [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup | LoadCaster]] has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.
  
 The !LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below:
  * getInputFormat(): This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and the underlying !RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. If the !InputFormat is a Hadoop-packaged one, the implementation should use the new-API-based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based or file-based !InputFormat wants to read files in all subdirectories under a given input directory recursively, it should use the !PigFileInputFormat and !PigTextInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This works around the current limitation in Hadoop's !TextInputFormat and !FileInputFormat, which read only one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirectories 'dir2' and 'dir2/dir3' underneath it, Hadoop's !TextInputFormat or !FileInputFormat will read only the files directly under 'dir1'. Using !PigFileInputFormat or !PigTextInputFormat (or by extending them), files in all the directories can be read.
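The one-level-versus-recursive listing behavior described above can be demonstrated with plain Java (a standalone sketch of the file-system behavior, not Pig or Hadoop code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Standalone demo of the listing limitation: a one-level listing of dir1
// misses files under dir1/dir2 and dir1/dir2/dir3, while a recursive walk
// (the behavior PigFileInputFormat effectively provides) finds all of them.
public class RecursiveListingDemo {
    public static long oneLevelCount(Path dir) {
        try (Stream<Path> s = Files.list(dir)) {          // one level only
            return s.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static long recursiveCount(Path dir) {
        try (Stream<Path> s = Files.walk(dir)) {          // full recursive walk
            return s.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds dir1/a.txt, dir1/dir2/b.txt, dir1/dir2/dir3/c.txt in a temp area.
    public static Path buildExample() {
        try {
            Path dir1 = Files.createTempDirectory("dir1");
            Path dir3 = Files.createDirectories(dir1.resolve("dir2").resolve("dir3"));
            Files.write(dir1.resolve("a.txt"), "top".getBytes());
            Files.write(dir1.resolve("dir2").resolve("b.txt"), "mid".getBytes());
            Files.write(dir3.resolve("c.txt"), "deep".getBytes());
            return dir1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir1 = buildExample();
        System.out.println("one level: " + oneLevelCount(dir1));   // only a.txt
        System.out.println("recursive: " + recursiveCount(dir1));  // all three files
    }
}
```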
@@ -143, +143 @@

  }}}
  
  = How to implement a Storer =
 [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup | StoreFunc]] is the abstract class with the main methods for storing data, and for most use cases extending it suffices. There is one optional interface which can be implemented for extended functionality:
  * [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/StoreMetadata.java?view=markup | StoreMetadata]]: This interface has methods to interact with metadata systems to store the schema and statistics. It is truly optional and should only be implemented if metadata needs to be stored.
  
 The methods which need to be overridden in !StoreFunc are explained below:
  * getOutputFormat(): This method will be called by Pig to get the !OutputFormat used by the storer. The methods in the !OutputFormat (and the underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. If the !OutputFormat is a Hadoop-packaged one, the implementation should use the new-API-based one under org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure it can be called multiple times without inconsistent side effects.
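The repeatability requirement on checkOutputSpecs() can be sketched with a standalone helper (hypothetical, not Hadoop's actual !OutputFormat API, which throws IOException): the check only reads state and fails fast, so calling it any number of times has no side effects.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of an idempotent output check in the spirit of
// checkOutputSpecs(). The real method throws IOException; an unchecked
// exception is used here only to keep the sketch self-contained.
public class OutputSpecCheck {
    public static void checkOutputSpec(Path outputDir) {
        if (Files.exists(outputDir)) {
            throw new IllegalStateException("Output location already exists: " + outputDir);
        }
        // Deliberately no mutation here (no directory creation): because the
        // check only reads state, Pig's up-front call and Hadoop's launch-time
        // call can both run without inconsistent side effects.
    }

    public static void main(String[] args) {
        Path out = Paths.get("job-output-sketch");
        checkOutputSpec(out);  // callable any number of times with the same result
        checkOutputSpec(out);
        System.out.println("output spec ok for " + out);
    }
}
```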
