From: Apache Wiki
To: Apache Wiki
Date: Sun, 21 Feb 2010 07:52:03 -0000
Subject: [Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=26&rev2=27

--------------------------------------------------

This page describes how to migrate from the old !LoadFunc and !StoreFunc interfaces (Pig 0.1.0 through Pig 0.6.0) to the new interfaces proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal and planned for release in Pig 0.7.0. Besides the example on this page, users can also look at the !LoadFunc and !StoreFunc implementations in the piggybank codebase (contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage) for examples of migration. For example, !MultiStorage implements a custom !OutputFormat.

- '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) in org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
+ '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''

The main motivation for these changes is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way Pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record now lies in the !InputFormat; likewise, on the writing end, the complexity of writing lies in the !OutputFormat. This enables Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat are available for them.
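To make the load-side contract concrete, here is a minimal sketch of a loader written against the new API, delegating line reading to Hadoop's new-API !TextInputFormat and splitting each line on tabs. The class and field names (SimpleTextLoader, in) are illustrative, not part of Pig; only the overridden methods come from the new !LoadFunc contract, and this sketch has not been compiled against a specific Pig release.

```java
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative loader: reads tab-delimited text via Hadoop's new-API TextInputFormat.
public class SimpleTextLoader extends LoadFunc {
    private RecordReader<?, ?> in;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    // Pig asks the loader which (new-API) InputFormat to use for reading.
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    // Called (possibly multiple times) to pass the load location on to the InputFormat.
    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    // Replaces the old bindTo(): Pig hands over the RecordReader it created.
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.in = reader;
    }

    // Same role as in the old API: turn the next record into a Tuple, or
    // return null at end of input.
    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!in.nextKeyValue()) {
                return null;
            }
            String line = in.getCurrentValue().toString();
            ArrayList<Object> fields = new ArrayList<Object>();
            for (String f : line.split("\t", -1)) {
                fields.add(f);
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

Note how all the reading machinery lives in the !InputFormat/!RecordReader; the loader itself only wires them to Pig and converts records to tuples.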
@@ -21, +21 @@

||No equivalent method ||relativeToAbsolutePath() ||!LoadFunc ||Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in !LoadFunc handles this for hdfs files and directories. If the load source is something else, the loader implementation may choose to override this. ||
||determineSchema() ||getSchema() ||!LoadMetadata ||determineSchema() was used by old code to ask the loader to provide a schema for the data returned by it; the same semantics are now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional interface for loaders to implement; if a loader does not implement it, this indicates to the Pig runtime that the loader cannot return a schema for the data. ||
||fieldsToRead() ||pushProject() ||!LoadPushDown ||fieldsToRead() was used by old code to convey to the loader the exact fields required by the Pig script; the same semantics are now achieved through pushProject() of the !LoadPushDown interface. !LoadPushDown is an optional interface for loaders to implement; if a loader does not implement it, this indicates to the Pig runtime that the loader is not capable of returning just the required fields and will return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. ||
- ||No equivalent method ||getInputFormat() ||!LoadFunc ||This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !InputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''||
+ ||No equivalent method ||getInputFormat() ||!LoadFunc ||This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''||
||No equivalent method ||setLocation() ||!LoadFunc ||This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. Pig calls this method multiple times; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls. ||
||bindTo() ||prepareToRead() ||!LoadFunc ||bindTo() was the old method which would provide an !InputStream, among other things, to the !LoadFunc. The !LoadFunc implementation would then read from the !InputStream in getNext(). In the new API, reading of the data is through the !InputFormat provided by the !LoadFunc. So the equivalent call is prepareToRead(), wherein the !RecordReader associated with the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The !RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig. ||
||getNext() ||getNext() ||!LoadFunc ||The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data. In the new API, this is the method wherein the implementation will use the underlying !RecordReader and construct a tuple. ||

@@ -39, +39 @@

||No equivalent method ||relToAbsPathForStoreLocation() ||!StoreFunc ||Pig runtime will call this method to allow the Storer to convert a relative store location to an absolute location. An implementation is provided in !LoadFunc (as a static method) which handles this for hdfs files and directories. ||
||No equivalent method ||checkSchema() ||!StoreFunc ||A Store function should implement this function to check that a given schema describing the data to be written is acceptable to it. ||
||No equivalent method ||setStoreLocation() ||!StoreFunc ||This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying !OutputFormat. Pig calls this method multiple times; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls. ||
- ||getStorePreparationClass() ||getOutputFormat() ||!StoreFunc ||In the old API, getStorePreparationClass() was the means by which the implementation could communicate to Pig the !OutputFormat to use for writing; this is now achieved through getOutputFormat(). getOutputFormat() is NOT an optional method and implementations SHOULD provide an !OutputFormat to use. The methods in the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !OutputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that this method can be called multiple times without inconsistent side effects. ||
+ ||getStorePreparationClass() ||getOutputFormat() ||!StoreFunc ||In the old API, getStorePreparationClass() was the means by which the implementation could communicate to Pig the !OutputFormat to use for writing; this is now achieved through getOutputFormat(). getOutputFormat() is NOT an optional method and implementations SHOULD provide an !OutputFormat to use. The methods in the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !OutputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce'''. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that this method can be called multiple times without inconsistent side effects. ||
||bindTo() ||prepareToWrite() ||!StoreFunc ||bindTo() was the old method which would provide an !OutputStream, among other things, to the !StoreFunc. The !StoreFunc implementation would then write to the !OutputStream in putNext(). In the new API, writing of the data is through the !OutputFormat provided by the !StoreFunc. So the equivalent call is prepareToWrite(), wherein the !RecordWriter associated with the !OutputFormat provided by the !StoreFunc is passed to the !StoreFunc. The !RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the !RecordWriter. ||
||putNext() ||putNext() ||!StoreFunc ||The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying !RecordWriter to write the Tuple out. ||
||finish() ||No equivalent method in !StoreFunc; implementations can use commitTask() in !OutputCommitter ||!OutputCommitter ||finish() has been removed from !StoreFunc since the same semantics can be achieved by !OutputCommitter.commitTask() (!OutputCommitter.needsTaskCommit() should return true to be able to use commitTask()). ||
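The store-side mapping above can be sketched the same way. The following hypothetical storer delegates writing to Hadoop's new-API !TextOutputFormat and emits each tuple as a tab-delimited line. The class name SimpleTextStorer is illustrative, not part of Pig; only the overridden methods come from the new !StoreFunc contract, and this sketch has not been compiled against a specific Pig release.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

// Illustrative storer: writes tuples as tab-delimited lines via Hadoop's
// new-API TextOutputFormat.
public class SimpleTextStorer extends StoreFunc {
    private RecordWriter<WritableComparable, Text> writer;

    // getOutputFormat() is mandatory: Pig uses the returned (new-API)
    // OutputFormat for all writing.
    @Override
    public OutputFormat getOutputFormat() throws IOException {
        return new TextOutputFormat<WritableComparable, Text>();
    }

    // Called (possibly multiple times) to pass the store location on to the
    // OutputFormat; must tolerate repeated calls without side effects.
    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    // Replaces the old bindTo(): Pig hands over the RecordWriter it created.
    @Override
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    // Same role as in the old API: write one tuple in the form the
    // RecordWriter expects.
    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            writer.write(null, new Text(t.toDelimitedString("\t")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

As on the load side, the writing machinery lives entirely in the !OutputFormat/!RecordWriter; there is no finish() here because task commit is handled by the !OutputCommitter.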