From: Apache Wiki
To: Apache Wiki
Date: Sun, 21 Feb 2010 07:52:03 -0000
Subject: [Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=26&rev2=27

--------------------------------------------------

This page describes how to migrate from the old !LoadFunc and !StoreFunc interfaces (Pig 0.1.0 through Pig 0.6.0) to the new interfaces proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal and planned for release in Pig 0.7.0. Besides the example on this page, users can also look at the !LoadFunc and !StoreFunc implementations in the piggybank codebase (contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage) for examples of migration. For example, !MultiStorage implements a custom !OutputFormat.

- '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) in org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
+ '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''

The main motivation for these changes is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way Pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record now lies in the !InputFormat; likewise, on the writing end, the complexity of writing lies in the !OutputFormat. This enables Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat are available for them.
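To make the load-side contract concrete, here is a minimal sketch of a loader written against the new API, delegating line reading to Hadoop's new-API !TextInputFormat and splitting each line on tabs. The class and field names (SimpleTextLoader, in) are illustrative, not part of Pig; only the overridden methods come from the new !LoadFunc contract, and this sketch has not been compiled against a specific Pig release.

```java
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative loader: reads tab-delimited text via Hadoop's new-API TextInputFormat.
public class SimpleTextLoader extends LoadFunc {
    private RecordReader<?, ?> in;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    // Pig asks the loader which (new-API) InputFormat to use for reading.
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    // Called (possibly multiple times) to pass the load location on to the InputFormat.
    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    // Replaces the old bindTo(): Pig hands over the RecordReader it created.
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.in = reader;
    }

    // Same role as in the old API: turn the next record into a Tuple, or
    // return null at end of input.
    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!in.nextKeyValue()) {
                return null;
            }
            String line = in.getCurrentValue().toString();
            ArrayList<Object> fields = new ArrayList<Object>();
            for (String f : line.split("\t", -1)) {
                fields.add(f);
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

Note how all the reading machinery lives in the !InputFormat/!RecordReader; the loader itself only wires them to Pig and converts records to tuples.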
@@ -21, +21 @@

||No equivalent method ||relativeToAbsolutePath() ||!LoadFunc ||Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in !LoadFunc handles this for hdfs files and directories. If the load source is something else, the loader implementation may choose to override this. ||
||determineSchema() ||getSchema() ||!LoadMetadata ||determineSchema() was used by old code to ask the loader to provide a schema for the data returned by it; the same semantics are now achieved through getSchema() of the !LoadMetadata interface. !LoadMetadata is an optional interface for loaders to implement; if a loader does not implement it, this indicates to the Pig runtime that the loader cannot return a schema for the data. ||
||fieldsToRead() ||pushProject() ||!LoadPushDown ||fieldsToRead() was used by old code to convey to the loader the exact fields required by the Pig script; the same semantics are now achieved through pushProject() of the !LoadPushDown interface. !LoadPushDown is an optional interface for loaders to implement; if a loader does not implement it, this indicates to the Pig runtime that the loader is not capable of returning just the required fields and will return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance. ||
- ||No equivalent method ||getInputFormat() ||!LoadFunc ||This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !InputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''||
+ ||No equivalent method ||getInputFormat() ||!LoadFunc ||This method will be called by Pig to get the !InputFormat used by the loader. The methods in the !InputFormat (and underlying !RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''||
||No equivalent method ||setLocation() ||!LoadFunc ||This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. Pig calls this method multiple times; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls. ||
||bindTo() ||prepareToRead() ||!LoadFunc ||bindTo() was the old method which would provide an !InputStream, among other things, to the !LoadFunc. The !LoadFunc implementation would then read from the !InputStream in getNext(). In the new API, reading of the data is through the !InputFormat provided by the !LoadFunc. So the equivalent call is prepareToRead(), wherein the !RecordReader associated with the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The !RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig. ||
||getNext() ||getNext() ||!LoadFunc ||The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data. In the new API, this is the method wherein the implementation will use the underlying !RecordReader and construct a tuple. ||

@@ -39, +39 @@

||No equivalent method ||relToAbsPathForStoreLocation() ||!StoreFunc ||Pig runtime will call this method to allow the Storer to convert a relative store location to an absolute location. An implementation is provided in !LoadFunc (as a static method) which handles this for hdfs files and directories. ||
||No equivalent method ||checkSchema() ||!StoreFunc ||A Store function should implement this function to check that a given schema describing the data to be written is acceptable to it. ||
||No equivalent method ||setStoreLocation() ||!StoreFunc ||This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying !OutputFormat. Pig calls this method multiple times; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls. ||
- ||getStorePreparationClass() ||getOutputFormat() ||!StoreFunc ||In the old API, getStorePreparationClass() was the means by which the implementation could communicate to Pig the !OutputFormat to use for writing; this is now achieved through getOutputFormat(). getOutputFormat() is NOT an optional method and implementations SHOULD provide an !OutputFormat to use. The methods in the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !OutputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce'''. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that this method can be called multiple times without inconsistent side effects. ||
+ ||getStorePreparationClass() ||getOutputFormat() ||!StoreFunc ||In the old API, getStorePreparationClass() was the means by which the implementation could communicate to Pig the !OutputFormat to use for writing; this is now achieved through getOutputFormat(). getOutputFormat() is NOT an optional method and implementations SHOULD provide an !OutputFormat to use. The methods in the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. '''If the !OutputFormat is a Hadoop packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce'''. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that this method can be called multiple times without inconsistent side effects. ||
||bindTo() ||prepareToWrite() ||!StoreFunc ||bindTo() was the old method which would provide an !OutputStream, among other things, to the !StoreFunc. The !StoreFunc implementation would then write to the !OutputStream in putNext(). In the new API, writing of the data is through the !OutputFormat provided by the !StoreFunc. So the equivalent call is prepareToWrite(), wherein the !RecordWriter associated with the !OutputFormat provided by the !StoreFunc is passed to the !StoreFunc. The !RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the !RecordWriter. ||
||putNext() ||putNext() ||!StoreFunc ||The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying !RecordWriter to write the Tuple out. ||
||finish() ||No equivalent method in !StoreFunc; implementations can use commitTask() in !OutputCommitter ||!OutputCommitter ||finish() has been removed from !StoreFunc since the same semantics can be achieved by !OutputCommitter.commitTask() (!OutputCommitter.needsTaskCommit() should return true to be able to use commitTask()). ||
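The store-side mapping above can be sketched the same way. The following hypothetical storer delegates writing to Hadoop's new-API !TextOutputFormat and emits each tuple as a tab-delimited line. The class name SimpleTextStorer is illustrative, not part of Pig; only the overridden methods come from the new !StoreFunc contract, and this sketch has not been compiled against a specific Pig release.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

// Illustrative storer: writes tuples as tab-delimited lines via Hadoop's
// new-API TextOutputFormat.
public class SimpleTextStorer extends StoreFunc {
    private RecordWriter<WritableComparable, Text> writer;

    // getOutputFormat() is mandatory: Pig uses the returned (new-API)
    // OutputFormat for all writing.
    @Override
    public OutputFormat getOutputFormat() throws IOException {
        return new TextOutputFormat<WritableComparable, Text>();
    }

    // Called (possibly multiple times) to pass the store location on to the
    // OutputFormat; must tolerate repeated calls without side effects.
    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    // Replaces the old bindTo(): Pig hands over the RecordWriter it created.
    @Override
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    // Same role as in the old API: write one tuple in the form the
    // RecordWriter expects.
    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            writer.write(null, new Text(t.toDelimitedString("\t")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

As on the load side, the writing machinery lives entirely in the !OutputFormat/!RecordWriter; there is no finish() here because task commit is handled by the !OutputCommitter.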