pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "EmbeddedPig" by OlgaN
Date Wed, 07 Nov 2007 21:22:43 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/EmbeddedPig

New page:
== Embedding Pig In Java Programs == 

Sometimes you want more control than Pig scripts can give you. If so, you can embed Pig Latin
in Java (just like SQL can be embedded in programs using JDBC). 

The following steps need to be carried out:

 * Make sure `pig.jar` is on your classpath.
 * Create an instance of `PigServer`.
 * Issue commands through that PigServer by calling `PigServer.registerQuery()`.  
 * To retrieve results, either call `PigServer.openIterator()` or `PigServer.store()`.
 * If you have user defined functions, register them by calling `PigServer.registerJar()`.

=== Example ===

Let's assume that you need to count the number of occurrences of each word in a document. Let's
also assume that you have an EvalFunction `Tokenize` that parses a line of text and returns all
the words in that line. The function is located in `/mylocation/tokenize.jar`.

The Pig Latin script for this computation looks as follows:

{{{
register /mylocation/tokenize.jar
A = load 'mytext' using TextLoader();
B = foreach A generate flatten(tokenize($0));
C = group B by $1;
D = foreach C generate flatten(group), COUNT(B.$0);
store D into 'myoutput';
}}}
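
To make the dataflow of the script explicit, here is a rough plain-Java sketch of the same computation with no Pig involved. The class name `WordCountDataflow` is hypothetical, and splitting on whitespace is only an assumption standing in for `Tokenize`, whose actual behavior is not shown on this page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the dataflow in the script above; no Pig involved.
class WordCountDataflow {

    // Counts occurrences of each word across all lines.
    // Whitespace splitting is an assumption standing in for the Tokenize function.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();          // sorted keys for stable output
        for (String line : lines) {                             // A: one record per line
            for (String word : line.trim().split("\\s+")) {     // B: flatten the words of each line
                if (word.isEmpty()) {
                    continue;                                   // skip blank lines
                }
                counts.merge(word, 1, Integer::sum);            // C and D: group by word, then count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog");
        System.out.println(countWords(lines));
    }
}
```

The comments map each step to the alias (`A` through `D`) that plays the corresponding role in the script.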

The same can be accomplished with the following Java program:

{{{

import java.io.IOException;
import org.apache.pig.PigServer;

public class WordCount {
   public static void main(String[] args) {
      
      PigServer pigServer = new PigServer();
        
      try {
         pigServer.registerJar("/mylocation/tokenize.jar");
         runMyQuery(pigServer, "myinput.txt");
      } catch (IOException e) {
         e.printStackTrace();
      }
   }
   
   public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {
       pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
       pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
       pigServer.registerQuery("C = group B by $1;");
       pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");
      
       pigServer.store("D", "myoutput");
   }
}

}}}

Notes:

 * The jar which contains your functions must be registered.
 * The four calls to `pigServer.registerQuery()` simply cause the query to be parsed and enqueued.
The query is not actually executed until `pigServer.store()` is called.
 * The input data referred to in the load statement must be on DFS in the specified location.
 * The final result is placed into the `myoutput` file in your current working directory on
DFS. (By default this is your home directory on DFS.)
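
The deferred execution described in the notes can be pictured with a small toy model. This is not how Pig is implemented: the class `DeferredQueryModel` and its methods are hypothetical and only mimic the pattern of queueing statements until a store is requested.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of deferred execution. This is NOT Pig's implementation;
// all names here are hypothetical and only mimic the pattern.
class DeferredQueryModel {
    private final List<String> pending = new ArrayList<>();   // accepted but not yet run
    private final List<String> executed = new ArrayList<>();  // statements that have been run

    // Plays the role of registerQuery(): remember the statement, do no work yet.
    void register(String statement) {
        pending.add(statement);
    }

    // Plays the role of store(): only now is everything that was queued run.
    void store(String alias, String output) {
        executed.addAll(pending);
        pending.clear();
        executed.add("store " + alias + " into '" + output + "';");
    }

    int pendingCount() {
        return pending.size();
    }

    int executedCount() {
        return executed.size();
    }

    public static void main(String[] args) {
        DeferredQueryModel model = new DeferredQueryModel();
        model.register("A = load 'myinput.txt' using TextLoader();");
        model.register("B = foreach A generate flatten(tokenize($0));");
        System.out.println("executed before store: " + model.executedCount());
        model.store("B", "myoutput");
        System.out.println("executed after store: " + model.executedCount());
    }
}
```

In this model nothing is counted as executed until `store` is called, which is the behavior the notes attribute to `pigServer.store()`.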

To run your program, you need to first compile it by using the following command:

{{{
javac -cp <path>pig.jar WordCount.java
}}}

If the compilation is successful, you can then run your program:

{{{
java -cp <path>pig.jar:. WordCount
}}}
