pig-user mailing list archives

From Daniel Dai <jiany...@yahoo-inc.com>
Subject Re: decide to use Pig
Date Wed, 01 Dec 2010 00:15:23 GMT
Since I am a Pig developer, I will say "do everything Pig" :).

To be frank, if these 9 functions are all you need, you can easily
convert them to Pig, but you will not gain much if none of the 9
functions can make use of existing UDFs. Here is one way you can do it:

* Write a UDF LineProcess:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class LineProcess extends EvalFunc<DataBag> {
    // declare the operators as members, just as in your Map class
    Operator1 op1 = new Operator1();
    Operator2 op2 = new Operator2();
    // ... and so on up to op9

    @Override
    public DataBag exec(Tuple in) throws IOException {
        String line = (String)in.get(0);

        // initialize all the operators if they are not initialized
        if (!op1.isInitialized())
            op1.initialize();

        if (!op2.isInitialized())
            op2.initialize();

        // ... and so on with all operators

        // process each operator, feeding the output of one into the next
        op1.process(line);
        String[] resultOP1 = op1.getResults();

        op2.process(resultOP1);
        String[][] resultOP2 = op2.getResults();
        // ... and so on with all the operators

        // collect the output of the last operator into a bag of tuples
        DataBag db = BagFactory.getInstance().newDefaultBag();
        for (int i = 0; i < resultOP9.length; i++) {
            Tuple t = TupleFactory.getInstance().newTuple();
            t.append(resultOP9[i]);
            db.add(t);
        }
        return db;
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            DataType.BAG));
    }
}

* Drive it using a Pig script:
a = load '1.txt' as (a0:chararray);
b = foreach a generate flatten(LineProcess(a0));
store b into 'out';
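
(To actually run this, Pig needs to be able to find the UDF class, so you
would normally register the jar containing it first. The jar name and
package below are just placeholders for wherever you build LineProcess:)

register lineprocess.jar;                        -- placeholder jar name
define LineProcess com.example.LineProcess();    -- placeholder package
a = load '1.txt' as (a0:chararray);
b = foreach a generate flatten(LineProcess(a0));
store b into 'out';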

If, going forward, you want to use FILTER/JOIN and other native Pig
functionality, or if you want to break these 9 functions apart and combine
them in a different way, Pig will definitely help.
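
For example, once the UDF output is a relation, filtering it and joining it
against other data is plain Pig Latin (the lookup file and field positions
below are made up purely for illustration):

c = filter b by $0 is not null;
d = load 'lookup.txt' as (key:chararray, label:chararray);
e = join c by $0, d by key;
store e into 'joined_out';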

Daniel

Cornelio Iñigo wrote:
> Hi
>
> I'm getting started with Hadoop and Pig. I have to port a Hadoop MapReduce
> program that I wrote to Pig. In the Hadoop program I have just a Map function,
> and in it I perform all the processing, which consists of analyzing some
> text... For this, 9 functions (operators) are called; these functions run
> sequentially (when the first is done, the second is started, and so on).
> Here is how the map looks:
>
>
>     static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
>
>         //declaration of operators or functions
>         Operator1 op1 = new Operator1();
>         Operator2 op2 = new Operator2();
>         Operator3 op3 = new Operator3();
>         ...
>         ...
>
>         /* map function */
>         public void map(LongWritable key, Text value, Context context)
>                 throws IOException, InterruptedException {
>
>             //get a row from csv
>             String line = value.toString();
>
>             //some code to parse the line
>             ...
>             ...
>
>             //initialize all the operators if they are not initialized
>             if( !op1.isInitialized() )
>                 op1.initialize();
>
>             if( !op2.isInitialized() )
>                 op2.initialize();
>
>             ...
>             ...//and so on with all operators
>
>             //process each operator
>             op1.process(line);
>             String[] resultOP1 = op1.getResults();
>
>             op2.process(resultOP1);
>             String[][] resultOP2 = op2.getResults();
>             ...//and so on with all the operators
>             ...
>
>             //finally collect results
>             String put = "";
>             for( int k = 0; k < resultOP9.length; k++ ){
>                 for( int j = 0; j < resultOP9[k].length; j++ ){
>
>                     context.write...
>                 }
>             }
>         }
>     }
> }
>
>
>
> My question is whether this is a good idea, or if there is a better way to
> port this type of program to Pig?
>
> Thanks
>
>   

