pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Trivial Update of "PigTutorial" by CorinneC
Date Fri, 20 Jun 2008 17:39:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
  || script2-hadoop.pig || Pig Script 2, Temporal Query Phrase Popularity (Hadoop cluster) ||
  || excite-small.log || Log file, Excite search engine (local mode) ||
  || excite.log || Log file, Excite search engine (Hadoop cluster) ||
- || pornwords || Data file (porn keywords) ||
  
  The user-defined functions (UDFs) are described here.
  
  || '''UDF''' || '''Description'''||
  || !ExtractHour || Extracts the hour from the record.||
  || N!GramGenerator || Composes n-grams from the set of words. ||
- || !NonPornDetector|| Removes the record if the query field includes porn terms. ||
  || NonURLDetector || Removes the record if the query field is empty or a URL. ||
  || !ScoreGenerator || Calculates a "popularity" score for the n-gram.||
  || !ToLower || Changes the query field to lowercase. ||
@@ -131, +129 @@

  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}
  
-  * Call the !NonPornDetector UDF to remove records if the query field contains porn terms.

- {{{ 
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
  
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour (HH) from the time field.
  {{{ 
@@ -218, +212 @@

  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  }}}
  
- 
-  * Call the Non!PornDetector UDF to remove records if the query field contains porn terms.
- {{{
- clean3 = FILTER clean2 BY org.apache.pig.tutorial.NonPornDetector(query);
- }}}
- 
   
   * Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the !ExtractHour UDF to extract the hour from the time field.
  {{{
