pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigFaq" by DavidPhillips
Date Tue, 16 Sep 2008 21:52:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by DavidPhillips:
http://wiki.apache.org/pig/PigFaq

The comment on the change is:
cleanup formatting and grammar

------------------------------------------------------------------------------
- '''1. I'm using !PigStorage to parse my input files. Can I make it use control characters
as delimiters?''' 
+ '''1. I'm using `PigStorage` to parse my input files. Can I make it use control characters
as delimiters?''' 
  
- Yes. The first parameter to !PigStorage is the dataset name, the second is a regular expression
to describe the delimiter. We used String.split(regex, -1) to extract fields from lines. See
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html for more information
on the way to use special characters in regex. For example "load 'input.dat' using !PigStorage('\u0001');"
will use ^A as a delimiter.
+ Yes. The first parameter to `PigStorage` is the dataset name, the second is a regular expression
to describe the delimiter. We used `String.split(regex, -1)` to extract fields from lines.
See [http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html java.util.regex.Pattern]
for more information on the way to use special characters in regex. For example,
+ 
+ {{{
+ LOAD 'input.dat' USING PigStorage('\u0001');
+ }}}
+ 
+ will use `^A` as a delimiter.
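
As a rough illustration of the splitting behavior described in this answer, the following Python sketch mimics what Java's `String.split(regex, -1)` does with a `\u0001` (`^A`) delimiter. It is an analogue, not Pig's actual code; the helper name `split_fields` and the sample record are hypothetical.

```python
import re


def split_fields(line, delimiter_regex):
    """Split a line on a regex delimiter, keeping trailing empty fields.

    Python's re.split keeps trailing empty strings by default, which is
    similar to what Java's String.split(regex, -1) does.
    """
    return re.split(delimiter_regex, line)


# Hypothetical record with three ^A-delimited fields; the last one is empty.
record = "alice\u000142\u0001"
fields = split_fields(record, "\u0001")
print(fields)
```

The trailing empty field is preserved, so a record with a missing last column still yields the expected number of fields.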
  
  '''2. Can I do a numerical comparison while filtering?'''
  
@@ -10, +16 @@

  
  '''3. How do I make my jobs run on multiple machines?'''
  
- Use the PARALLEL clause. For example =C = JOIN A by url, B by url PARALLEL 50
+ Use the `PARALLEL` clause:
  
- '''4. I would like to use Pig to read a list of .gz files that use '\u0001' as a delimiter.
How do I do that?'''
+ {{{
+ C = JOIN A by url, B by url PARALLEL 50;
+ }}}
  
- You can use the following load command: Load 'INPUT_FILE' USING <nop>!PigStorage(‘\u0001’);
+ '''4. I would like to use Pig to read a list of `.gz` files that use `'\u0001'` as a delimiter.
How do I do that?'''
+ 
+ You can use the following load command:
+ 
+ {{{
+ LOAD 'input_file' USING PigStorage('\u0001');
+ }}}
  
  '''5. Does Pig support NULLs?'''
  
@@ -22, +36 @@

  
  '''6. Does Pig support regular expressions?'''
  
- Pig does support regular expression matching via `matches` keyword. It uses java.util.regexp
matches which means your pattern has to match the entire string (ie if your string is "hi
fred" and you want to find "fred" you have to give a pattern of ".*fred" not "fred").
+ Pig does support regular expression matching via the `matches` keyword. It uses [http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html
java.util.regex] matches which means your pattern has to match the entire string (e.g. if
your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"`
not `"fred"`).
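
The whole-string requirement can be sketched outside Pig as well. The Python snippet below uses `re.fullmatch` as a stand-in for Java's full-match semantics; it is only an analogue, and the helper names `full_match` and `contains` are hypothetical.

```python
import re


def full_match(pattern, text):
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


def contains(pattern, text):
    """Return True if the pattern matches anywhere in the string."""
    return re.search(pattern, text) is not None


text = "hi fred"
bare_pattern_matches = full_match("fred", text)        # pattern covers only part of the string
prefixed_pattern_matches = full_match(".*fred", text)  # ".*" absorbs the leading "hi "
substring_found = contains("fred", text)               # substring search, for contrast
print(bare_pattern_matches, prefixed_pattern_matches, substring_found)
```

To find a substring anywhere in a field under full-match semantics, a pattern such as `".*fred.*"` covers text on both sides.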
  
  '''7. How do I prevent failure if some records don't have the needed number of columns?'''
  
  You can filter away those records by including the following in your Pig program:
  
- <verbatim>
+ {{{
- A = load 'foo' using !PigStorage('\t');
+ A = LOAD 'foo' USING PigStorage('\t');
  B = FILTER A BY ARITY(*) >= 5;
  .....
- </verbatim>
+ }}}
  
- This code would drop all the records that has less than 5 columns.
+ This code would drop all records that have fewer than five (5) columns.
  
- '''8. Is there any difference between == and eq for numeric comparisons?'''
+ '''8. Is there any difference between `==` and `eq` for numeric comparisons?'''
  
- For equality, there is no difference while you stay in integers. However 11.0 and 11 will
be equal with == but not with eq. 
+ There is no difference when using integers. However, `11.0` and `11` will be equal with
`==` but not with `eq`. 
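
One way to picture the difference, assuming `eq` compares its operands as strings while `==` compares them as numbers (an assumption; this page does not spell out the mechanism), is the following Python sketch. The helper names are hypothetical.

```python
def numeric_equal(a, b):
    """Compare two values after converting both to floating-point numbers."""
    return float(a) == float(b)


def string_equal(a, b):
    """Compare two values by their literal text."""
    return str(a) == str(b)


as_numbers = numeric_equal("11.0", "11")  # same number once parsed
as_strings = string_equal("11.0", "11")   # different text
print(as_numbers, as_strings)
```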
  
  '''9. Is it possible to use PIG with a regular Hadoop cluster (not HOD) ?'''
  
  You can set this property using the empty string.
  
+ {{{
- hod.server=”” 
+ hod.server=""
+ }}}
  
- 
- '''10. Is there an easy way for me to figure out how many rows exists in a dataset from
its alias?'''
+ '''10. Is there an easy way for me to figure out how many rows exist in a dataset from its
alias?'''
  
  You can run the following set of commands:
  
- <verbatim>
+ {{{
- a = load 'bla' ... ;
+ a = LOAD 'bla' ... ;
- b = group a all;
- c = foreach b generate COUNT(a.$0);
- </verbatim>
+ b = GROUP a ALL;
+ c = FOREACH b GENERATE COUNT(a.$0);
+ }}}
  
+ This is equivalent to `SELECT COUNT(*)` in SQL.
- This is equivalent to select count(*) in SQL.
- 
  
  '''11. Does Pig allow grouping on expressions?'''
  
- Ans. Currently, Pig only allows to group on data fields rather than expressions. Allowing
grouping on expressions is on our road map. Stay tuned!
+ Currently, Pig only allows grouping on data fields rather than expressions. Allowing grouping
on expressions is on our roadmap. Stay tuned!
- 
  
  '''12. Is there a way to check if a map is empty?'''
  
  Currently, there is no way to do that.
  
- 
  '''13. How can I specify the number of nodes Pig allocates?'''
  
+ {{{
  > pig -Dhod.param='-m 3' my_script.pig
+ }}}
  
  Three (3) nodes is the minimum.
  
- '''14. How can I load data using "!PigStorage()" that requires Unicode specification for
separators?'''
+ '''14. How can I load data using `PigStorage()` that requires Unicode specification for
separators?'''
  
+ Old version of Pig using `'\t'`:
  
- Old version of Pig using '\t':<verbatim>a = load '/homes/yahooid/tmp/a.txt' using
!PigStorage('\t');</verbatim>
+ {{{
+ a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');
+ }}}
  
- New version of Pig using Unicode:<verbatim>a = load '/homes/yahooid/tmp/a.txt' using
!PigStorage('\u0000B');</verbatim>
+ New version of Pig using Unicode:
  
+ {{{
+ a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0000B');
+ }}}
+ 
