hadoop-pig-dev mailing list archives

From "Eric Gaudet (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
Date Thu, 30 Apr 2009 23:16:30 GMT
Command that selects a random sample of the rows, similar to LIMIT
------------------------------------------------------------------

                 Key: PIG-795
                 URL: https://issues.apache.org/jira/browse/PIG-795
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Eric Gaudet
            Priority: Trivial


When working with very large data sets (imagine that!), running a Pig script can take a long
time. In some situations it may be useful to run on a small subset of the data (e.g., for
debugging or testing, or to get fast results even if they are less accurate).

The command "LIMIT N" selects the first N rows of the data, but these are not necessarily
randomized. A command "SAMPLE X" would retain each row with probability X%.
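
To make the difference concrete, here is a minimal sketch in Python rather than Pig Latin.
The functions limit() and sample() are hypothetical stand-ins for the two commands, not
part of any Pig API, and the sketch assumes that "probability X%" means each row is kept
or dropped independently of the others.

```python
import random
from itertools import islice


def limit(rows, n):
    """Model of LIMIT N: keep the first n rows in arrival order."""
    return list(islice(rows, n))


def sample(rows, percent, rng=None):
    """Model of the proposed SAMPLE X: keep each row with probability percent/100.

    Assumption: rows are kept or dropped independently of one another.
    """
    rng = rng or random.Random()
    p = percent / 100.0
    # rng.random() is uniform on [0, 1), so 0 keeps nothing and 100 keeps everything.
    return [row for row in rows if rng.random() < p]


rows = list(range(1000))

# Always the same ten rows, taken from the head of the input.
first_ten = limit(rows, 10)

# Roughly 1% of the rows, drawn from anywhere in the input.
about_one_percent = sample(rows, 1, rng=random.Random(42))
```

Under this reading the size of the sample is itself random: SAMPLE 1 over 1000 rows
returns about 10 rows, not exactly 10.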

Note: it is possible to implement this feature with FILTER BY and a UDF, but the same is
true of LIMIT, and LIMIT is built in.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

