pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "ProposedRoadMap" by AlanGates
Date Wed, 07 Nov 2007 18:33:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/ProposedRoadMap

New page:
[[Anchor(Pig_Road_Map)]]
= Pig Road Map =

The following document was developed as a roadmap for pig at Yahoo prior to pig being released
as open source. 

[[Anchor(Abstract)]]
== Abstract ==
This document lays out the road map for pig as of the end of Q3 2007.  It begins by laying
out the [#Pig_Vision vision] for pig and describing its
[#Current_Status current state].  It [#Development_Categories categorizes] different features
to be worked on and discusses
the priorities of those categories.  Each feature is described in [#Feature_Details detail].
 Finally, it discusses how individual
features will be [#Priorities prioritized].

[[Anchor(Pig_Vision)]]
== Pig Vision ==
The pig query language provides several significant advantages compared to straight map reduce:
   1. There are common data query operations that most, if not all, users querying data make
use of.  These include project, filter, aggregate, join, and sort (all of the common database relational operators).  Pig provides these in the map reduce framework so that users
need not implement these common operators over and over, while still allowing the user to
include their custom code for non-common operations.
   2. The use of a declarative language opens up grid data queries to users who do not have
the expertise, time, or inclination to write java programs to answer their queries.
   3. A great majority of data queries can be answered using the provided operators.  A declarative
language greatly reduces the development effort needed by users to make use of grid data.
   4. An interactive shell allows users to easily experiment with their data and queries in
order to shorten their development cycle.

All of this comes at a price in performance and flexibility.  An expert programmer will always be able to write more efficient code in a lower level language to execute a given job than can be done in a general purpose higher level language.
 Also, some types of jobs may not fit nicely into pig's query
model.  For these reasons it is not envisioned that pig should become the only way to access
data on the grid.  However, given sufficient performance,
stability, and ease of use, the majority of users will find pig an easier way to write, test,
and maintain their code.

Given this vision, the goals for the pig team over the next twelve months are:
   1. Make pig a stable, usable, production quality product.
   2. Make pig user friendly in terms of understanding the data being queried, how queries
are executed, etc.
   3. Provide performance at or near the same level available from implementing similar operations
directly in map reduce.
   4. Provide flexibility for users to integrate their code, in any of several supported common
languages, into their pig queries.
   6. Provide users a quality shell in which to navigate the HDFS and do query development.
   7. Provide tools for grid administrators to understand how their data is being queried
and created via pig.

[[Anchor(Current_Status)]]
== Current Status ==
The current status of pig, measured against the goals above, is:
   1. Release 1.2 of pig focused mainly on goal 1 of making pig stable and production quality.  There is still work to be done in this, particularly in the areas of error handling and documentation.
 
   2. Very little support is given to users at this time to assist them in understanding the
structure of the data they are querying, how their queries will be executed, etc.
   3. Pig performance against similar code written directly in map reduce has not been tested.
   4. Users can integrate java code with pig.  Other languages are not supported.  The supported
java code must be a function that accepts a row at a time and outputs one or more rows in
response.  Other models, such as streaming, are not supported.
   6. The grunt shell provides the ability to browse the HDFS and to create pig queries. 
It is very rudimentary, lacking many features of a modern shell.
   7. There are no tools currently to allow a grid administrator to monitor how pig is being
used on the grid.

[[Anchor(Development_Categories)]]
== Development Categories ==
Features under consideration for development are categorized as:
|| '''Category''' || '''Priority''' || '''Types of Features''' || '''Comments''' ||
|| Engineering manageability || High || release engineering, testing || ||
|| Infrastructure || High || architectural changes, such as metadata, types || ||
|| Performance || High / Medium || || Changes that bring pig into near parity with direct map reduce are high priority, others are medium ||
|| Production quality || Medium ||  good error handling, reliability || ||
|| User experience || Medium || ease of use, providing adequate information for the user || ||
|| Security || Low || || Do not have requirements or security support from HDFS yet ||

[[Anchor(Feature_Details)]]
== Feature Details ==
For each of the features, the following information is given:

'''Explanation''':  What this feature is.

'''Rationale''':  Why it is needed.  This should include a reference to which of the pig goals mentioned above this meets.

'''Category''':  Which of the above [#Development_Categories categories] the feature fits
into.

'''Requestor''':  Indicates whether the feature was requested by users, development, QA, administrators (that is, grid administrators), or management.

'''Depends On''':  Other features that must be implemented before this one can be.  Most of these are internal, but a few are external dependencies.

'''Current Situation''':  The feature may be totally non-existent, already partially addressed
in current pig, under design, or under development.

'''Estimated Development Effort''':
 * small:  less than 1 person month
 * medium:  1-3 person months
 * large:  more than 3 person months

'''Urgency''':
 * low:  only requested by a few users
 * medium:  requested by many users or adds significant value to the language
 * high:  requested ubiquitously, adds tremendous value to the language, or enables a number of other changes

[[Anchor(Describe_Schema)]]
=== Describe Schema ===
'''Explanation''':  Users need to be able to see the schema of data they wish to query.  This
includes finding the schema of data that is not yet being
queried (e.g. `describe '/user/me/mydata'`) and determining the schema of a relation after
several pig operations, e.g.:
{{{
a = load('/user/me/mydata') as (query, urls, user_clicked);
b = filter a by user_clicked >= '1';
c = group b by query;
describe c;
}}}

Also, this should allow the user to see the schema that will result from merging several files,
e.g.
{{{
a = load('/user/me/mydata/*')
describe a;
}}}
should return the merged schema of all the files in the directory `/user/me/mydata`.

'''Rationale''':  Goal 2.

'''Category''':  User experience.

'''Requestor''':  Users, Development. 

'''Depends On''': [#Metadata Metadata]

'''Current Situation''':  As of pig 1.2 the second type of describe (schema of a current relation)
can be done.  Describing the schema of a file cannot be
done until metadata about the file exists.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Medium.

[[Anchor(Documentation)]]
=== Documentation ===
'''Explanation''':  Pig users need comprehensive user documentation that will help them write
queries, write user defined functions, submit their pig scripts
to the cluster, etc.  

'''Rationale''':  Goal 1.

'''Requestor''':  Users, Development. 

'''Category''':  User experience.

'''Depends On''':  

'''Current Situation''':  There are a few existing wiki pages at http://wiki.apache.org/pig/.
However, there is not enough at this time to allow users to get started with pig, and what does exist is not collected into a coherent whole with an index, etc.

'''Estimated Development Effort''':  Medium.  This effort requires expertise not currently
existing in the Pig development team.  A
document writer is needed to assist the team in this.

'''Urgency''':  High.

[[Anchor(Early_Error_Detection_and_Failure)]]
=== Early Error Detection and Failure ===
'''Explanation''':  Errors should be detected in pig queries at the earliest possible opportunity.
 This includes syntax parsing that should be done
before pig connects to HOD at all, and type checking that should be done before the jobs are
submitted to Map Reduce.

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Development

'''Depends On''':  Some types of checking will require [#Metadata Metadata], others could
be accomplished without it.

'''Current Situation''':  Currently a number of errors are not detected until runtime.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Medium.

[[Anchor(Error_Handling)]]
=== Error Handling ===
'''Explanation''':    Pig needs to divide errors into categories of:
   * Warning:  Something went wrong, but it is not a significant enough issue to stop execution, though a particular data value may be lost (and replaced by a NULL).  For example, this would be used for a divide by zero error.
   * Error:  Something went wrong and the current task will be aborted, but the query as a whole should continue and the task should be retried.  For example, a resource was temporarily unavailable.
   * Fatal:  Something went wrong and there is no reason to attempt to finish executing the
query.  For example, the requested data file cannot be found.

'''Rationale''':  Goal 1.

'''Category''':  Production quality.

'''Requestor''':  Development.

'''Depends On''':  [#NULL_Support NULL Support]

'''Current Situation''':  Most errors encountered during map reduce execution cause the system
to immediately abandon the query (e.g. a divide by zero
error in a query).

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Explain_Plan_Support)]]
=== Explain Plan Support ===
'''Explanation''':  Users want to be able to see how their query will be executed.  This helps them understand how changes in their queries will affect the execution.

'''Rationale''':  Goal 2.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  No explanation is currently available for query execution.  The
user can read the output of pig to see how many times map reduce was invoked,
but even this does not make clear which operations are being done in each step.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Low.

[[Anchor(Removing_Automatic_String_Encoding)]]
=== Removing Automatic String Encoding ===
'''Explanation''':  Currently all data that pig reads from the grid is assumed to be strings encoded in UTF8.  However, some of the data stored in the grid is not 100% UTF8.  Furthermore, the encoding is not provided within the data, and the strings in question are too short for inductive algorithms to determine the encoding.  Pig (and any other user) therefore has no way to determine the encoding.  However, since pig currently assumes the data is UTF8 encoded, it corrupts the data by converting it improperly.

There are two partial solutions to this.  One is to provide users with a C-string-like data type that provides some basic operators (equality, less than, maybe very limited regular expressions) and does not attempt any interpretation or translation of the underlying data.  This will be done as part of the types work to create a byte data type.  The second solution is, even for full string types, to only translate the underlying data to java strings when actually required, instead of by default.  This should be done for performance reasons, as translation is expensive.

'''Rationale''':  Goal 1.

'''Category''':  Production quality.

'''Requestor''':  Development

'''Depends On''':  [#Types_Beyond_String Types Beyond String]

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  Small, as most of the work will be done in "Types Beyond
String".

'''Urgency''':  Medium.

[[Anchor(Language_Expansion)]]
=== Language Expansion ===
'''Explanation''':  Users have requested a number of extensions to the language.  Some of these are new functionality, some just extensions of existing functionality.  Requested extensions are:
   1. CASE statement.  This would generalize the currently available binary condition operator
(`? :`).
   2. Extend operators allowed inside `FOREACH { ... GENERATE }`.  Users have requested that
GROUP, COGROUP, and JOIN be added.
   3. Allow user to specify that a sort should be done in descending order instead of the
default ascending order.  This should be extended to the SQL functionality of specifying ascending
or descending for each field, e.g. `ORDER a BY $0 ASCENDING, $1 DESCENDING`.
   4. Add LIMIT functionality.
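Taken together, the requested extensions might look like the following (proposed syntax only; none of these operators exist yet, and the final syntax is TBD):
{{{
a = load '/user/me/mydata' as (query, clicks);
b = order a by $1 descending;    -- requested descending sort
c = limit b 100;                 -- requested LIMIT functionality
}}}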

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  None of the above requested extensions currently exist.

'''Estimated Development Effort''':  Medium for all of the above, but note that any one of
the above could be added without the others.  Taken alone, each of the above
are Small.

'''Urgency''':  Low.

[[Anchor(Logging)]]
=== Logging ===
'''Explanation''':  Pig needs a consistent, configurable way to write query progress, debugging,
and error information both to the user and to logs.

'''Rationale''':  Goal 1.

'''Category''':  User experience and Engineering manageability.

'''Requestor''':  Development.

'''Depends On''':  

'''Current Situation''':  Most pig progress, debugging, and error information is currently
written to stderr via System.err.println.  Hadoop uses log4j
to provide trace logging.  Recently (as of pig 1.1d), pig began to use log4j as well.  However,
very few parts of the code have been converted from
println to log4j.  In addition, pig only has a screen appender for log4j, so it only writes to stderr.  Pig needs to generate a log file on the front end as well.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Metadata)]]
=== Metadata ===
'''Explanation''':  Pig needs metadata to do the following:
   1. Provide a way for users to understand the data they want to query.  For example, users
should be able to say `DESCRIBE '/user/me/mydata'` and get back a list of (minimally) fields
and their types.
   2. Do type checking to find situations where users issue queries that are not semantically
meaningful (such as dividing by a string).
   3. Allow performance optimizations such as:
      * performing arithmetic operations (e.g. SUM) as a long if the underlying type is an
int or long, instead of as a double.
      * allowing pig to load and store numeric types as numerics rather than requiring a conversion
from string.
   4. Provide a way to describe to a user defined function the format of the data being passed
to the function.
   5. Allow sorting via numeric in addition to lexical order.

The decision has been made to use the Jute interface, provided by hadoop, to describe metadata.  This will allow pig to interact with other hadoop tools.  It will also free the pig team from needing to develop their own metadata management library.

The goal is not to require metadata in pig.  Pig will retain the flexibility to work on unspecified
input data.

How pig should handle the case where the data it is provided does not match the metadata specification remains an open question.  The default should be to produce a warning for each non-conforming row.  It remains to be determined if there is a use for a "strict" mode where a non-conforming row would cause a query-stopping failure.

Note that this metadata only refers to metadata local to a file.  It does not
refer to global metadata such as table names, available UDFs, etc.
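As a rough illustration of the shape of per-file metadata (the class and method names below are hypothetical, not a pig or Jute API; real storage would be via Jute), the DESCRIBE case needs little more than an ordered mapping of field names to types:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the minimal per-file metadata DESCRIBE needs is an
// ordered mapping of field names to types.
public class FileSchema {
    private final Map<String, String> fields = new LinkedHashMap<String, String>();

    public void addField(String name, String type) {
        fields.put(name, type);
    }

    // Renders the schema roughly the way DESCRIBE might print it.
    public String describe() {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 1) sb.append(", ");
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return sb.append("}").toString();
    }
}
```

Loaders and user defined functions could consult the same structure, which is what makes items 2 and 4 above possible.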

'''Rationale''':  Goals 2, 3, 4.

'''Category''':  Infrastructure.

'''Requestor''':  Everyone.

'''Depends On''':  [#Types_Beyond_String Types Beyond String], some changes to Jute.

'''Current Situation''':  No metadata is available concerning data stored in the grid.

'''Estimated Development Effort''':  Large.

'''Urgency''':  High.

[[Anchor(NULL_Support)]]
=== NULL Support ===
'''Explanation''':  Pig needs to be able to support NULLs in its data for the following reasons:
   1. Some of its input data has NULL values in it, and users want to be able to act on this
NULL data (e.g. filter it out via IS NOT NULL).
   2. Function and expression evaluations sometimes are unable to return a value, but execution
of the query should not stop (e.g. divide by zero error).  In this case the evaluation needs
to return a NULL value to place in the field.

This requires that Jute support NULL values in data stored in its format.

This will require additions to the language, namely the ability to filter on IS (NOT) NULL.
 It will also require that user defined functions and expression
evaluators determine how they will
interact with NULLs.  Functions that are similar to SQL functions (such as COUNT, SUM, etc.)
and arithmetic expression operators should behave in a way consistent
with SQL standards (even though SQL standards themselves are inconsistent) in order to avoid
violating the law of least astonishment.
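As an illustration, the filter extension might read (proposed syntax only, mirroring SQL's IS [NOT] NULL):
{{{
a = load '/user/me/mydata' as (query, clicks);
b = filter a by clicks is not null;
}}}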

'''Rationale''':  Goal 1.

'''Category''':  Infrastructure.

'''Requestor''':  Development.

'''Depends On''':  Jute NULL support.

'''Current Situation''':  There is no concept of NULL in pig at this time.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High.

[[Anchor(Parameterized_Queries)]]
=== Parameterized Queries ===
'''Explanation''':  Users would like the ability to define parameters in a pig query.  When
the query is invoked, values for those parameters would be defined and they
would then be used in execution of the query.  For example:
{{{
a = load '/data/mydata/@date@';
b = load '@latest_bot_filter@';
...
}}}

The query above could then be invoked, providing values for 'date' and 'latest_bot_filter'.

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  No support for this is available.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Low.

[[Anchor(Performance)]]
=== Performance ===
'''Explanation''':  Pig needs to run, as close as is feasible for a generic language, at near the same performance that could be obtained by a programmer developing an application directly in map reduce.  There are multiple possible routes to take in performance enhancement:
   1. Query optimization via relational operator reordering and substitution.  For example,
pushing filters as far up the execution plan as possible, pushing expression evaluation as
late as possible, etc.
   2. In some instances it is possible to execute multiple relational operators in the same
map or reduce.  In some cases pig is already doing this (e.g. multiple filters are pipelined
into one map operation).  There are additional cases that pig is not taking advantage of that
it could (for example if a join is followed immediately by an aggregation on the same key,
both could be placed in the same reduce).
   4. After the map stage, the map reduce system sorts the data according to keys specified by the job requester.  This key comparison is done via a class specified by the job requester.  Currently, this class is instantiated on every key comparison.  This means that the constructor for this class is called at least n log n times.  Hadoop provides a byte comparator option where the sequence of bytes in the key is compared directly rather than through an instantiated class.  Wherever possible pig needs to make use of this comparator.
   5. Hadoop provides the ability to run a mini-reduce stage (called the combine stage) after
the map and before data is shuffled to new processes on the reduce.  For certain algebraic
functions (such as SUM and COUNT) the amount of data to be shipped between map and reduce
could be reduced (sometimes greatly) by running the reduce step first in the combine, and
then again in reduce.
   6. The code as it exists today can be instrumented and tests done to determine where the
most time is spent.  This information can then be used to optimize heavily used areas of the
code that are taking significant amounts of time.
   7. Split performance is very poor, and needs to be optimized.
   8. Currently, given a line of pig like `foreach A generate group, COUNT($0), SUM($1)`,
pig will run over `A` twice, once for `COUNT` and once for `SUM`.  It should be able to compute
both in a single pass.
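The single-pass evaluation in the last item amounts to folding all of the aggregates over one scan of the grouped bag.  A minimal java sketch (illustrative only; this is not pig's actual execution code):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: computes COUNT and SUM in one scan of the input,
// rather than scanning once per aggregate function.
public class SinglePassAggregate {

    // Returns {count, sum}, both accumulated during a single iteration.
    public static double[] countAndSum(List<Double> values) {
        long count = 0;
        double sum = 0.0;
        for (double v : values) { // one pass feeds both accumulators
            count++;
            sum += v;
        }
        return new double[] { count, sum };
    }

    public static void main(String[] args) {
        double[] r = countAndSum(Arrays.asList(1.0, 2.0, 3.0));
        System.out.println(r[0] + " " + r[1]); // 3.0 6.0
    }
}
```

The same accumulator structure is what makes the combine optimization in item 5 work: partial counts and sums from each map can be merged in the combine stage and again in the reduce.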

'''Rationale''':  Goal 3.

'''Category''':  Performance.

'''Requestor''':  Everyone.

'''Depends On''':  [#Metadata Metadata] for some optimizations

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  
   1. Large
   2. Medium
   4. Small
   5. Small
   6. Small
   7. ?
   8. ?

'''Urgency''':  High for 4, 5, and 6, Medium for the rest.

[[Anchor(Query_Optimization)]]
=== Query Optimization ===
See [#Performance Performance]

[[Anchor(Shell_Support)]]
=== Shell Support ===
'''Explanation''':  There are three main goals of a pig shell:
   1. The ability to interact in a shell with the HDFS.
   2. The ability to interact in a shell with the user's machine.  This would allow the user
to edit local files, etc.
   3. The ability to write pig scripts in an interactive shell (similar to perl or python
shells).

Items 1 and 2 are similar, the only difference being which file system and OS they interact
with.  Item 3 is something entirely different, though both types
of items are called shells.  Item 3 is an interactive programming environment.  For this reason
I do not think it makes sense to attempt to combine the
two shells.  Instead I propose that two shells be developed:

'''hsh''' (hadoop shell) - This shell will provide shell level interaction with HDFS.  In
addition to the current operators of cat, cp, kill, ls, mkdir, mv,
pwd, rm it may be necessary to add chmod (based on hadoop implementation of permissions),
chown and chgrp (based on hadoop implementation of ownership),
df, ds, ln (if hadoop someday supports links), and touch.  This shell will also support access
to the user's local machine and all of the available shell
commands.  Ideally the shell would even support applying standard shell file tools that access
files linearly (grep, more, etc.) to HDFS files, but this
may be a later addition.  Separate commands for cp, rm, etc. would not be required for each file system; the shell should be able to determine the file system based on the path.  To accomplish this, HDFS files would have /hdfs/CLUSTER (where CLUSTER
is the name of the cluster, e.g. kryptonite) as the base of
their path.  hsh would then need to intercept shell commands such as rm, and determine whether
to execute in HDFS or the local file system, based on the
file being manipulated.  For example, if a user wanted to create a file in his local directory
on the gateway machine and then copy it to
the HDFS, he could then do:
{{{
hsh> vi myfile
hsh> cp myfile /hdfs/kryptonite/user/me/myfile
}}}
The invocation of vi would call vi on the local machine and edit a file in the current working
directory on his local box.  The copy would move the file
from the user's local directory on the gateway machine to his directory on the HDFS.
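The dispatch decision described above reduces to a prefix test on the path (sketch only; the class and method names are hypothetical):

```java
// Illustrative sketch of the proposed hsh dispatch rule: a path under
// /hdfs/CLUSTER/... names an HDFS file; anything else is local.
public class PathDispatch {

    public static boolean isHdfs(String path) {
        return path.startsWith("/hdfs/");
    }

    // Extracts the cluster name, e.g. "/hdfs/kryptonite/user/me/myfile"
    // yields "kryptonite"; returns null for local paths.
    public static String cluster(String path) {
        if (!isHdfs(path)) return null;
        String rest = path.substring("/hdfs/".length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }
}
```

In the cp example above, hsh would see that the destination is under /hdfs/kryptonite and route the copy to HDFS while reading the source locally.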

hsh will support standard command line editing (and ideally tab completion).  This should be done via a third party library such as jline.

'''grunt''' - This shell will provide interactive pig scripting, similar to the way it does
today.  It will no longer support HDFS file manipulation commands
(such as mv, rm, etc.).  It will also be extended to support inline scripting in either perl
or python.  (Which one should be used is not clear.  The
research team votes for python.  User input is needed to decide if more users would prefer
python or perl.) This will enable two important features:
   1. Embedding pig in a scripting language to give it control flow and branching structures.
   2. Allow on the fly development of user defined functions by creating a function native
to the scripting language and then referencing it in the pig script.

The grunt shell will be accessible from hsh by typing `grunt`.  This will put the user in
the interactive shell, in much the same way that typing `python`
on a unix box puts the user in an interactive python development shell.

'''Rationale''':  Goals 4 and 6.

'''Category''':  Infrastructure.

'''Requestor''':  Development, Users

'''Depends On''':  

'''Current Situation''':  Currently grunt provides very limited commands (ls, mkdir, mv, cp,
cat, rm, kill) for the HDFS.  It also provides an interactive
shell for generating pig scripts on the fly (similar to perl or python interactive shells).
 No command line history or editing is provided.  No connection with the user's local file system is supported.

'''Estimated Development Effort''':  Large.

'''Urgency''':  Medium.

[[Anchor(SQL_Support)]]
=== SQL Support ===
'''Explanation''':  There is a large base of SQL users.  Many of these users will prefer to query data in SQL rather than learn pig.  To accommodate these users and increase the adoption of hadoop, we could provide a SQL layer on top of pig.  From pig's perspective, the easiest way to address this is to directly translate SQL to pig, and then execute the resulting pig script.
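As an illustration of the direct translation (hypothetical example; the actual mapping is TBD), a SQL query such as
{{{
SELECT query, COUNT(*) FROM mydata GROUP BY query;
}}}
could translate to a pig script like
{{{
a = load '/user/me/mydata';
b = group a by $0;
c = foreach b generate group, COUNT($1);
}}}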

'''Rationale''':  Goal 4.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  [#Metadata Metadata]

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  Large.

'''Urgency''':  Medium.


[[Anchor(Statistics_on_Pig_Usage)]]
=== Statistics on Pig Usage ===
'''Explanation''':  Pig should record statistics on its usage so that pig developers and grid
administrators can monitor and understand pig usage.  These
statistics need to be collected in a way that does not compromise the security of data that
is being queried (i.e. they cannot store results of the
queries).  They should however contain: 
   * files loaded
   * files created
   * queries submitted (including the text of the queries)
   * time the query took to execute 
   * number of rows input to the query
   * number of rows output by the query
   * size of intermediate results created by the query
   * error and warning count
   * error and warning types and messages
   * status result of the query (did it succeed, fail, was it interrupted, etc.)
   * user who executed the query.
   * operations performed by the query (e.g. did the query include a filter, a group, etc.)
 This can be used by developers to search for all queries that include an aggregation, etc.

For security reasons, these statistics will need to be kept inside the grid they are generated on, and only be accessible to cluster administrators and pig developers.

'''Rationale''':  Goal 7.  This also assists developers in meeting goals 1 and 3 because it
allows them to determine the quality and performance of the user
experience.

'''Category''':  Engineering manageability.

'''Requestor''':  Administrators

'''Depends On''':  

'''Current Situation''':  There are no usage statistics available.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Stream_Support)]]
=== Stream Support ===
'''Explanation''':  Hadoop supports running HDFS files through an arbitrary executable, such
as grep or a user provided program.  This is referred to as
streaming.  There exists already a base of user programs used in this way.  Users have expressed
an interest in integrating these with pig so that they
can take advantage of pig's parallel structure and relational operators but still use their
hand crafted programs to express their business logic.  For
example (note:  syntax and semantics still TBD)
{{{
a = load '/user/me/mydata';
b = filter a by $0 matches "^[a-zA-Z]\.yahoo.com";
c = stream b through countProperties;
...
}}}

'''Rationale''':  Goal 4.

'''Category''':  Infrastructure.

'''Requestor''': Users

'''Depends On''':  

'''Current Situation''':  No support for streaming is currently available.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium (while it has only been requested by a couple of users to date, we
believe it will open up pig usage to a number of users and
therefore is more desirable).

[[Anchor(Test_Framework)]]
=== Test Framework ===
'''Explanation''':  Data querying systems like pig have unique functional testing requirements.  For unit tests, junit is an excellent tool.  But for functional tests, where arbitrary queries of potentially large amounts of data need to be added on a regular basis, it is not feasible to use a testing system like junit that assumes the test implementer knows the correct result for the test when the test is implemented.  We need a tool that will allow the
tester to designate a source of truth for the test, and then generate the expected results
for that test from that source of truth.  For example,
for small functional tests a database could be set up with the same data as in the grid. 
The tester would then write a pig query and an equivalent SQL query, and
the test harness would run both and compare the results.
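The comparison step of such a harness could be as simple as an order-insensitive check of the two row sets (illustrative sketch; the names are hypothetical, and a real harness would also need to handle floating point tolerances, type coercion, etc.):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative harness core: compare the rows pig produced against the
// rows the designated source of truth (e.g. a SQL database) produced.
// The comparison is order-insensitive, since neither system guarantees
// output order unless the query sorts.
public class TruthCompare {

    public static boolean sameRows(List<String> pigRows, List<String> truthRows) {
        List<String> a = new ArrayList<String>(pigRows);
        List<String> b = new ArrayList<String>(truthRows);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }
}
```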

In addition to the above, a set of performance tests need to be created to allow developers
to test pig performance.  It should be possible to develop
these in the same framework as is used for the functional tests.

'''Rationale''':  Goal 1.

'''Category''':  Engineering manageability.

'''Requestor''':  Development.

'''Depends On''':  

'''Current Situation''':  Pig has unit tests in junit.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High (quality testing that is easy for developers to implement facilitates better and faster development of features).

[[Anchor(Test_Integration_with_Hudson)]]
=== Test Integration with Hudson ===
'''Explanation''':  Currently hadoop uses hudson as its nightly build environment.  This includes
checking out, building, and
running unit tests.  It is also used to test new patches that are submitted to jira.  Pig
needs to be integrated with hudson in this same way.

Also, there are a set of tools that are regularly applied to other hadoop nightly builds (Findbugs
and code coverage metrics) via hudson.
These need to be applied to pig as well.

'''Rationale''':  Goal 1.

'''Category''':  Engineering manageability.

'''Requestor''':  QA.

'''Depends On''':  

'''Current Situation''':  See Explanation

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High (frees committers from manually testing patches).

[[Anchor(Types_Beyond_String)]]
=== Types Beyond String ===
'''Explanation''':  For performance and usability reasons pig needs to support atomic data types beyond strings.  Based on data types supported in Jute and standard data types supported by most data processing systems, these types will be:
   * int (32 bit integer)
   * long (64 bit integer)
   * float (32 bit floating point number)
   * double (64 bit floating point number)
   * string (already supported)
   * byte (binary data) - Should this be called cstring, or bstring, or something else?  We want to provide some basic operators for it, so it isn't simply a blob or binary type.  But it isn't a full-fledged java string either.
   * user defined types

Supporting these types in a native format will allow performance gains in reading and writing
data to disk and in key comparison for sorting, grouping,
etc.  It will also avoid requiring string->number conversions for every row during numeric
expression evaluation.  It will also allow sorting in numeric
order rather than requiring that all sorts be in lexical order, as is currently the case.

User defined types will require the user to provide a set of functions to:
   * load and store the object
   * convert the object to a java string.

Optionally, a user could provide functions that do:
   * less than inequality comparison - this would be required to do sorting and filter evaluations
   * for optimization, the full set of less than or equal to, equal, not equal, greater than, greater than or equal to
   * matches (regular expression)
   * arithmetic operations on the types (`+ - * /`)
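As a sketch of what a user defined type might supply (the type and all names here are hypothetical, not a pig API), consider an IP address type with string conversion and a numeric, per-octet less-than comparison:

```java
// Illustrative user defined type (not a pig API): an IPv4 address that
// supplies conversion to/from a java string plus a numeric, per-octet
// comparison, which is what sorting and filter evaluation would require.
public class IpAddressType implements Comparable<IpAddressType> {
    private final int[] octets = new int[4];

    // "load": parse the on-disk string form, e.g. "10.0.0.1".
    public IpAddressType(String s) {
        String[] parts = s.split("\\.");
        for (int i = 0; i < 4; i++) {
            octets[i] = Integer.parseInt(parts[i]);
        }
    }

    // "convert the object to a java string"
    @Override
    public String toString() {
        return octets[0] + "." + octets[1] + "." + octets[2] + "." + octets[3];
    }

    // less-than (and full ordering): numeric per octet rather than
    // lexical, so 10.0.0.2 sorts before 10.0.0.10.
    @Override
    public int compareTo(IpAddressType other) {
        for (int i = 0; i < 4; i++) {
            if (octets[i] != other.octets[i]) {
                return octets[i] < other.octets[i] ? -1 : 1;
            }
        }
        return 0;
    }
}
```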

'''Rationale''':  Goals 1 and 3.

'''Category''':  Infrastructure.

'''Requestor''':  Development.

'''Depends On''': 

'''Current Situation''':  A functional spec for changes to types has been proposed, see http://wiki.apache.org/pig/PigTypesFunctionalSpec

'''Estimated Development Effort''':  Large.

'''Urgency''':  High

[[Anchor(User_Defined_Function_Support_in_Other_Language)]]
=== User Defined Function Support in Other Language ===
'''Explanation''':  Many potential pig users are not java programmers.  Many prefer to program
in C++, perl, python, etc. 
Pig needs to allow users to program in their language of preference. 

This requires that pig provide APIs so that users can write code in these other languages
and use it as part of their query in a way similar to what is
done with java functions today.  Java functions can be implemented as eval functions, filter
functions, or load and storage functions.  At this point
there is not a perceived need for load and storage functions in languages beyond java.  But
eval and filter functions should be supported.

Perl, python, and C++ have been chosen as the languages to be supported because those are
the languages most used by the user community at this time.

'''Rationale''':  Goal 4.

'''Category''':  Infrastructure.

'''Requestor''':  Everyone.

'''Depends On''':  

'''Current Situation''':  Pig currently supports user defined functions in java.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High.

-----
-----
[[Anchor(Addenda)]]
== Addenda  ==
The following thoughts were added after the roadmap had been published and reviewed.

[[Anchor(Execution_that_supports_restart)]]
=== Execution that supports restart ===

We discussed the need for something that takes a Pig execution plan and executes it in a way the user can easily observe and comprehend.  If a job breaks, the user gets an HDFS directory with all of the completed sub-job state and enough information that the user can repair his Pig script or the data and restart.  This is needed for really long jobs.

[[Anchor(Global_Metadata)]]
=== Global Metadata ===
Need to consider where/how/if pig will store global metadata.  Items that
would go here would include table->file mapping (to support such things as
show tables), UDFs available on the cluster, UDTs available on the cluster,
etc.
