hadoop-pig-dev mailing list archives

From Dmitriy Ryaboy <dvrya...@cloudera.com>
Subject Re: requirements for Pig 1.0?
Date Wed, 24 Jun 2009 17:18:45 GMT
Alan, any thoughts on performance baselines and benchmarks?

I am a little surprised that you think SQL is a requirement for 1.0, since
it's essentially an overlay, not core functionality.

What about the storage layer rewrite (or is that what you referred to with
your first bullet-point)?

Also, the subject of making more (or all) operators nestable within a
foreach comes up now and then. Would you consider this important for 1.0,
or something that can wait?

What about integration with other languages (à la PyPig)?

The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard for an
outside contributor to know where to jump in :-).

-D


On Wed, Jun 24, 2009 at 10:02 AM, Alan Gates <gates@yahoo-inc.com> wrote:

> Integration with Owl is something we want for 1.0.  I am hopeful that by
> Pig's 1.0 Owl will have flown the coop and become either a subproject or
> found a home in Hadoop's common, since it will hopefully be used by multiple
> other subprojects.
>
> Alan.
>
>
> On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:
>
>  For 1.0 - complete Owl?
>>
>> http://wiki.apache.org/pig/Metadata
>>
>> Russell Jurney
>> rjurney@cloudstenography.com
>>
>>
>> On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:
>>
>>  I don't believe there's a solid list of want-to-haves for 1.0.  The big
>>> issue I see is that there are too many interfaces that are still shifting,
>>> such as:
>>>
>>> 1) Data input/output formats.  The way we do slicing (that is, user
>>> provided InputFormats) and the equivalent outputs aren't yet solid.  They
>>> are still too tied to load and store functions.  We need to break those out
>>> and understand how they will be expressed in the language. Related to this
>>> is the semantics of how Pig interacts with non-file based inputs and
>>> outputs.  We have a suggestion of moving to URLs, but we haven't finished
>>> test driving this to see if it will really be what we want.
>>>
>>> 2) The memory model.  While technically the choices we make on how to
>>> represent things in memory are internal, the reality is that these changes
>>> may affect the way we read and write tuples and bags, which in turn may
>>> affect our load, store, eval, and filter functions.
>>>
>>> 3) SQL.  We're working on introducing SQL soon, and it will take it a few
>>> releases to be fully baked.
>>>
>>> 4) Much better error messages.  In 0.2 our error messages made a leap
>>> forward, but before we can claim to be 1.0 I think they need to make 2 more
>>> leaps:  1) they need to be written in a way end users can understand them
>>> instead of in a way engineers can understand them, including having
>>> sufficient error documentation with suggested courses of action, etc.; 2)
>>> they need to be much better at tying errors back to where they happened in
>>> the script; right now, if one of the MR jobs associated with a Pig Latin
>>> script fails there is no way to know what part of the script it is
>>> associated with.
>>>
>>> There are probably others, but those are the ones I can think of off the
>>> top of my head.  The summary from my viewpoint is we still have several 0.x
>>> releases before we're ready to consider 1.0.  It would be nice to be 1.0 not
>>> too long after Hadoop is, which still gives us at least 6-9 months.
>>>
>>> Alan.
>>>
>>>
>>> On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:
>>>
>>>  I know there was some discussion of making the types release (0.2) a
>>>> "Pig 1"
>>>> release, but that got nixed. There wasn't a similar discussion on 0.3.
>>>> Has the list of want-to-haves for Pig 1.0 been discussed since?
>>>>
>>>
>>>
>>
>
