hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: [Pig Wiki] Update of "PigMetaData" by AlanGates
Date Sat, 31 May 2008 00:18:37 GMT
I love discussing new ideas, Mathieu. This is not bothering but
interesting. My colleague spent some time doing a Microsoft SSIS thing
that always breaks once there is a schema change and requires a manual
script change. Seems like you are trying to go beyond that.

On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <poumeyrol@idm.fr>
wrote:

> Well, it adds a way to *dynamically* parameterize UDFs, without changing
> the pig script itself.
>
> I guess it comes back to the question of "how big a pig script is". If we
> are only considering 5-line pig scripts, where you load exactly what you
> need to compute, crunch numbers and dump them, I agree it does not make
> much sense.
>
> If one starts thinking about something more ETL-ish (which I understand is
> not exactly the main purpose of pig) then one could want to use pig to
> "move" data around or load data from somewhere, do something "heavy" that
> ETL software just cannot cope with efficiently enough (build indexes,
> process images, whatever) and store the results somewhere else, a scenario
> where there can be fields that pig will just forward without touching them.
>
> I admit my background, where we were using the same software for ETL-like
> stuff and heavy processing (that is, mostly building indexes), may give me
> a very biased opinion about pig and what it should be. But I would
> definitely like to use pig for what it is/will be excellent at, as well as
> for stuff where it will be just ok.
>
> So I still think the extension point is worth having. Half my brain is
> already thinking about ways of cheating and using Alan's fields list to
> pass other stuff around...
>
> One more concrete example and then I stop bothering you all :) In our
> tools, we use some field metadata to denote that a field's content is a
> primary key to a record. When we copy these field values somewhere else,
> we automatically tag them as foreign keys (instead of primary). When we
> dump the data on disk (to a final-user CDROM image in most cases), the
> fact that the column refers to a table also present on the disk can be
> automagically stored, as it is a feature of our final format: without the
> application developer re-specifying the relations, the "UDF store
> equivalent" is clever enough to store the information.
>
> The script an application developer writes to prepare a CDROM can be
> several screens long, with bits spread across separate files. The data
> model could be quite complex too. In this context, it is important that
> things like "this field acts as a record key" are said only once.
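>
> To make this concrete, here is a minimal sketch of the tagging step, using
> the per-field Map<String, Serializable> I proposed earlier in this thread
> (all names are illustrative, nothing of this exists in Pig today):
>
>   import java.io.Serializable;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   public final class KeyTagging {
>       // Illustrative only: when a field is copied, demote its "primary
>       // key" tag to "foreign key" and remember which table it refers to.
>       static Map<String, Serializable> tagCopy(
>               Map<String, Serializable> source, String sourceTable) {
>           Map<String, Serializable> copy =
>               new HashMap<String, Serializable>(source);
>           if ("primary".equals(copy.get("key.role"))) {
>               copy.put("key.role", "foreign");
>               copy.put("key.refersTo", sourceTable);
>           }
>           return copy;
>       }
>   }
>
> A "UDF store equivalent" that understands the "key.refersTo" entry could
> then record the relation in the output format without the developer
> restating it.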
>
> On 30 May 08, at 16:13, pi song wrote:
>
>
>> Also, adding metadata is conceptually adding another way to parameterize
>> load/store functions. Making UDFs parameterized by other UDFs is therefore
>> also functionally possible, but I just couldn't think of any good use
>> cases.
>>
>> On Sat, May 31, 2008 at 12:09 AM, pi song <pi.songs@gmail.com> wrote:
>>
>>  Just out of curiosity. If you say somehow the UDF store in your example
>>> can
>>> "learn" from UDF load. That information still might not be useful because
>>> between "load" and "store", you've got processing logic which might or
>>> might
>>> not alter the validity of information directly transfered from "load" to
>>> "store". An example would be I do load a list of number and then I
>>> convert
>>> to string. Then information on the UDF store side is then not applicable.
>>>
>>> Don't you think the cases where this concept can be useful is very rare?
>>>
>>> Pi
>>>
>>>
>>>
>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <poumeyrol@idm.fr>
>>> wrote:
>>>
>>>> Pi,
>>>>
>>>> Well... I was thinking... the three of them actually. Alan's list is
>>>> quite comprehensive, so it is not that easy to find a convincing example,
>>>> but I'm sure UDF developers may need some additional information to
>>>> communicate metadata from one UDF to another.
>>>>
>>>> It does not make sense if you think "one UDF function", but it is a way
>>>> to have two coordinated UDFs communicating.
>>>>
>>>> For instance, the developer of a jdbc pig "connector" will typically
>>>> write a UDF load and a UDF store. What if he wants the loader to discover
>>>> the field collection (case 3, self describing data, in Alan's page) from
>>>> jdbc and propagate the exact column type of a given field (as in
>>>> "VARCHAR(42)") to create it the right way in the UDF store? Or the table
>>>> name? Or the fact that a column is indexed, a primary key, a foreign key
>>>> constraint, some encoding info... He may also want to develop a UDF
>>>> pipeline function that would perform some foreign key validation against
>>>> the database at some point in his script. Having the information in the
>>>> metadata may be useful.
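>>>>
>>>> To illustrate, the loader side could stash what it learns from jdbc in
>>>> per-field metadata, roughly like this (purely a sketch; the metadata map
>>>> is the extension point I am proposing, not an existing Pig API):
>>>>
>>>>   import java.io.Serializable;
>>>>   import java.sql.ResultSetMetaData;
>>>>   import java.sql.SQLException;
>>>>   import java.util.HashMap;
>>>>   import java.util.Map;
>>>>
>>>>   public final class JdbcColumnInfo {
>>>>       // Sketch: describe one column from jdbc metadata so a coordinated
>>>>       // UDF store can later recreate it exactly (e.g. VARCHAR(42)).
>>>>       static Map<String, Serializable> describeColumn(
>>>>               ResultSetMetaData md, int col) throws SQLException {
>>>>           Map<String, Serializable> m =
>>>>               new HashMap<String, Serializable>();
>>>>           m.put("sql.typeName", md.getColumnTypeName(col)); // "VARCHAR"
>>>>           m.put("sql.precision", md.getPrecision(col));     // 42
>>>>           m.put("sql.table", md.getTableName(col));
>>>>           return m;
>>>>       }
>>>>   }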
>>>>
>>>> Some other application domains we cannot think of today may need some
>>>> completely different metadata. My whole point is: Pig should provide
>>>> some metadata extension point.
>>>>
>>>> On 30 May 08, at 13:54, pi song wrote:
>>>>
>>>>
>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be a UDF
>>>>> load, a UDF store, or a UDF as a function in the pipeline. Can you
>>>>> explain a bit more?
>>>>>
>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <poumeyrol@idm.fr>
>>>>> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> Looking at the very extensive list of types of file specific metadata,
>>>>>> I think (from experience) that a UDF may need to attach some
>>>>>> information (any information, actually) to a given field (or file), to
>>>>>> be retrieved by another UDF downstream.
>>>>>>
>>>>>> What about adding a Map<String, Serializable> to each file and each
>>>>>> field?
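>>>>>>
>>>>>> Roughly something like this, just to sketch the shape of it (the class
>>>>>> and method names are made up for the example):
>>>>>>
>>>>>>   import java.io.Serializable;
>>>>>>   import java.util.HashMap;
>>>>>>   import java.util.Map;
>>>>>>
>>>>>>   // Sketch: every file and every field would carry one of these, so
>>>>>>   // any UDF can leave notes to be read by other UDFs downstream.
>>>>>>   public class UdfMetadata implements Serializable {
>>>>>>       private final Map<String, Serializable> entries =
>>>>>>           new HashMap<String, Serializable>();
>>>>>>       public void put(String key, Serializable value) {
>>>>>>           entries.put(key, value);
>>>>>>       }
>>>>>>       public Serializable get(String key) {
>>>>>>           return entries.get(key);
>>>>>>       }
>>>>>>   }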
>>>>>>
>>>>>> --
>>>>>> Mathieu
>>>>>>
>>>>>> On 30 May 08, at 01:24, pi song wrote:
>>>>>>
>>>>>>
>>>>>>> Alan,
>>>>>>>
>>>>>>> I will start thinking about this as well. When do you want to start
>>>>>>> the implementation?
>>>>>>>
>>>>>>> Pi
>>>>>>>
>>>>>>> On 5/29/08, Apache Wiki <wikidiffs@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Dear Wiki user,
>>>>>>>>
>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>>>>>>> change notification.
>>>>>>>>
>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> information, histograms, etc.
>>>>>>>>
>>>>>>>> == Pig Interface to File Specific Metadata ==
>>>>>>>> - Pig should support four options with regard to file specific
>>>>>>>> metadata:
>>>>>>>> + Pig should support four options with regard to reading file
>>>>>>>> specific metadata:
>>>>>>>> 1.  No file specific metadata available.  Pig uses the file as input
>>>>>>>> with no knowledge of its content.  All data is assumed to be
>>>>>>>> !ByteArrays.
>>>>>>>> 2.  User provides schema in the script.  For example, `A = load
>>>>>>>> 'myfile' as (a: chararray, b: int);`.
>>>>>>>> 3.  Self describing data.  Data may be in a format that describes the
>>>>>>>> schema, such as JSON.  Users may also have other proprietary ways to
>>>>>>>> store information about the data in a file either in the file itself
>>>>>>>> or in an associated file.  Changes to the !LoadFunc interface made as
>>>>>>>> part of the pipeline rework support this for data type and column
>>>>>>>> layout only.  It will need to be expanded to support other types of
>>>>>>>> information about the file.
>>>>>>>> 4.  Input from a data catalog.  Pig needs to be able to query an
>>>>>>>> external data catalog to acquire information about a file.  All the
>>>>>>>> same information available in option 3 should be available via this
>>>>>>>> interface.  This interface does not yet exist and needs to be
>>>>>>>> designed.
>>>>>>>>
>>>>>>>> + It should support options 3 and 4 for writing file specific
>>>>>>>> metadata as well.
>>>>>>>> +
>>>>>>>> == Pig Interface to Global Metadata ==
>>>>>>>> - An interface will need to be designed for pig to interface to an
>>>>>>>> external data catalog.
>>>>>>>> + An interface will need to be designed for pig to read from and
>>>>>>>> write to an external data catalog.
>>>>>>>>
>>>>>>>> == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>> Pig needs to be able to connect to various types of external data
>>>>>>>> catalogs (databases, catalogs stored in flat files, web services,
>>>>>>>> etc.).  To facilitate this
>>>>>>>> - pig will develop a generic interface that allows it to make
>>>>>>>> specific types of queries to a data catalog.  Drivers will then need
>>>>>>>> to be written to implement
>>>>>>>> + pig will develop a generic interface that allows it to query and
>>>>>>>> update a data catalog.  Drivers will then need to be written to
>>>>>>>> implement
>>>>>>>> that interface and connect to a specific type of data catalog.
>>>>>>>>
>>>>>>>> == Types of File Specific Metadata Pig Will Use ==
>>>>>>>> - Pig should be able to acquire the following types of information
>>>>>>>> about a file via either self description or an external data
>>>>>>>> catalog.  This is not to say
>>>>>>>> + Pig should be able to acquire and record the following types of
>>>>>>>> information about a file via either self description or an external
>>>>>>>> data catalog.  This is not to say
>>>>>>>> that every self describing file or external data catalog must
>>>>>>>> support every one of these items.  This is a list of items pig may
>>>>>>>> find useful and should be
>>>>>>>> - able to query for.  If the metadata source cannot provide the
>>>>>>>> information, pig will simply not make use of it.
>>>>>>>> + able to query for and create.  If the metadata source cannot
>>>>>>>> provide or store the information, pig will simply not make use of it
>>>>>>>> or record it.
>>>>>>>> * Field layout (already supported)
>>>>>>>> * Field types (already supported)
>>>>>>>> * Sortedness of the data, both key and direction
>>>>>>>> (ascending/descending)
>>>>>>>> @@ -52, +54 @@
>>>>>>>>
>>>>>>>> == Priorities ==
>>>>>>>> Given that the usage for global metadata is unclear, the priority
>>>>>>>> will be placed on supporting file specific metadata.  The first step
>>>>>>>> should be to define the
>>>>>>>> - interface changes in !LoadFunc and the interface to external data
>>>>>>>> catalogs.
>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to
>>>>>>>> external data catalogs.
>>>>>>>>
>>>>>>
>>>>
>>>
>
