hadoop-pig-dev mailing list archives

From Mathieu Poumeyrol <poumey...@idm.fr>
Subject Re: [Pig Wiki] Update of "PigMetaData" by AlanGates
Date Fri, 30 May 2008 14:53:29 GMT
Well, it adds a way to *dynamically* parameterize UDFs, without  
changing the pig script itself.

I guess it comes back to the question of "how big a pig script  
is". If we are only considering 5-line pig scripts, where you load  
exactly what you need to compute, crunch numbers and dump them, I agree  
it does not make much sense.

If one starts thinking about something more ETL-ish (which I understand  
is not exactly the main purpose of pig), then one could want to use pig  
to "move" data around or load data from somewhere, do something  
"heavy" that ETL software just cannot cope with efficiently enough  
(build indexes, process images, whatever), and store the results  
somewhere else: a scenario where there can be fields that pig will  
just forward, without touching them.

I admit my background, where we were using the same software for ETL- 
like stuff and heavy processing (that is, mostly building indexes), may  
give me a very biased opinion about pig and what it should be. But I  
would definitely like to use pig for what it is/will be excellent for,  
as well as for stuff where it will be just OK.

So I still think the extension point is worth having. Half my brain is  
already thinking about ways of cheating and using Alan's field list to  
pass other stuff around...

One more concrete example, then, and I will stop bothering you all :) In our  
tools, we use some field metadata to denote that a field's content  
is a primary key to a record. When we copy these field values  
somewhere else, we automatically tag them as a foreign key (instead of  
primary). When we dump the data to disk (to an end-user CD-ROM image  
in most cases), the fact that the column refers to a table also present  
on the disk can be automagically stored, as it is a feature of our  
final format: without the application developer having to re-specify  
the relations, the "UDF store equivalent" is clever enough to store  
the information.

The script of the application developer who prepares a CD-ROM can be  
several screens long, with bits spread over separate files. The data  
model can be quite complex too. In this context, it is important  
that things like "this field acts as a record key" are said only once.
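To make the idea concrete, here is a minimal sketch in plain Java of what a per-field `Map<String, Serializable>` extension point could look like for the primary/foreign key scenario above. All class and key names here are hypothetical illustrations, not an actual Pig API:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical field descriptor carrying an open-ended metadata map,
// as proposed in this thread. None of these names exist in Pig itself.
class FieldMeta {
    final String name;
    final Map<String, Serializable> metadata = new HashMap<>();

    FieldMeta(String name) {
        this.name = name;
    }
}

public class KeyPropagation {
    // A "store-side" helper could rewrite key metadata when a field is
    // copied: a primary key in the source becomes a foreign key in the
    // copy, with a reference back to the source field.
    static FieldMeta copyField(FieldMeta src, String newName) {
        FieldMeta dst = new FieldMeta(newName);
        dst.metadata.putAll(src.metadata);
        if ("primary".equals(src.metadata.get("key"))) {
            dst.metadata.put("key", "foreign");
            dst.metadata.put("references", src.name);
        }
        return dst;
    }

    public static void main(String[] args) {
        FieldMeta id = new FieldMeta("user_id");
        id.metadata.put("key", "primary");

        FieldMeta copy = copyField(id, "owner_id");
        System.out.println(copy.metadata.get("key"));        // foreign
        System.out.println(copy.metadata.get("references")); // user_id
    }
}
```

The point is that pig itself never interprets the map; only the coordinated UDFs on the load and store sides agree on the keys, which is exactly why an opaque, serializable map suffices as the extension point.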

On 30 May 08, at 16:13, pi song wrote:

> More, adding metadata is conceptually adding another way to  
> parameterize load/store functions. Making UDFs parameterizable by  
> other UDFs is therefore also functionally possible, but I just  
> couldn't think of any good use cases.
> On Sat, May 31, 2008 at 12:09 AM, pi song <pi.songs@gmail.com> wrote:
>> Just out of curiosity. If you say the UDF store in your example can  
>> somehow "learn" from the UDF load, that information still might not  
>> be useful, because between "load" and "store" you've got processing  
>> logic which might or might not alter the validity of information  
>> directly transferred from "load" to "store". An example would be: I  
>> load a list of numbers and then convert them to strings. The  
>> information on the UDF store side is then no longer applicable.
>> Don't you think the cases where this concept can be useful are very  
>> rare?
>> Pi
>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol  
>> <poumeyrol@idm.fr>
>> wrote:
>>> Pi,
>>> Well... I was thinking... all three of them, actually. Alan's list  
>>> is quite comprehensive, so it is not that easy to find a convincing  
>>> example, but I'm sure UDF developers may need some additional  
>>> information to communicate metadata from one UDF to another.
>>> It does not make sense if you think "one UDF function", but it is  
>>> a way to have two coordinated UDFs communicating.
>>> For instance, the developer of a jdbc pig "connector" will  
>>> typically write a UDF load and a UDF store. What if he wants the  
>>> loader to discover the field collection (case 3, Self describing  
>>> data in Alan's page) from jdbc and propagate the exact column type  
>>> of a given field (as in "VARCHAR(42)"), to create it the right way  
>>> in the UDF store? Or the table name? Or the fact that a column is  
>>> indexed, a primary key, a foreign key constraint, some encoding  
>>> info... He may also want to develop a UDF pipeline function that  
>>> would perform some foreign key validation against the database at  
>>> some point in his script. Having the information in the metadata  
>>> may be useful.
>>> Some other fields of application we cannot think of today may need  
>>> some completely different metadata. My whole point is: Pig should  
>>> provide some metadata extension point.
>>> On 30 May 08, at 13:54, pi song wrote:
>>>> I don't get it, Mathieu. UDF is a very broad term. It could be UDF  
>>>> Load, UDF Store, or UDF as a function in the pipeline. Can you  
>>>> explain a bit more?
>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <poumeyrol@idm.fr>
>>>> wrote:
>>>>> All,
>>>>> Looking at the very extensive list of types of file specific  
>>>>> metadata, I think (from experience) that a UDF may need to attach  
>>>>> some information (any information, actually) to a given field (or  
>>>>> file), to be retrieved by another UDF downstream.
>>>>> What about adding a Map<String, Serializable> to each file and  
>>>>> each field?
>>>>> --
>>>>> Mathieu
>>>>> On 30 May 08, at 01:24, pi song wrote:
>>>>>> Alan,
>>>>>> I will start thinking about this as well. When do you want to  
>>>>>> start the implementation?
>>>>>> Pi
>>>>>> On 5/29/08, Apache Wiki <wikidiffs@apache.org> wrote:
>>>>>>> Dear Wiki user,
>>>>>>> You have subscribed to a wiki page or wiki category on "Pig 

>>>>>>> Wiki" for
>>>>>>> change notification.
>>>>>>> The following page has been changed by AlanGates:
>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> information, histograms, etc.
>>>>>>> == Pig Interface to File Specific Metadata ==
>>>>>>> - Pig should support four options with regard to file specific metadata:
>>>>>>> + Pig should support four options with regard to reading file specific metadata:
>>>>>>> 1.  No file specific metadata available.  Pig uses the file as input with no knowledge of its content.  All data is assumed to be !ByteArrays.
>>>>>>> 2.  User provides schema in the script.  For example, `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>> 3.  Self describing data.  Data may be in a format that describes the schema, such as JSON.  Users may also have other proprietary ways to store information about the data in a file either in the file itself or in an associated file.  Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only.  It will need to be expanded to support other types of information about the file.
>>>>>>> 4.  Input from a data catalog.  Pig needs to be able to query external data catalog to acquire information about a file.  All the same information available in option 3 should be available via this interface.  This interface does not yet exist and needs to be designed.
>>>>>>> + It should support options 3 and 4 for writing file specific metadata as well.
>>>>>>> +
>>>>>>> == Pig Interface to Global Metadata ==
>>>>>>> - An interface will need to be designed for pig to interface to an external data catalog.
>>>>>>> + An interface will need to be designed for pig to read from and write to an external data catalog.
>>>>>>> == Architecture of Pig Interface to External Data Catalog ==
>>>>>>> Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.).  To facilitate this
>>>>>>> - pig will develop a generic interface that allows it to make specific types of queries to a data catalog.  Drivers will then need to be written to implement
>>>>>>> + pig will develop a generic interface that allows it to query and update a data catalog.  Drivers will then need to be written to implement
>>>>>>> that interface and connect to a specific type of data catalog.
>>>>>>> == Types of File Specific Metadata Pig Will Use ==
>>>>>>> - Pig should be able to acquire the following types of information about a file via either self description or an external data catalog.  This is not to say
>>>>>>> + Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog.  This is not to say
>>>>>>> that every self describing file or external data catalog must support every one of these items.  This is a list of items pig may find useful and should be
>>>>>>> - able to query for.  If the metadata source cannot provide the information, pig will simply not make use of it.
>>>>>>> + able to query for and create.  If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
>>>>>>> * Field layout (already supported)
>>>>>>> * Field types (already supported)
>>>>>>> * Sortedness of the data, both key and direction (ascending/descending)
>>>>>>> @@ -52, +54 @@
>>>>>>> == Priorities ==
>>>>>>> Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata.  The first step should be to define the
>>>>>>> - interface changes in !LoadFunc and the interface to external data catalogs.
>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
