hadoop-pig-dev mailing list archives

From Mathieu Poumeyrol <poumey...@idm.fr>
Subject Re: [Pig Wiki] Update of "PigMetaData" by AlanGates
Date Fri, 30 May 2008 11:14:48 GMT
All,

Looking at the very extensive list of types of file-specific
metadata, I think (from experience) that a UDF may need to
attach some information (any information, actually) to a given field
(or file), to be retrieved by another UDF downstream.

What about adding a Map<String, Serializable> to each file and each
field?
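
To make the idea concrete, here is a minimal sketch in Java of what I have in mind. All class and method names (Annotations, FileMetadata, field(), file()) are hypothetical and are not part of Pig's API; the only assumption is that values are Serializable so they can travel with the job.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for free-form annotations; not an existing Pig class.
class Annotations implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Serializable> values = new HashMap<>();

    void put(String key, Serializable value) { values.put(key, value); }

    // Returns null when no value was recorded under this key.
    Serializable get(String key) { return values.get(key); }
}

// Hypothetical file-level metadata: one annotation map for the file itself,
// plus one annotation map per field, keyed by field name.
class FileMetadata implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Annotations file = new Annotations();
    private final Map<String, Annotations> fields = new HashMap<>();

    Annotations file() { return file; }

    // Creates the per-field map on first access.
    Annotations field(String name) {
        return fields.computeIfAbsent(name, k -> new Annotations());
    }
}

class AnnotationDemo {
    public static void main(String[] args) {
        FileMetadata meta = new FileMetadata();
        // An upstream UDF records something about a field and about the file...
        meta.field("url").put("normalized", Boolean.TRUE);
        meta.file().put("source", "crawler-A");
        // ...and a downstream UDF reads it back.
        System.out.println(meta.field("url").get("normalized"));
        System.out.println(meta.file().get("source"));
    }
}
```

The keys are plain strings on purpose: Pig would not need to understand them, only to carry them from one UDF to the next.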

-- 
Mathieu

On 30 May 08, at 01:24, pi song wrote:

> Alan,
>
> I will start thinking about this as well. When do you want to start the
> implementation?
>
> Pi
>
> On 5/29/08, Apache Wiki <wikidiffs@apache.org> wrote:
>>
>> Dear Wiki user,
>>
>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>> change notification.
>>
>> The following page has been changed by AlanGates:
>> http://wiki.apache.org/pig/PigMetaData
>>
>>
>> ------------------------------------------------------------------------------
>> information, histograms, etc.
>>
>> == Pig Interface to File Specific Metadata ==
>> - Pig should support four options with regard to file specific  
>> metadata:
>> + Pig should support four options with regard to reading file  
>> specific
>> metadata:
>>  1.  No file specific metadata available.  Pig uses the file as  
>> input with
>> no knowledge of its content.  All data is assumed to be !ByteArrays.
>>  2.  User provides schema in the script.  For example, `A = load  
>> 'myfile'
>> as (a: chararray, b: int);`.
>>  3.  Self describing data.  Data may be in a format that describes  
>> the
>> schema, such as JSON.  Users may also have other proprietary ways  
>> to store
>> information about the data in a file either in the file itself or  
>> in an
>> associated file.  Changes to the !LoadFunc interface made as part  
>> of the
>> pipeline rework support this for data type and column layout only.   
>> It will
>> need to be expanded to support other types of information about the  
>> file.
>>  4.  Input from a data catalog.  Pig needs to be able to query an  
>> external
>> data catalog to acquire information about a file.  All the same  
>> information
>> available in option 3 should be available via this interface.  This
>> interface does not yet exist and needs to be designed.
>>
>> + It should support options 3 and 4 for writing file specific  
>> metadata as
>> well.
>> +
>> == Pig Interface to Global Metadata ==
>> - An interface will need to be designed for pig to interface to an  
>> external
>> data catalog.
>> + An interface will need to be designed for pig to read from and  
>> write to
>> an external data catalog.
>>
>> == Architecture of Pig Interface to External Data Catalog ==
>> Pig needs to be able to connect to various types of external data  
>> catalogs
>> (databases, catalogs stored in flat files, web services, etc.).  To
>> facilitate this
>> - pig will develop a generic interface that allows it to make  
>> specific
>> types of queries to a data catalog.  Drivers will then need to be  
>> written to
>> implement
>> + pig will develop a generic interface that allows it to query and  
>> update a
>> data catalog.  Drivers will then need to be written to implement
>> that interface and connect to a specific type of data catalog.
>>
>> == Types of File Specific Metadata Pig Will Use ==
>> - Pig should be able to acquire the following types of information  
>> about a
>> file via either self description or an external data catalog.  This  
>> is not
>> to say
>> + Pig should be able to acquire and record the following types of
>> information about a file via either self description or an external  
>> data
>> catalog.  This is not to say
>> that every self describing file or external data catalog must  
>> support every
>> one of these items.  This is a list of items pig may find useful  
>> and should
>> be
>> - able to query for.  If the metadata source cannot provide the
>> information, pig will simply not make use of it.
>> + able to query for and create.  If the metadata source cannot  
>> provide or
>> store the information, pig will simply not make use of it or record  
>> it.
>>  * Field layout (already supported)
>>  * Field types (already supported)
>>  * Sortedness of the data, both key and direction (ascending/ 
>> descending)
>> @@ -52, +54 @@
>>
>>
>> == Priorities ==
>> Given that the usage for global metadata is unclear, the priority  
>> will be
>> placed on supporting file specific metadata.  The first step should  
>> be to
>> define the
>> - interface changes in !LoadFunc and the interface to external data
>> catalogs.
>> + interface changes in !LoadFunc, !StoreFunc and the interface to  
>> external
>> data catalogs.
>>
>>

