hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-653) Make fieldsToRead work in loader
Date Fri, 06 Feb 2009 22:11:04 GMT

    [ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671355#action_12671355 ]

Pradeep Kamath commented on PIG-653:
------------------------------------

Interface for passing required fields information to the loader
Proposal
Two new classes will be introduced for passing information about the required fields in the
API call to the loader.
{code}
import java.util.List;

class RequiredField {
    // Name of the field; null if not supplied.
    String alias;
    // 0-based index (position) of the required field; -1 if not supplied.
    int index;
    // Sub fields required within this field (this could be a list of hash keys, for example).
    // null if the entire field is required and no specific sub fields are required.
    // In the initial implementation only one level of subfields will be populated.
    List<RequiredField> subFields;
    // Type of this field: any current Pig DataType, as specified by the constants in the
    // DataType class. A new type BAG_OF_MAP will be added to represent a bag of maps field.
    byte type;

    // Constructor, getters and setters follow.
    // Getters: getAlias(), getIndex(), getSubFields(), getType()
    // Setters: setAlias(), setIndex(), setSubFields(), setType()
}
{code}

NOTE: Both alias and index could be set. The index is the position the field would have, as
seen by Pig, if the loader returned all fields.

For performance, when a single key in a map is requested, the loader should ideally return a
map containing just that key. Likewise, when the required field is a key in a bag of maps field,
the loader is expected to return a bag of maps in which each map contains that key (preferably
only that key, since this reduces the data handed over by the loader).
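
The pruning described above could look roughly like the following. This is only a sketch: it assumes the loader has already materialized the field as a java.util.Map, it models a bag as a List of maps, and the names MapPruner, pruneMap and pruneBagOfMaps are hypothetical and not part of the Pig API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Pig API.
class MapPruner {
    // Returns a new map holding only the requested keys that are present in the input.
    static Map<String, Object> pruneMap(Map<String, Object> full, List<String> requiredKeys) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (String key : requiredKeys) {
            if (full.containsKey(key)) {
                pruned.put(key, full.get(key));
            }
        }
        return pruned;
    }

    // Applies the same pruning to every map in a bag (modelled here as a List of maps).
    static List<Map<String, Object>> pruneBagOfMaps(List<Map<String, Object>> bag,
                                                    List<String> requiredKeys) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> m : bag) {
            result.add(pruneMap(m, requiredKeys));
        }
        return result;
    }
}
```

Keys that are requested but absent from a map are simply left out of the pruned map in this sketch; whether a real loader should do that is not something the proposal specifies.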

{code}
class RequiredFieldResponse {
    // true if the loader will return a schema containing only the List of RequiredFields,
    // in that order; false if the loader will return all fields in the data.
    boolean requiredFieldRequestHonored;
}
{code}

The reason for having a RequiredFieldResponse class encapsulate the boolean is to allow for
future extensibility. For example, in the future the loader may be able to honor all top-level
field requests but not subfield requests in hashes, so it may hand back whole top-level maps in
response to subfield requests. The loader will then need to tell the caller which fields will
be returned exactly as requested and which will be sent as top-level fields (even though the
request was for subfields). For the first pass, though, it is all or none, conveyed through
the boolean.

The API call in LoadFunc will change from 
{code}
void fieldsToRead(Schema schema) 
{code}
to
{code}
RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields, boolean allFieldsRequired);
{code}

NOTE:
1.	The loader is expected to return the required fields in exactly the same order as in the
List provided in the above call.
2.	The boolean flag allFieldsRequired is set to true when all fields are required. The loader
should check this flag first and use the List<RequiredField> ONLY if the flag is false.
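
A loader following these two notes might be sketched as below. Everything here is illustrative: RequiredField and RequiredFieldResponse are cut-down stand-ins for the proposed classes, SketchLoader and project are hypothetical names rather than Pig classes, and a row is modelled as a plain List. The sketch also assumes that a loader which returns every field reports false, and that a loader unable to resolve aliases declines the whole request (all or none).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the proposed classes, reduced to what this sketch needs.
class RequiredField {
    String alias;
    int index;
    RequiredField(String alias, int index) { this.alias = alias; this.index = index; }
}

class RequiredFieldResponse {
    boolean requiredFieldRequestHonored;
    RequiredFieldResponse(boolean honored) { this.requiredFieldRequestHonored = honored; }
}

// Hypothetical loader, not an actual Pig LoadFunc.
class SketchLoader {
    // Positions to emit, in request order; null means "emit all fields".
    private List<Integer> positions = null;

    RequiredFieldResponse fieldsToRead(List<RequiredField> requiredFields,
                                       boolean allFieldsRequired) {
        // Check the flag first; the list is consulted only when the flag is false.
        if (allFieldsRequired) {
            positions = null;
            return new RequiredFieldResponse(false); // assumption: all fields are returned
        }
        List<Integer> requested = new ArrayList<>();
        for (RequiredField f : requiredFields) {
            if (f.index < 0) {
                // This sketch cannot resolve aliases, so it declines the whole request.
                positions = null;
                return new RequiredFieldResponse(false);
            }
            requested.add(f.index);
        }
        positions = requested;
        return new RequiredFieldResponse(true);
    }

    // Emits the fields of a full row in exactly the order they were requested.
    List<Object> project(List<Object> fullRow) {
        if (positions == null) {
            return fullRow;
        }
        List<Object> out = new ArrayList<>();
        for (int p : positions) {
            out.add(fullRow.get(p));
        }
        return out;
    }
}
```

With a request for positions 2 and then 0, project returns the third field followed by the first, which is the ordering guarantee the first note asks for.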

Use Cases
=========

Use Cases which only use aliases
================================
{noformat}
1.	Required fields are columns x (int), y (long)
[
{
	alias=>x,
	index => -1,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>y,
	index => -1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are m1#key1 (map subcolumn), b1#key2 (subcolumn from a bag of maps)
[
{
	alias => m1,
	index => -1,
	subfields => [
		{
			alias => key1,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias => b1,
	index => -1,
	subfields => [
		{
			alias => key2,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]

3.	Required fields are m2#(key3, key4) (map subcolumns), b2#(key5, key6) (subcolumns from bag of maps)
[
{
	alias => m2,
	index => -1,
	subfields => [
		{
			alias => key3,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key4,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias => b2,
	index => -1,
	subfields => [
		{
			alias => key5,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key6,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]
{noformat}

Use Cases which use positional indices
======================================
{noformat}
1.	Required fields are columns $0 (int), $1 (long)
[
{
	alias=>null,
	index => 0,
	subfields => null,
	type => DataType.INTEGER
},
{
	alias=>null,
	index => 1,
	subfields => null,
	type => DataType.LONG
}
]

2.	Required fields are $0#key1 (map subcolumn), $2#key2 (subcolumn from a bag of maps)
[
{
	alias => null,
	index => 0,
	subfields => [
		{
			alias => key1,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias => null,
	index => 2,
	subfields => [
		{
			alias => key2,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]

3.	Required fields are $5#(key3, key4) (map subcolumns), $3#(key5, key6) (subcolumns from bag of maps)
[
{
	alias => null,
	index => 5,
	subfields => [
		{
			alias => key3,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key4,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.MAP
},
{
	alias => null,
	index => 3,
	subfields => [
		{
			alias => key5,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		},
		{
			alias => key6,
			index => -1,
			subfields => null, // only one sublevel in the initial implementation, so this has to be null!
			type => DataType.BYTEARRAY
		}
	],
	type => DataType.BAG_OF_MAP
}
]
{noformat}
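
For illustration, a request shaped like positional use case 3 could be assembled in code along these lines. This is a sketch under assumptions: FieldRequest is a hypothetical stand-in for the proposed RequiredField with the type field left out (BAG_OF_MAP does not exist yet), and the key factory, subKeys and PositionalUseCase are invented helper names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed RequiredField; the type field is omitted.
class FieldRequest {
    final String alias;                 // null when the field is addressed by position
    final int index;                    // -1 when the field is addressed by alias
    final List<FieldRequest> subFields; // null when the entire field is required

    FieldRequest(String alias, int index, List<FieldRequest> subFields) {
        this.alias = alias;
        this.index = index;
        this.subFields = subFields;
    }

    // Convenience factory for a map key request: alias set, no index, no further nesting.
    static FieldRequest key(String name) {
        return new FieldRequest(name, -1, null);
    }

    // Names of the requested sub fields, or an empty list if the whole field is required.
    List<String> subKeys() {
        List<String> keys = new ArrayList<>();
        if (subFields != null) {
            for (FieldRequest f : subFields) {
                keys.add(f.alias);
            }
        }
        return keys;
    }
}

class PositionalUseCase {
    // Builds the request for $5#(key3, key4) and $3#(key5, key6), in that order.
    static List<FieldRequest> build() {
        return Arrays.asList(
            new FieldRequest(null, 5,
                Arrays.asList(FieldRequest.key("key3"), FieldRequest.key("key4"))),
            new FieldRequest(null, 3,
                Arrays.asList(FieldRequest.key("key5"), FieldRequest.key("key6"))));
    }
}
```

The list order (index 5 before index 3) is the order in which the loader would be expected to return the fields.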



> Make fieldsToRead work in loader
> --------------------------------
>
>                 Key: PIG-653
>                 URL: https://issues.apache.org/jira/browse/PIG-653
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>
> Currently pig does not call the fieldsToRead function in LoadFunc, thus it does not provide
> information to load functions on what fields are needed.  We need to implement a visitor that
> determines (where possible) which fields in a file will be used and relays that information
> to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

