hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-207) Change SerDe API to allow skipping unused columns
Date Mon, 05 Jan 2009 17:59:44 GMT

    [ https://issues.apache.org/jira/browse/HIVE-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660832#action_12660832 ]

Joydeep Sen Sarma commented on HIVE-207:
----------------------------------------

The deserializer API already allows accessing one column at a time. The deserialize() call
doesn't have to do any work: it only has to return a handle for lazy deserialization (the
handle can, for example, contain a reference to a byte array). Later on, specific operators
invoke the ObjectInspector interfaces to access particular columns, and at that point the
ObjectInspector implementation can deserialize just the relevant part of the byte array.
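
A minimal sketch of this pattern in Java. All names here (LazyRow, LazyDeserializer,
LazyRowInspector) are hypothetical and are not Hive's actual SerDe or ObjectInspector
interfaces; the toy row format (every column is a 4-byte big-endian int) is also an
assumption made only to keep the example short.

```java
import java.nio.ByteBuffer;

// Hypothetical names throughout; this is not Hive's real SerDe/ObjectInspector API.

/** Handle returned by deserialize(): it only keeps a reference to the raw bytes. */
final class LazyRow {
    final byte[] bytes;

    LazyRow(byte[] bytes) {
        this.bytes = bytes;
    }
}

/** Deserializer whose deserialize() call does no decoding work. */
final class LazyDeserializer {
    LazyRow deserialize(byte[] raw) {
        return new LazyRow(raw); // no copy, no parsing
    }
}

/** Inspector that decodes a single column only when an operator asks for it. */
final class LazyRowInspector {
    int decodeCount = 0; // number of columns decoded so far, for illustration only

    // Assumed toy format: column i occupies bytes [4*i, 4*i + 4) as a big-endian int.
    int getColumn(LazyRow row, int index) {
        decodeCount++;
        return ByteBuffer.wrap(row.bytes).getInt(index * 4);
    }
}
```

With this shape, a query that never asks the inspector for a column never pays for
decoding it; only the columns an operator actually requests are touched.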

The default reflection-based ObjectInspector does not work this way, but that is a matter
of implementation (we just haven't gotten around to lazy deserialization, and in any case
it depends on the serialization format).

If you can try implementing lazy deserialization for Protocol Buffers, that will tell us what
else needs to be added in terms of interfaces. (Right now I am confident that we have enough
interfaces to do lazy deserialization of, for example, a delimited string format.)
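
For the delimited string case, lazy access could look roughly like the following Java
sketch. LazyDelimitedRow and getField are hypothetical names, not existing Hive classes,
and the single-byte delimiter and UTF-8 encoding are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical lazy view over one delimited row; not an existing Hive class. */
final class LazyDelimitedRow {
    private final byte[] bytes;
    private final byte delimiter;

    LazyDelimitedRow(byte[] bytes, byte delimiter) {
        this.bytes = bytes;         // kept by reference, nothing is parsed here
        this.delimiter = delimiter;
    }

    /** Returns field `index` as a string, or null if the row has fewer fields. */
    String getField(int index) {
        int start = 0;
        // Skip earlier fields by scanning for delimiters only; their contents
        // are never converted to strings.
        for (int skipped = 0; skipped < index; skipped++) {
            while (start < bytes.length && bytes[start] != delimiter) {
                start++;
            }
            if (start == bytes.length) {
                return null; // ran out of fields before reaching `index`
            }
            start++; // step over the delimiter
        }
        int end = start;
        while (end < bytes.length && bytes[end] != delimiter) {
            end++;
        }
        // Only the requested field is materialized.
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Earlier fields still have to be scanned to find the delimiters, but no objects are created
for them, and fields after the requested one are not read at all.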

> Change SerDe API to allow skipping unused columns
> -------------------------------------------------
>
>                 Key: HIVE-207
>                 URL: https://issues.apache.org/jira/browse/HIVE-207
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: David Phillips
>
> A deserializer shouldn't have to deserialize columns that are never used by the query
> processor.  A serializer shouldn't have to examine unused columns that are known to always
> be null.
> As an example, we store data as a Protocol Buffer structure with ~60 fields.  Running
> a "select count(1)" currently requires deserializing all fields, which includes checking if
> they exist and formatting the data appropriately.  This is expensive and unnecessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

