hadoop-common-dev mailing list archives

From "Pete Wyckoff (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-4224) [Hive] Port Hive's serialization/deserialization to the new Serialization framework
Date Fri, 19 Sep 2008 18:04:44 GMT
[Hive] Port Hive's serialization/deserialization to the new Serialization framework

                 Key: HADOOP-4224
                 URL: https://issues.apache.org/jira/browse/HADOOP-4224
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/hive, contrib/serialization
            Reporter: Pete Wyckoff

Problem 1: legacy data

This is non-trivial because legacy Hive data is written as BytesWritable in the SequenceFile
value. The name of the specific RecordIO/Thrift/X class is stored in the metastore.

If we write our own SequenceFileRecordReader, this is trivial. The standard reader, however,
assumes that the SequenceFile records the actual class name, so we cannot deserialize at this
level: we would just get back a BytesWritable. We need the SequenceFileRecordReader to consult
the Deserializer to find out which class is actually being deserialized.
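
The lookup described above can be sketched in plain Java, without any Hadoop dependency. This is
only an illustration of the idea, not the Hadoop or Hive API: MetastoreStub, LegacyValueReader,
the table name "page_views" and the class name "example.PageView" are all hypothetical, and a
UTF-8 string decoder stands in for a real RecordIO/Thrift deserializer.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the metastore: maps a table name to the name of
// the class that was really serialized into the opaque byte value.
final class MetastoreStub {
    private final Map<String, String> classNameByTable = new HashMap<>();

    void register(String table, String className) {
        classNameByTable.put(table, className);
    }

    String realClassFor(String table) {
        String className = classNameByTable.get(table);
        if (className == null) {
            throw new IllegalArgumentException("No class registered for table " + table);
        }
        return className;
    }
}

// Hypothetical reader wrapper. The file only promises opaque bytes, so the
// wrapper asks the metastore which class those bytes should be decoded as.
final class LegacyValueReader {
    private final MetastoreStub metastore;
    private final Map<String, Function<byte[], Object>> deserializers = new HashMap<>();

    LegacyValueReader(MetastoreStub metastore) {
        this.metastore = metastore;
    }

    void registerDeserializer(String className, Function<byte[], Object> deserializer) {
        deserializers.put(className, deserializer);
    }

    Object read(String table, byte[] rawValue) {
        // The class name comes from the metastore, not from the file header.
        String className = metastore.realClassFor(table);
        Function<byte[], Object> deserializer = deserializers.get(className);
        if (deserializer == null) {
            throw new IllegalStateException("No deserializer for class " + className);
        }
        return deserializer.apply(rawValue);
    }
}

final class LegacyValueReaderDemo {
    public static void main(String[] args) {
        MetastoreStub metastore = new MetastoreStub();
        metastore.register("page_views", "example.PageView");

        LegacyValueReader reader = new LegacyValueReader(metastore);
        // Assumption: a UTF-8 decoder stands in for a real deserializer.
        reader.registerDeserializer("example.PageView",
                bytes -> new String(bytes, StandardCharsets.UTF_8));

        byte[] stored = "user=42,url=/home".getBytes(StandardCharsets.UTF_8);
        System.out.println(reader.read("page_views", stored));
    }
}
```

The point of the indirection is that the reader never trusts the value class recorded in the
file; whatever component knows the real class (here the stub) is consulted on every read.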

I don't know whether writing data as plain BytesWritable and storing the real class somewhere
else is a common problem, but for us it is an issue.

Apart from that, a ThriftSerialization set of classes is coming soon, and we can add similar
classes for our other SerDes.

Problem 2: DynamicSerDe

This is a serializer/deserializer that takes a Thrift DDL at *runtime* and can serialize/deserialize
Thrift and non-Thrift data. Thus, the class name DynamicSerDe alone doesn't give you what you
need, namely the DDL and the protocol used for the serialization: Binary or Control Separated
(and, in theory, JSON, XML, ...).

We can store this DDL in the metastore (and we do), but then DynamicSerDe can be used only
with Hive. Maybe we should output only to TFiles, where we could put the DDL in the file's
metadata.
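
To make the self-describing-file idea concrete, here is a minimal sketch in plain Java of carrying
the DDL and the protocol as key/value file metadata. It does not use the TFile API: the metadata
is modelled as an ordinary Map, and the names DynamicSerDeSpec, WireProtocol, "serde.ddl" and
"serde.protocol", as well as the sample DDL string, are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// The two protocols named in the issue; others (JSON, XML, ...) could be added.
enum WireProtocol { BINARY, CONTROL_SEPARATED }

// Hypothetical description of everything a reader needs beyond the class name.
final class DynamicSerDeSpec {
    // Assumed metadata keys; not defined by Hadoop or Hive.
    static final String DDL_KEY = "serde.ddl";
    static final String PROTOCOL_KEY = "serde.protocol";

    final String ddl;
    final WireProtocol protocol;

    DynamicSerDeSpec(String ddl, WireProtocol protocol) {
        this.ddl = Objects.requireNonNull(ddl, "ddl");
        this.protocol = Objects.requireNonNull(protocol, "protocol");
    }

    // Writer side: store the DDL and protocol alongside the data.
    Map<String, String> toFileMetadata() {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put(DDL_KEY, ddl);
        meta.put(PROTOCOL_KEY, protocol.name());
        return meta;
    }

    // Reader side: rebuild the spec from the file alone, with no metastore.
    static DynamicSerDeSpec fromFileMetadata(Map<String, String> meta) {
        String ddl = meta.get(DDL_KEY);
        String protocol = meta.get(PROTOCOL_KEY);
        if (ddl == null || protocol == null) {
            throw new IllegalArgumentException("File metadata lacks the DDL or the protocol");
        }
        return new DynamicSerDeSpec(ddl, WireProtocol.valueOf(protocol));
    }
}

final class DynamicSerDeSpecDemo {
    public static void main(String[] args) {
        DynamicSerDeSpec written = new DynamicSerDeSpec(
                "struct page_view { 1: i64 user_id, 2: string url }", // sample DDL, assumed
                WireProtocol.CONTROL_SEPARATED);
        Map<String, String> meta = written.toFileMetadata();

        DynamicSerDeSpec read = DynamicSerDeSpec.fromFileMetadata(meta);
        System.out.println(read.protocol + " " + read.ddl);
    }
}
```

With the spec stored in the file, a consumer outside Hive could configure the deserializer without
access to the metastore, which is the trade-off the paragraph above is weighing.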

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
