hadoop-common-dev mailing list archives

From "Jeff Hammerbacher (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1883) Adding versioning to Record I/O
Date Tue, 18 Sep 2007 08:06:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528295 ]

Jeff Hammerbacher commented on HADOOP-1883:
-------------------------------------------

i'm no expert here, but this approach appears to sacrifice the long-term benefit of ever using jute for high-frequency rpc in favor of the short-term benefit of not breaking existing ddls (and it also encourages developers to come up with short names for fields, never a good thing). your approach does make for much nicer ddl files, however. it should be noted that thrift works just fine without the numbering of fields: it generates field identifiers counting down from -1 for unnumbered fields.

this seems like a drawback if one ever wanted to replace hadoop's ipc serialization (about which i know nothing) with jute, though i'm having trouble concocting scenarios of high-frequency, low-payload rpc in the hadoop system.

> Adding versioning to Record I/O
> -------------------------------
>
>                 Key: HADOOP-1883
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1883
>             Project: Hadoop
>          Issue Type: New Feature
>            Reporter: Vivek Ratan
>
> There is a need to add versioning support to Record I/O. Users frequently update DDL files, usually by adding/removing fields, but do not want to change the name of the data structure. They would like older & newer deserializers to read as much data as possible. For example, suppose Record I/O is used to serialize/deserialize log records, each of which contains a message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and can serialize/deserialize itself. Now, suppose newer log records additionally contain a severity level. A user would want to update the definition for a log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use a default value for the severity field), and an old deserializer to read newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the additional ability to read/write type information of a record. The following is a proposal to do this.
> Every Record I/O Record will have type information which represents how the record is structured (what fields it has, what types, etc.). This type information, represented by the class _RecordTypeInfo_, is itself serializable/deserializable. Every Record supports a method _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are expected to serialize this type information (by calling _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, for example, or at the beginning of a file). Using the same DDL as above, here's how we could serialize log records:
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, followed by the contents of various records, all into the same file.
> Every Record also supports a method that allows a user to set a filter for deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a parameter. This filter represents the type information of the data that is being deserialized. When deserializing, the Record uses this filter (if one is set) to figure out what to read. Continuing with our example, here's how we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written in the beginning. read it. 
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>   MyLogRecord log = new MyLogRecord();
>   log.read(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects data to be in the same format as it would serialize it. (Note that a filter can also be provided for serializing, forcing the serializer to write information in the format of the filter, but there is no use case for this functionality yet.)
> What goes in the type information for a record? The type information for each field in a Record is made up of:
>    1. a unique field ID, which is the field name.
>    2. a type ID, which denotes the type of the field (int, string, map, etc.).
> The type information for a composite type contains type information for each of its fields. This approach is somewhat similar to the one taken by [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by Google's Sawzall. The main difference is that we use field names as the field ID, whereas Thrift and Sawzall use user-defined field numbers. While field names take more space, they have the big advantage that no changes are needed to support existing DDLs.
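> For illustration, here is a minimal sketch of what such a type descriptor might hold. The class below and its layout are assumptions for this example, not the committed API; only _RecordTypeInfo_ and the {field ID, type ID} structure come from the proposal:
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> // Sketch only: one {field ID, type ID} pair per field, with nested
> // entries for composite types. The real RecordTypeInfo may differ.
> class FieldTypeInfoSketch {
>   String fieldID;  // the field name serves as the unique field ID
>   int typeID;      // denotes the field's type (int, string, map, etc.)
>   // non-empty only for composite types (records, maps, vectors)
>   List<FieldTypeInfoSketch> children = new ArrayList<FieldTypeInfoSketch>();
>
>   FieldTypeInfoSketch(String fieldID, int typeID) {
>     this.fieldID = fieldID;
>     this.typeID = typeID;
>   }
> }
> {code}
> For the updated _MyLogRecord_ above, the descriptor would carry the tuples {msg, ustring}, {timestamp, long}, and {severity, int}.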
> When deserializing, a Record looks at the filter and compares it with its own set of {field name, field type} tuples. If there is a field in the data that it doesn't know about, it skips it (it knows how many bytes to skip, based on the filter). If the deserialized data does not contain some field values, the Record gives them default values. Additionally, we could allow users to optionally specify default values in the DDL. The location of a field in a structure does not matter, which lets us support reordering of fields. Note that there is no change required to the DDL syntax, and very minimal changes to client code (clients just need to read/write type information, in addition to record data).
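> A rough sketch of that matching rule, using hypothetical stand-ins (plain maps in place of the real stream and Record classes) to show only the skip-unknown/default-missing logic:
> {code}
> import java.util.LinkedHashMap;
> import java.util.Map;
>
> class FilterMatchSketch {
>   // writerFields: {field name -> type ID} from the filter (how the data was written)
>   // readerFields: {field name -> type ID} this Record expects
>   static Map<String, Object> read(Map<String, Integer> writerFields,
>                                   Map<String, Integer> readerFields,
>                                   Map<String, Object> data,
>                                   Map<String, Object> defaults) {
>     Map<String, Object> result = new LinkedHashMap<String, Object>();
>     // walk the writer's layout: read fields we know, skip the rest
>     for (String name : writerFields.keySet()) {
>       if (readerFields.containsKey(name)) {
>         result.put(name, data.get(name));  // known to both: read it
>       }
>       // else: unknown field; the filter's type ID tells us how much to skip
>     }
>     // fields we expect but the writer never wrote get default values
>     for (String name : readerFields.keySet()) {
>       if (!result.containsKey(name)) {
>         result.put(name, defaults.get(name));
>       }
>     }
>     return result;
>   }
> }
> {code}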
> This scheme gives us an additional powerful feature: we can build a generic serializer/deserializer, so that users can read all kinds of data without having access to the original DDL or the original stubs. As long as you know where the type information of a record is serialized, you can read all kinds of data. One can also build a simple UI that displays the structure of data serialized in any generic file. This is very useful for handling data across lots of versions.
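> As a rough sketch of the generic-reader idea (everything past deserializing the _RecordTypeInfo_ is hypothetical; no such dump API is part of this proposal):
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> // read the serialized type information, exactly as in the filter example
> RecordTypeInfo typeInfo = new RecordTypeInfo();
> typeInfo.deserialize(csvIn);
> // with the {field name, type ID} tuples in hand, a generic tool could now
> // walk each record field by field and print names and values, with no
> // generated MyLogRecord stub on the classpath
> {code}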

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

