hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2030) Some changes to Record I/O interfaces
Date Tue, 06 Nov 2007 20:26:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540548 ]

Milind Bhandarkar commented on HADOOP-2030:

The tag was added to support XML and other formats such as JSON, which have the ability to
create a class dynamically and refer to fields natively by their names. The DDL (or typeinfo)
was not fed to the RecordOutput and RecordInput interfaces. If the typeinfo is fed to the
constructors of RecordInput/RecordOutput, then the need for the tag is lessened. (It provides
an opportunity for better error checking for xml serialized records to have a fieldname in

Also, the serialize and deserialize methods generated for each class used to call startRecord
and endRecord. This meant that the record being serialized did not need to know whether
it was a top-level record or an embedded record. With your proposal, either serialize/deserialize
would have to know it, or the user would have to call methods on RecordOutput/RecordInput to
start/end the top-level record.

I agree with you that having a string either contain a name or be empty is a bad indicator of a
top-level record, but it did simplify the serialization interfaces. The user of the generated class
did not have to know the RecordInput or RecordOutput methods.
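
To make the trade-off concrete, here is a minimal sketch of the tag-based style described above. TaggedOutput, TaggedCsvOutput, InnerRecord and OuterRecord are hypothetical, simplified stand-ins and not the actual org.apache.hadoop.record classes; the only behaviour taken from this thread is that an empty tag means "top-level" and that nested records are wrapped in "s{}" in CSV.

```java
// Hypothetical, simplified stand-in for the tag-based interface
// (NOT the real org.apache.hadoop.record.RecordOutput).
interface TaggedOutput {
    void startRecord(String tag);
    void endRecord(String tag);
    void writeInt(int value, String tag);
}

// CSV-like writer: an empty tag is taken to mean "top-level record".
class TaggedCsvOutput implements TaggedOutput {
    private final StringBuilder buf = new StringBuilder();
    private boolean firstField = true;

    private void separator() {
        if (!firstField) {
            buf.append(',');
        }
        firstField = false;
    }

    public void startRecord(String tag) {
        if (!tag.isEmpty()) {          // nested record: wrap in s{...}
            separator();
            buf.append("s{");
            firstField = true;
        }
    }

    public void endRecord(String tag) {
        if (tag.isEmpty()) {           // top-level record: end the line
            buf.append('\n');
            firstField = true;
        } else {
            buf.append('}');
            firstField = false;
        }
    }

    public void writeInt(int value, String tag) {
        separator();                   // the tag itself is ignored in CSV
        buf.append(value);
    }

    String result() {
        return buf.toString();
    }
}

// Stand-ins for generated stubs: each always calls startRecord/endRecord
// and never has to know whether it is top-level or embedded.
class InnerRecord {
    int x = 1;
    int y = 2;

    void serialize(TaggedOutput out, String tag) {
        out.startRecord(tag);
        out.writeInt(x, "x");
        out.writeInt(y, "y");
        out.endRecord(tag);
    }
}

class OuterRecord {
    int id = 7;
    InnerRecord inner = new InnerRecord();

    void serialize(TaggedOutput out, String tag) {
        out.startRecord(tag);
        out.writeInt(id, "id");
        inner.serialize(out, "inner"); // non-empty tag marks it as nested
        out.endRecord(tag);
    }
}
```

In this sketch the caller only writes new OuterRecord().serialize(out, "") and never calls a start/end method on the output directly, which is the simplification referred to above.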

> Some changes to Record I/O interfaces
> -------------------------------------
>                 Key: HADOOP-2030
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2030
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Vivek Ratan
> I wanted to suggest some changes to the Record I/O interfaces. 
> Under org.apache.hadoop.record, _RecordInput_ and _RecordOutput_ are the interfaces to
serialize and deserialize basic types for Java-generated stubs. All the methods in _RecordInput_
and _RecordOutput_ take a parameter, a string, called 'tag'. As far as I can see, this tag
is used only for XML-based serialization, to write out the name of the field that is being
serialized. A lot of the methods ignore it. My proposal is to eliminate this parameter, for
a number of reasons: 
> - We don't need to write the name of a field when serializing in XML. None of the other
serializers (for binary or CSV) write out the name of a field - we only write the field value.
The generated stubs know which field is associated with which value (and now, with type information
support, the field name is part of the type information and is not required to be serialized
along with the field data). In fact, even in XML, I don't see the field name being read back
in, so it serves no purpose whatsoever. 
> - The tag is used occasionally in the error message, but again this can be handled better
by the caller of _RecordInput_ and _RecordOutput_. 
> - The tag is also used to detect whether a record is nested or not. In CSV, we wrap nested
records with "s{}". We also want to know whether a record is nested or the top-most, so that
we add a newline at the end of a top-most record. If a tag is empty, it is assumed that the
record is the top-most. This is using the tag parameter to mean something else. It's far more
readable to just pass in a boolean to _startRecord()_ and _endRecord()_ which directly indicates
whether the record is nested or not. Or, add two additional methods to _RecordOutput_ and
_RecordInput_: _start()_ and _stop()_, which are called at the beginning and end of every
top-most record while _startRecord()_ and _endRecord()_ are used only for nested records.
The former's slightly better, IMO, but either option is much better than using an empty tag
to indicate a top-level record.
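
The boolean variant suggested in the last point could look roughly like the sketch below. NestingAwareOutput, NestingAwareCsvOutput, Address and Person are hypothetical names made up for illustration; this is not the existing _RecordOutput_ interface, and the CSV details beyond "s{}" wrapping and the trailing newline are assumptions.

```java
// Hypothetical, simplified interface with the tag parameter removed
// (NOT the real org.apache.hadoop.record.RecordOutput).
interface NestingAwareOutput {
    void startRecord(boolean nested);
    void endRecord(boolean nested);
    void writeInt(int value);
}

class NestingAwareCsvOutput implements NestingAwareOutput {
    private final StringBuilder buf = new StringBuilder();
    private boolean firstField = true;

    private void separator() {
        if (!firstField) {
            buf.append(',');
        }
        firstField = false;
    }

    public void startRecord(boolean nested) {
        if (nested) {                  // nested records are wrapped in s{...}
            separator();
            buf.append("s{");
            firstField = true;
        }
    }

    public void endRecord(boolean nested) {
        if (nested) {
            buf.append('}');
            firstField = false;
        } else {                       // top-most record ends with a newline
            buf.append('\n');
            firstField = true;
        }
    }

    public void writeInt(int value) {
        separator();
        buf.append(value);
    }

    String result() {
        return buf.toString();
    }
}

// Stand-ins for generated stubs. The nesting flag is explicit instead of
// being inferred from an empty tag.
class Address {
    int zip = 94089;

    void serialize(NestingAwareOutput out, boolean nested) {
        out.startRecord(nested);
        out.writeInt(zip);
        out.endRecord(nested);
    }
}

class Person {
    int id = 7;
    Address address = new Address();

    void serialize(NestingAwareOutput out, boolean nested) {
        out.startRecord(nested);
        out.writeInt(id);
        address.serialize(out, true);  // embedded records are always nested
        out.endRecord(nested);
    }
}
```

A top-level call would then be new Person().serialize(out, false). With the _start()_/_stop()_ alternative, the flag would disappear and the caller would invoke those two methods around the top-most record instead.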
> The issue with tags brings up a related issue. Sometimes, we may need to pass in additional
information to _RecordInput_ or _RecordOutput_. For example, suppose we do need to write the
field name along with the field value. We can think of such a requirement in two ways. A)
Such decisions of what to serialize/deserialize are independent of the format/protocol that
the data is serialized in. If we want to write something else, that should be written separately
by the stub. So, if we want to serialize the field name before a field value, a stub should
call _RecordOutput.writeString(<field name>)_ first, followed by _RecordOutput.writeInt(<field
value>)_. The methods in _RecordInput_ and _RecordOutput_ are the lowest-level methods
and they should just be concerned with writing individual types. B) What if a protocol wants
to write things differently? For example, we may want to write the field name before the field
value for XML only (for debugging's sake, or for whatever else). Or it may be that the field
name and field value need to be enclosed in certain tags that can't happen if you write them
separately. In these cases, methods in _RecordInput_ and _RecordOutput_ need to be passed
additional information. This can be done by providing an optional parameter for these methods.
Maybe a structure/class containing field information, or a reference to the field itself (the
Tag parameter was meant to serve a similar purpose, but just passing in a String may be inadequate).
For now, there is no real need for either of these situations, so we should be OK with getting
rid of the tag parameter. 
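
As a rough illustration of the two approaches, the sketch below shows A) a stub writing the field name itself through the plain type-level methods, and B) an output that accepts optional field information and encloses name and value together. All names here (PlainOutput, ListOutput, CounterStub, FieldInfo, InfoAwareOutput, XmlLikeOutput) and the XML-like layout are hypothetical and only meant to show the shape of each option.

```java
import java.util.ArrayList;
import java.util.List;

// Option A: the output only knows how to write individual types.
interface PlainOutput {
    void writeString(String value);
    void writeInt(int value);
}

class ListOutput implements PlainOutput {
    final List<String> tokens = new ArrayList<>();

    public void writeString(String value) {
        tokens.add(value);
    }

    public void writeInt(int value) {
        tokens.add(Integer.toString(value));
    }
}

// Stand-in for a generated stub that decides by itself to emit the field name.
class CounterStub {
    static void serializeWithName(PlainOutput out, int count) {
        out.writeString("count");      // field name, written as ordinary data
        out.writeInt(count);           // field value
    }
}

// Option B: the method takes optional field information, so the protocol
// can decide how (and whether) to use it.
class FieldInfo {
    final String name;

    FieldInfo(String name) {
        this.name = name;
    }
}

interface InfoAwareOutput {
    void writeInt(int value, FieldInfo info);   // info may be null
}

class XmlLikeOutput implements InfoAwareOutput {
    private final StringBuilder buf = new StringBuilder();

    public void writeInt(int value, FieldInfo info) {
        buf.append("<field");
        if (info != null) {
            // No escaping of the name here; a real writer would need it.
            buf.append(" name=\"").append(info.name).append('"');
        }
        buf.append('>').append(value).append("</field>");
    }

    String result() {
        return buf.toString();
    }
}
```

With option A the name and value are two independent writes, so a format cannot wrap them in a single element; option B leaves that choice to the format at the cost of a wider method signature.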
> Similar changes need to be done to the C++ side, where we have _OArchive_ and _IArchive_:

> - The tag parameter needs to be removed
> - _startRecord()_ and _endRecord()_ in _OArchive_ and _IArchive_ need to take a boolean
parameter that indicates whether the record is nested or not
> - Currently, both _startRecord()_ and _endRecord()_ in _IArchive_ take an additional
parameter, a reference to a hadoop record. This is never used anywhere and is not required (the corresponding
methods in _RecordInput_ and _RecordOutput_ don't take any parameters, which is the right
thing to do), and should be removed. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
