hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-941) Make Hadoop Record I/O Easier to use outside Hadoop
Date Sun, 25 Feb 2007 00:51:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475673

Milind Bhandarkar commented on HADOOP-941:


Thanks for your clarification. For the entire duration of this debate I have been thinking
from the record I/O users' perspective. Here is a recent example that happened with Hadoop.
We were using Lucene's PriorityQueue class, and that was causing a dependency on the Lucene
jar. This was deemed unacceptable, so we copied Lucene's PriorityQueue class into Hadoop and
got rid of the Lucene dependency.

I see Hadoop record I/O users' dependency on the entire Hadoop jar the same way. Users of
Hadoop record I/O notice that they are essentially using only two interfaces (not even implementations)
from outside of record I/O. It is obvious that the dependency on the Hadoop jar is something they
would like to avoid (similar to what we did with the dependency on the entire Lucene jar).

If one sees things from the record I/O users' perspective, I think allowing users to generate
code that can be used outside of the Hadoop context is the right thing to do.

Maybe I misunderstood your point earlier, which seemed to me to be an argument for "one artifact
per project". I agreed to that. That is why proposal #10 has not been implemented in these
changes, nor is it proposed anymore. Hadoop should produce only one tar file as an artifact.
That is why I have not modified the build process to produce a separate record I/O artifact. Instead,
for record I/O's users, I have given a simple shell command that will produce a record I/O
jar file containing only record I/O classes, separate from all other Hadoop classes. Please
educate me about the downside of that.

I value consensus more than anything, and that is why I am trying to explain my position.
Even the merit of changes needs to be decided by consensus.

> Make Hadoop Record I/O Easier to use outside Hadoop
> ---------------------------------------------------
>                 Key: HADOOP-941
>                 URL: https://issues.apache.org/jira/browse/HADOOP-941
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: record
>    Affects Versions: 0.10.1
>         Environment: All
>            Reporter: Milind Bhandarkar
>         Assigned To: Milind Bhandarkar
>         Attachments: jute-patch.txt
> Hadoop record I/O can be used effectively outside of Hadoop. It would increase its utility
> if developers can use it without having to import hadoop classes, or having to depend on Hadoop
> jars. Following changes to the current translator and runtime are proposed.
> Proposed Changes:
> 1. Use java.lang.String as a native type for ustring (instead of Text.)
> 2. Provide a Buffer class as a native Java type for buffer (instead of BytesWritable),
> so that later BytesWritable could be implemented as following DDL:
> module org.apache.hadoop.io {
>   record BytesWritable {
>     buffer value;
>   }
> }
> 3. Member names in generated classes should not have prefixes 'm' before their names.
> In the above example, the private member name would be 'value' not 'mvalue' as it is done
> 4. Convert getters and setters to have CamelCase. e.g. in the above example the getter
> will be:
>   public Buffer getValue();
> 5. Provide a 'swiggable' C binding, so that processing the generated C code with swig
> allows it to be used in scripting languages such as Python and Perl.
> 6. The default --language="java" target would generate class code for records that would
> not have Hadoop dependency on WritableComparable interface, but instead would have "implements
> Record, Comparable". (i.e. It will not have write() and readFields() methods.) An additional
> option "--writable" will need to be specified on rcc commandline to generate classes that
> "implements Record, WritableComparable".
> 7. Optimize generated write() and readFields() methods, so that they do not have to create
> BinaryOutputArchive or BinaryInputArchive every time these methods are called on a record.
> 8. Implement ByteInStream and ByteOutStream for C++ runtime, as they will be needed for
> using Hadoop Record I/O with forthcoming C++ MapReduce framework (currently, only FileStreams
> are provided.)
> 9. Generate clone() methods for records in Java i.e. the generated classes should implement
> 10. As part of Hadoop build process, produce a tar bundle for Record I/O alone. This
> tar bundle will contain the translator classes and ant task (lib/rcc.jar), translator script
> (bin/rcc), Java runtime (recordio.jar) that includes org.apache.hadoop.record.*, sources for
> the java runtime (src/java), and c/c++ runtime sources with Makefiles (src/c++, src/c).
> 11. Make generated Java codes for maps and vectors use Java generics.
> These are the proposed user-visible changes. Internally, the translator will be restructured
> so that it is easier to plug-in translators for different targets.
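
To make items 1, 3, 4, 6, 9 and 11 of the quoted proposal more concrete, here is a rough
sketch of what a generated Java class might look like under those rules. It is not the
actual rcc output. The record name BlobRecord, its 'tags' field, the Buffer class body and
the empty Record interface are all assumptions made up for illustration; the real Record
interface also declares serialization methods, which are left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the record I/O 'Record' interface (assumption: the real
// interface's serialization methods are omitted in this sketch).
interface Record {}

// Hypothetical native Java type for the DDL 'buffer' type (item 2).
final class Buffer implements Comparable<Buffer> {
    private final byte[] bytes;

    Buffer(byte[] bytes) {
        this.bytes = bytes.clone(); // defensive copy keeps Buffer immutable
    }

    byte[] get() {
        return bytes.clone();
    }

    // Lexicographic comparison, treating bytes as unsigned values.
    @Override
    public int compareTo(Buffer other) {
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int a = bytes[i] & 0xff;
            int b = other.bytes[i] & 0xff;
            if (a != b) {
                return a < b ? -1 : 1;
            }
        }
        return Integer.compare(bytes.length, other.bytes.length);
    }
}

// Possible generated class for the hypothetical DDL:
//   record BlobRecord { buffer value; vector<ustring> tags; }
class BlobRecord implements Record, Comparable<BlobRecord>, Cloneable {
    private Buffer value;       // item 3: 'value', not 'mvalue'
    private List<String> tags;  // items 1 and 11: String elements, generic List

    BlobRecord(Buffer value, List<String> tags) {
        this.value = value;
        this.tags = tags;
    }

    // Item 4: CamelCase getters and setters.
    public Buffer getValue() { return value; }
    public void setValue(Buffer value) { this.value = value; }
    public List<String> getTags() { return tags; }
    public void setTags(List<String> tags) { this.tags = tags; }

    // Item 6: plain Comparable, no write()/readFields().
    // This sketch orders records by 'value' only.
    @Override
    public int compareTo(BlobRecord other) {
        return value.compareTo(other.value);
    }

    // Item 9: clone() copies the list; the immutable Buffer can be shared.
    @Override
    public BlobRecord clone() {
        try {
            BlobRecord copy = (BlobRecord) super.clone();
            copy.tags = new ArrayList<>(tags);
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: class is Cloneable
        }
    }
}

class RecordSketchDemo {
    public static void main(String[] args) {
        BlobRecord a = new BlobRecord(new Buffer(new byte[] {1, 2}),
                                      new ArrayList<>(Arrays.asList("x")));
        BlobRecord b = a.clone();
        b.getTags().add("y"); // must not affect the original's list
        System.out.println(a.getTags() + " " + b.getTags() + " " + a.compareTo(b));
    }
}
```

Nothing in this sketch imports a Hadoop class, which is the point of the default
--language="java" target described in item 6.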

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
