hadoop-common-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
Date Mon, 25 Feb 2008 19:24:51 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1986:

    Attachment: SequenceFileWriterBenchmark.java

I've written a local benchmark to measure the effect of the patch. I focused on RandomWriter: its map input is not read from disk and it has no reducers, so the bulk of its processing is writing the random output to a SequenceFile. The benchmark simulates this pattern by writing Writable keys and values to an in-memory filesystem. The file was 256MB, with 256-byte keys and values. Here are the numbers (using Java 6), averaged over 50 runs.

Trunk: 1301912844 ns
Patch: 1338563600 ns

This is a 2.8% overhead. When writing to disk I get the following numbers:

Trunk: 5431308533 ns
Patch: 5604898533 ns

A 3.2% overhead. This surprised me, as I had expected the overhead to be insignificant
compared to disk IO.
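For reference, the write pattern the benchmark measures can be sketched roughly as follows. This is illustrative Java only, not the attached SequenceFileWriterBenchmark.java; the class name and seed are made up, and raw byte arrays stand in for Writable keys and values:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Random;

// Rough sketch of the pattern the benchmark measures: write random
// fixed-size keys and values to an in-memory stream (no disk IO, no
// reducers) and time the append loop.
public class WriterBenchmarkSketch {
    static final int RECORD_SIZE = 256; // key and value size in bytes

    // Writes totalBytes of random key/value data and returns elapsed ns.
    public static long run(long totalBytes) throws IOException {
        Random random = new Random(42); // arbitrary fixed seed
        byte[] key = new byte[RECORD_SIZE];
        byte[] value = new byte[RECORD_SIZE];
        DataOutputStream out =
            new DataOutputStream(new ByteArrayOutputStream());

        long records = totalBytes / (2 * RECORD_SIZE);
        long start = System.nanoTime();
        for (long i = 0; i < records; i++) {
            random.nextBytes(key);
            random.nextBytes(value);
            out.write(key);   // stand-in for serializing a Writable key
            out.write(value); // stand-in for serializing a Writable value
        }
        out.flush();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // 256MB, matching the benchmark's file size
        System.out.println("elapsed ns: " + run(256L * 1024 * 1024));
    }
}
```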

I altered the patch to special-case SequenceFile.Writer.append(Writable, Writable), and the
times matched trunk (within 0.2% in either direction).

So it seems we can avoid the overhead entirely by special-casing Writable. Besides SequenceFile.Writer, this would also need doing in MapTask.MapOutputBuffer and ReduceTask.ValuesIterator.
I think it can be done with minimal code duplication. It is obviously not as clean a
solution as the current patch, but given the performance constraints and the general desire
to get this issue fixed, I think it is the best way to proceed.
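The special-casing described above might look something like the sketch below. The Writable and Serializer interfaces here are minimal illustrative stand-ins (the real ones live in org.apache.hadoop.io and the serializer framework this issue introduces), and WriterSketch is a hypothetical name, not the actual patch code:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative stand-ins for the real Hadoop types.
interface Writable {
    void write(DataOutputStream out) throws IOException;
}

interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
}

// Sketch of special-casing Writable in an append() path: Writable
// keys and values take the direct write() fast path (matching trunk's
// behavior), while other types go through the general serializer.
class WriterSketch<K, V> {
    private final DataOutputStream out;
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;

    WriterSketch(DataOutputStream out, Serializer<K> keySerializer,
                 Serializer<V> valueSerializer) throws IOException {
        this.out = out;
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
        keySerializer.open(out);
        valueSerializer.open(out);
    }

    void append(K key, V value) throws IOException {
        if (key instanceof Writable && value instanceof Writable) {
            // Fast path: no indirection through the serializer framework.
            ((Writable) key).write(out);
            ((Writable) value).write(out);
        } else {
            keySerializer.serialize(key);
            valueSerializer.serialize(value);
        }
    }
}
```

The same instanceof dispatch could be repeated in the map output buffer and reduce-side values iterator, which is where the small amount of code duplication would come in.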


> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.17.0
>         Attachments: hadoop-serializer-v2.tar.gz, SequenceFileWriterBenchmark.java, SerializableWritable.java,
serializer-v1.patch, serializer-v2.patch, serializer-v3.patch, serializer-v4.patch, serializer-v5.patch
> Currently Map Reduce programs have to use WritableComparable keys and Writable values.
While it's possible to write Writable wrappers for other serialization frameworks (such as
Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types
directly, without explicit wrapping and unwrapping.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
