hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2853) Add Writable for very large lists of key / value pairs
Date Tue, 11 Mar 2008 18:58:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577560#action_12577560

Chris Douglas commented on HADOOP-2853:


Did the preceding suggestion a) make sense and b) work for your use case? You're right about
the first example: it was assumed that the set of hosts was relatively small compared to the
set of URIs and small enough to fit in memory. If that's not the case, then the latter implementation
should provide equal if not better performance with only the overhead of storing the hostStat
tuple for its matching URIs in the reduce.

If the preceding approach resolves your issue, are there other cases that this patch might
address? Again: paging out a list of records- thereby hiding it from the framework- seems
to make a worst case of the locality that map/reduce is designed to exploit. If there are
sound reasons for that, then we should by all means make this available, but at first glance
it appears to be a coarse grained amalgamation best effected by the very framework it bypasses.
The idiom for bundling a set of records associated by some secondary context is to "tag" keys;
are there situations where that is insufficient or inelegant in ways that this patch is not?

> Add Writable for very large lists of key / value pairs
> ------------------------------------------------------
>                 Key: HADOOP-2853
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2853
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.17.0
>         Attachments: sequenceWritable-v1.patch, sequenceWritable-v2.patch, sequenceWritable-v3.patch,
sequenceWritable-v4.patch, sequenceWritable-v5.patch
> Some map-reduce jobs need to aggregate and process very long lists as a single value.
This usually happens when keys from a large domain are mapped into a small domain, and their
associated values cannot be aggregated into few values but need to be preserved as members
of a large list. Currently this can be implemented as a MapWritable or ArrayWritable - however,
Hadoop needs to deserialize the current key and value completely into memory, which for extremely
large values causes frequent OOM exceptions. This also works only with lists of relatively
small size (e.g. 1000 records).
> This patch is an implementation of a Writable that can handle arbitrarily long lists.
Initially it keeps an internal buffer (which can be (de)-serialized in the ordinary way),
and if the list size exceeds certain threshold it is spilled to an external SequenceFile (hence
the name) on a configured FileSystem. The content of this Writable can be iterated, and the
data is pulled either from the internal buffer or from the external file in a transparent

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message