hadoop-common-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2853) Add Writable for very large lists of key / value pairs
Date Fri, 07 Mar 2008 20:51:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576368#action_12576368 ]

Andrzej Bialecki  commented on HADOOP-2853:
-------------------------------------------

The use case is quite simple, and if there is a better way to do it I'm all ears - especially
if it can still be realized as a single job.

I have one set of host-level statistics <host, hostStats>, and another set of page-level
data <url, urlMeta>. With this particular dataset, there are many thousands of urls coming
from a single host.

The operation I need to do is to apply a function to each urlMeta based on urlMeta and hostStats,
and output a modified set of <url, urlMeta>. The way to do it would be the following:

MAP: input <host, hostStats> and <url, urlMeta>; output <host, <url, urlMeta>> and <host, hostStats>

REDUCE: for each host key, iterate and collect all input values in an in-memory list until the
<host, hostStats> value comes in, so that we can apply it to the collected values - this may
happen at any position in the iterator, even very late, because AFAIK the order of values
in this Iterator is unspecified. After this is done, output the collected list as <url, urlMeta'>*.
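
For concreteness, here is a rough sketch of that reduce side using the org.apache.hadoop.mapred
API. HostOrUrlWritable (e.g. a GenericWritable-style wrapper for the two value types), HostStats,
UrlMeta and applyStats() are placeholders for illustration, not anything from Hadoop or from this
patch:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ApplyHostStatsReducer extends MapReduceBase
    implements Reducer<Text, HostOrUrlWritable, Text, UrlMeta> {

  public void reduce(Text host, Iterator<HostOrUrlWritable> values,
                     OutputCollector<Text, UrlMeta> output, Reporter reporter)
      throws IOException {
    // All urlMeta records for this host must be buffered, because the single
    // <host, hostStats> value may show up anywhere in the iterator. Hadoop
    // reuses the value object it hands out, so we copy what we buffer.
    List<UrlEntry> buffered = new ArrayList<UrlEntry>();
    HostStats stats = null;
    while (values.hasNext()) {
      HostOrUrlWritable v = values.next();
      if (v.isHostStats()) {
        stats = new HostStats(v.getHostStats());
      } else {
        buffered.add(new UrlEntry(new Text(v.getUrl()), new UrlMeta(v.getUrlMeta())));
      }
    }
    // Apply hostStats to every buffered urlMeta and emit <url, urlMeta'>.
    for (UrlEntry e : buffered) {
      output.collect(e.url, applyStats(e.meta, stats));
    }
  }

  // Placeholder for the real per-url function f(urlMeta, hostStats).
  private UrlMeta applyStats(UrlMeta meta, HostStats stats) {
    return meta;
  }

  private static class UrlEntry {
    final Text url;
    final UrlMeta meta;
    UrlEntry(Text url, UrlMeta meta) { this.url = url; this.meta = meta; }
  }
}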

Here, however, I run into OOM, because I need to keep a large in-memory list. It would be better
to keep it on disk. The SequenceWritable allows me to do this - in effect it replaces the
Iterator in reduce() with an on-disk List.
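
From reduce()'s point of view the change is small: in the sketch above only the buffering
container changes. DiskBackedList below is a hypothetical stand-in (its name, constructor and
methods are not quoted from the patch), and fs, conf and spillDir would be set up in configure():

    // Same reduce() body as the earlier sketch, but the buffer lives on disk.
    DiskBackedList<UrlEntry> buffered =
        new DiskBackedList<UrlEntry>(fs, conf, new Path(spillDir, host.toString()));
    HostStats stats = null;
    while (values.hasNext()) {
      HostOrUrlWritable v = values.next();
      if (v.isHostStats()) {
        stats = new HostStats(v.getHostStats());
      } else {
        buffered.add(new UrlEntry(new Text(v.getUrl()), new UrlMeta(v.getUrlMeta())));
      }
    }
    for (UrlEntry e : buffered) {            // streamed back from the spill file
      output.collect(e.url, applyStats(e.meta, stats));
    }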

> Add Writable for very large lists of key / value pairs
> ------------------------------------------------------
>
>                 Key: HADOOP-2853
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2853
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.17.0
>
>         Attachments: sequenceWritable-v1.patch, sequenceWritable-v2.patch, sequenceWritable-v3.patch, sequenceWritable-v4.patch, sequenceWritable-v5.patch
>
>
> Some map-reduce jobs need to aggregate and process very long lists as a single value.
> This usually happens when keys from a large domain are mapped into a small domain, and
> their associated values cannot be aggregated into a few values but need to be preserved
> as members of a large list. Currently this can be implemented as a MapWritable or
> ArrayWritable - however, Hadoop needs to deserialize the current key and value completely
> into memory, which for extremely large values causes frequent OOM exceptions. This also
> works only with lists of relatively small size (e.g. 1000 records).
> This patch is an implementation of a Writable that can handle arbitrarily long lists.
> Initially it keeps an internal buffer (which can be (de)serialized in the ordinary way),
> and if the list size exceeds a certain threshold it is spilled to an external SequenceFile
> (hence the name) on a configured FileSystem. The content of this Writable can be iterated,
> and the data is pulled either from the internal buffer or from the external file in a
> transparent way.
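
To make the described mechanism concrete, here is a much-simplified sketch of the
buffer-then-spill shape. This is not the attached patch; the class, constructor and callback
below are made up for illustration, and only the SequenceFile and ReflectionUtils calls are
real Hadoop API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SpillingValueList {
  private final FileSystem fs;
  private final Configuration conf;
  private final Path spillFile;
  private final int threshold;
  private final Class<? extends Writable> valueClass;
  private final List<Writable> buffer = new ArrayList<Writable>();
  private SequenceFile.Writer writer;          // created lazily on first spill

  public SpillingValueList(FileSystem fs, Configuration conf, Path spillFile,
                           Class<? extends Writable> valueClass, int threshold) {
    this.fs = fs;
    this.conf = conf;
    this.spillFile = spillFile;
    this.valueClass = valueClass;
    this.threshold = threshold;
  }

  public void add(Writable value) throws IOException {
    if (writer == null && buffer.size() < threshold) {
      buffer.add(value);                       // small lists never touch the disk
      return;
    }
    if (writer == null) {                      // threshold crossed: move the buffer to disk
      writer = SequenceFile.createWriter(fs, conf, spillFile,
                                         NullWritable.class, valueClass);
      for (Writable buffered : buffer) {
        writer.append(NullWritable.get(), buffered);
      }
      buffer.clear();
    }
    writer.append(NullWritable.get(), value);
  }

  // Iteration reads either the in-memory buffer or the spill file, so callers
  // do not need to know where the data ended up. In this sketch the list
  // becomes read-only once it has been iterated.
  public void forEach(ValueCallback cb) throws IOException {
    if (writer == null) {
      for (Writable v : buffer) {
        cb.handle(v);
      }
      return;
    }
    writer.close();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, spillFile, conf);
    try {
      NullWritable key = NullWritable.get();
      Writable value = ReflectionUtils.newInstance(valueClass, conf);
      while (reader.next(key, value)) {
        cb.handle(value);
      }
    } finally {
      reader.close();
    }
  }

  public interface ValueCallback {
    void handle(Writable value) throws IOException;
  }
}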

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

