hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3063) BloomMapFile - fail-fast version of MapFile for sparsely populated key space
Date Sat, 29 Mar 2008 21:53:24 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Andrzej Bialecki  updated HADOOP-3063:

    Status: Patch Available  (was: Open)

> BloomMapFile - fail-fast version of MapFile for sparsely populated key space
> ----------------------------------------------------------------------------
>                 Key: HADOOP-3063
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3063
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Andrzej Bialecki 
>         Attachments: bloommap-v2.patch, bloommap.patch
> The need for this improvement arose when working with large ancillary MapFile-s (essentially
used as external dictionaries). For each invokation of map() / reduce() it was necessary to
perform several look-ups in these MapFile-s, and in case of sparsely populated key-space the
cost of finding that a key is absent was too high.
> This patch implements a subclass of MapFile that creates a Bloom filter from all keys,
so that accurate tests for absence of keys can be performed quickly and with 100% accuracy.
> Writer.append() operations update a DynamicBloomFilter, which is then serialized when
the Writer is closed. This filter is loaded in memory when a Reader is created. Reader.get()
operation first checks the filter for the key membership, and if the key is absent it immediately
returns null without doing any further IO.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message