hadoop-common-issues mailing list archives

From "Luke Lu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5793) High speed compression algorithm like BMDiff
Date Mon, 17 Jan 2011 04:06:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982470#action_12982470 ]

Luke Lu commented on HADOOP-5793:
---------------------------------

Binglin: BMDiff is not a general-purpose compression scheme. If you read the Bigtable paper,
you'd see that it's primarily used for row compression over many similar documents (e.g. url
-> list of pages crawled at different dates). The code will probably be most useful for
HBase region files or TFile (in hadoop common, used by Hive etc.). You'll have to test with
your own data to see whether it applies.
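To make the point concrete, here is a rough sketch of why this workload benefits from long-range matching. It uses plain zlib rather than BMDiff (the row sizes here are small enough to fit zlib's window), and the "rows" are invented placeholders for repeated crawls of the same URL:

```python
import zlib

# Hypothetical data: ten near-identical "rows", standing in for crawls of
# the same page at different dates. BMDiff-style row compression targets
# exactly this kind of cross-row redundancy.
base = b"<html><body>" + b"example page content " * 50 + b"</body></html>"
rows = [base + b" crawl-date:2011-01-%02d" % d for d in range(1, 11)]

# Compressing each row on its own cannot exploit similarity between rows.
separate = sum(len(zlib.compress(r)) for r in rows)

# Compressing the rows together lets the compressor reference earlier rows,
# so the shared content is stored roughly once.
together = len(zlib.compress(b"".join(rows)))

print("separate:", separate, "together:", together)
```

On data like this the joint size is a small fraction of the per-row total; on rows with little mutual similarity the gap largely disappears, which is why you have to measure on your own data.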

> High speed compression algorithm like BMDiff
> --------------------------------------------
>
>                 Key: HADOOP-5793
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5793
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: elhoim gibor
>            Assignee: Michele Catasta
>            Priority: Minor
>
> Add a high speed compression algorithm like BMDiff.
> It gives speeds of ~100MB/s for writes and ~1000MB/s for reads, compressing 2.1 billion
> web pages from 45.1TB to 4.2TB.
> Reference:
> http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=437
> 2005 Jeff Dean talk about Google architecture, around 46:00.
> http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=755678
> A reference implementation exists in HyperTable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

