hadoop-common-issues mailing list archives

From "Binglin Chang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5793) High speed compression algorithm like BMDiff
Date Sun, 16 Jan 2011 17:04:52 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982338#action_12982338 ]

Binglin Chang commented on HADOOP-5793:
---------------------------------------

Luke: I read the paper "Data compression using long common strings", which describes BMDiff.
It seems that the main advantage of BMDiff is that it can find long common strings across the
entire file (not only within the sliding window, as in dictionary-based algorithms). But Hadoop
uses a streaming compression framework, which sends one block (buffer) at a time to the
compressor/decompressor. This prevents BMDiff from finding repeated strings across the entire
file, and may lead to poor compression results. Are there any test results showing the
relationship between pack (buffer) size, compression speed, and ratio?
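
To make the buffer-size concern concrete, here is a minimal sketch of the kind of test being asked about. It is not BMDiff and not Hadoop code: it uses Python's standard lzma module purely as a stand-in for a compressor that can match repeats over long distances, and the helper names (make_sample, compressed_size_per_block) and the 64 KB chunk size are assumptions chosen for illustration. Each block is compressed independently, so repeats that lie in different blocks cannot be exploited.

```python
import lzma
import os


def make_sample(chunk_size: int = 64 * 1024, repeats: int = 8) -> bytes:
    """Build test data: one incompressible random chunk repeated several times.

    The only redundancy is at a distance of chunk_size bytes, so a compressor
    can shrink the data only if it sees at least two copies in one buffer.
    """
    chunk = os.urandom(chunk_size)
    return chunk * repeats


def compressed_size_per_block(data: bytes, block_size: int) -> int:
    """Compress each block independently and return the total compressed size.

    This mimics a framework that hands the compressor one buffer at a time,
    with no state shared between buffers (an assumption of this sketch).
    """
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += len(lzma.compress(block))
    return total


if __name__ == "__main__":
    sample = make_sample()
    # Compare buffer sizes from one chunk up to the whole input.
    for block_size in (64 * 1024, 128 * 1024, 256 * 1024, len(sample)):
        size = compressed_size_per_block(sample, block_size)
        print(f"block={block_size:>7} bytes  compressed={size:>7} bytes  "
              f"ratio={len(sample) / size:.2f}")
```

With a buffer equal to one chunk, no buffer contains a repeat, so nothing compresses; as the buffer grows to cover more copies, the ratio improves. Measuring elapsed time around the loop would add the speed dimension of the question.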

> High speed compression algorithm like BMDiff
> --------------------------------------------
>
>                 Key: HADOOP-5793
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5793
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: elhoim gibor
>            Assignee: Michele Catasta
>            Priority: Minor
>
> Add a high speed compression algorithm like BMDiff.
> It gives speeds of ~100MB/s for writes and ~1000MB/s for reads, compressing 2.1 billion
> web pages from 45.1TB to 4.2TB.
> Reference:
> http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=437
> 2005 Jeff Dean talk about google architecture - around 46:00.
> http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=755678
> A reference implementation exists in HyperTable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

