accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: interesting
Date Wed, 15 May 2013 16:11:12 GMT
Just kidding, re-read the rest of this. Let me try again:

Any plans to retry this with different compression codecs?
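
Something like this would switch the codec for newly written files
(untested sketch; "conn" is an existing Connector and "twograms" is a
made-up table name):

    conn.tableOperations().setProperty("twograms",
        "table.file.compress.type", "lzo");

Existing files keep their old codec until they are rewritten, so running
"compact -t twograms" in the shell afterwards would make it an
apples-to-apples comparison.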

On 5/15/13 12:00 PM, Josh Elser wrote:
> RFile... with gzip? Or did you use another compressor?
>
> On 5/15/13 10:58 AM, Eric Newton wrote:
>> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
>> hours.  For most of the job, accumulo ingested at about 200K k-v/server.
>>
>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>> /accumulo/tables/2        74632273653
>> /data/n-grams/2-grams    154271541304
>>
>> That's a very nice result.  RFile compressed the same data to half the
>> size of the gzip'd CSV format.
>>
>> There are 37,582,158,107 entries in the 2-gram set, which means that
>> accumulo is using only 2 bytes for each entry.
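>> (74,632,273,653 bytes / 37,582,158,107 entries works out to just under 2
>> bytes per entry on disk, after compression.)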
>>
>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>
>>
>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <eric.newton@gmail.com
>> <mailto:eric.newton@gmail.com>> wrote:
>>
>>     ngram == row
>>     year == column family
>>     count == column qualifier (prepended with zeros for sort)
>>     book count == value
>>
>>     I used ascii text for the counts, even.
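>>
>>     Roughly, each input line turns into a Mutation like this (sketch from
>>     memory, untested; the padding width is arbitrary):
>>
>>     import org.apache.accumulo.core.data.Mutation;
>>     import org.apache.accumulo.core.data.Value;
>>     import org.apache.hadoop.io.Text;
>>
>>     // input line: ngram TAB year TAB match_count TAB volume_count
>>     static Mutation toMutation(String line) {
>>       String[] f = line.split("\t");
>>       // zero-pad the count so lexicographic order matches numeric order
>>       String paddedCount = String.format("%012d", Long.parseLong(f[2]));
>>       Mutation m = new Mutation(new Text(f[0]));         // row = ngram
>>       m.put(new Text(f[1]),                              // cf = year
>>             new Text(paddedCount),                       // cq = padded count
>>             new Value(f[3].getBytes()));                 // value = book count
>>       return m;
>>     }
>>
>>     Feed those Mutations to a BatchWriter and that's the whole ingest.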
>>
>>     I'm not sure whether the google entries are sorted; if they aren't,
>>     the sorting alone would help compression.
>>
>>     The RFile format does not repeat identical data from key to key, so
>>     in most cases, the row is not repeated.  That gives gzip other
>>     things to work on.
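>>
>>     For example, two consecutive entries for the same 1-gram look like
>>     this (made-up numbers):
>>
>>         serendipity 1999:000000001234 []    567
>>         serendipity 2000:000000001456 []    612
>>
>>     Only the parts of the second key that differ from the first get
>>     written out, so what gzip sees is mostly the small year/count
>>     differences.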
>>
>>     I'll have to do more analysis to figure out why RFile did so well.
>>     Perhaps google used less aggressive settings for their compression.
>>
>>     I'm more interested in 2-grams to test our partial-row compression
>>     in 1.5.
>>
>>     -Eric
>>
>>
>>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <jaredwinick@gmail.com
>>     <mailto:jaredwinick@gmail.com>> wrote:
>>
>>         That is very interesting and sounds like a fun Friday project!
>>         Could you please elaborate on how you mapped the original
>>         format of
>>
>>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>>
>>         into Accumulo key/values? Could you also briefly explain which
>>         feature in Accumulo is responsible for this improvement in
>>         storage efficiency? This could be a helpful illustration for
>>         users of how key/value design can take advantage of these
>>         Accumulo features. Thanks a lot!
>>
>>         Jared
>>
>>
>>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>>         <eric.newton@gmail.com <mailto:eric.newton@gmail.com>> wrote:
>>
>>             I think David Medinets suggested some publicly available
>>             data sources that could be used to compare the storage
>>             requirements of different key/value stores.
>>
>>             Today I tried it out.
>>
>>             I took the google 1-gram word lists and ingested them into
>>             accumulo.
>>
>>
>>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>
>>             It took about 15 minutes to ingest on a 10 node cluster (4
>>             drives each).
>>
>>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>             running...
>>             5.2 G  /data/googlebooks/ngrams/1-grams
>>
>>             $ hadoop fs -du -s -h /accumulo/tables/4
>>             running...
>>             4.1 G  /accumulo/tables/4
>>
>>             The storage format in accumulo is about 20% more efficient
>>             than the gzip'd CSV files.
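>>             (4.1 G / 5.2 G works out to roughly 0.79, i.e. about 20% less
>>             space for the same data.)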
>>
>>             I'll post the 2-gram results sometime next month when it's
>>             done downloading. :-)
>>
>>             -Eric, which occurred 221K times in 34K books in 2008.
>>
