I'd be very curious how something faster, like Snappy, compared.

Christopher L Tubbs II
On Wed, May 15, 2013 at 2:52 PM, Eric Newton wrote:
> I don't intend to do that.
On Wed, May 15, 2013 at 12:11 PM, Josh Elser wrote:
>> Just kidding, reread the rest of this. Let me try again:
>> Any intent to retry this with different compression codecs?
On 5/15/13 12:00 PM, Josh Elser wrote:
>>> RFile... with gzip? Or did you use another compressor?
On 5/15/13 10:58 AM, Eric Newton wrote:
>>>> I ingested the 2gram data on a 10 node cluster. It took just under 7
>>>> hours. For most of the job, accumulo ingested at about 200K kv/s per server.
>>>> $ hadoop fs -dus /accumulo/tables/2 /data/ngrams/2grams
>>>> /accumulo/tables/2 74632273653
>>>> /data/ngrams/2grams 154271541304
>>>> That's a very nice result. RFile compressed the same data to half the
>>>> size of the gzip'd CSV format.
>>>> There are 37,582,158,107 entries in the 2gram set, which means that
>>>> accumulo is using only 2 bytes for each entry.
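The "2 bytes for each entry" figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes the size listing above reads as 74632273653 bytes for /accumulo/tables/2 and 154271541304 bytes for /data/ngrams/2grams (the path and size run together in the quoted output), and the variable names are illustrative.

```python
# Sizes in bytes, as read from the `hadoop fs` size listing quoted above.
# Assumption: each listing line is a path followed by its size.
rfile_bytes = 74_632_273_653      # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304  # /data/ngrams/2grams
entries = 37_582_158_107          # entries in the 2gram set

bytes_per_entry = rfile_bytes / entries
size_ratio = rfile_bytes / gzip_csv_bytes

print(f"{bytes_per_entry:.2f} bytes per entry")
print(f"RFile size is {size_ratio:.0%} of the gzip'd CSV size")
```

Under that reading, the table comes out just under 2 bytes per entry and a little under half the size of the gzip'd input, which matches both claims in the message.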
>>>> Eric Newton, which appeared 62 times in 37 books in 2008.
On Fri, May 3, 2013 at 7:20 PM, Eric Newton wrote:
>>>> ngram == row
>>>> year == column family
>>>> count == column qualifier (prepended with zeros for sort)
>>>> book count == value
>>>> I used ascii text for the counts, even.
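A minimal sketch of the mapping described above, in plain Python rather than the Accumulo client API. The `Entry` type, the `to_entry` helper, and the padding width of 12 are hypothetical; the thread only says the count was prepended with zeros and stored as ASCII text.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One key/value pair, using the field names from the mapping above."""
    row: str               # ngram
    column_family: str     # year
    column_qualifier: str  # match count, zero-padded so text order == numeric order
    value: str             # book (volume) count, as ASCII text


COUNT_WIDTH = 12  # hypothetical width; the thread does not state the real one


def to_entry(line: str) -> Entry:
    """Map 'ngram TAB year TAB match_count TAB volume_count NEWLINE' to an Entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


entry = to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

The zero padding matters because keys sort as text: without it, a count of "100" would sort before "62".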
>>>> I'm not sure whether the google entries are sorted; if they are not,
>>>> the sort would help compression.
>>>> The RFile format does not repeat identical data from key to key, so
>>>> in most cases, the row is not repeated. That gives gzip other
>>>> things to work on.
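To illustrate the idea of not repeating identical data from key to key, here is a simplified sketch. It is not RFile's actual on-disk encoding: the functions `encode_relative` and `decode_relative` are hypothetical, and all sample keys except the 2008 "Eric Newton" entry are made up.

```python
def encode_relative(keys):
    """Replace each field that equals the previous key's field with None."""
    previous = (None, None, None)
    encoded = []
    for key in keys:
        encoded.append(tuple(None if f == p else f for f, p in zip(key, previous)))
        previous = key
    return encoded


def decode_relative(encoded):
    """Rebuild full keys by copying omitted fields from the previous key."""
    previous = (None, None, None)
    keys = []
    for item in encoded:
        key = tuple(p if f is None else f for f, p in zip(item, previous))
        keys.append(key)
        previous = key
    return keys


# (row, column family, column qualifier), already in sorted order.
keys = [
    ("Eric Newton", "2007", "000000000050"),   # made-up sample entry
    ("Eric Newton", "2008", "000000000062"),
    ("Eric Newtons", "2008", "000000000003"),  # made-up sample entry
]
encoded = encode_relative(keys)
for item in encoded:
    print(item)
```

Because an ngram has one entry per year, sorted input puts long runs of the same row next to each other, so the row is stored once per run and the general-purpose compressor only sees the parts that change.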
>>>> I'll have to do more analysis to figure out why RFile did so well.
>>>> Perhaps google used less aggressive settings for their
>>>> compression.
>>>> I'm more interested in 2grams to test our partial-row compression
>>>> in 1.5.
>>>> Eric
On Fri, May 3, 2013 at 4:09 PM, Jared Winick wrote:
>>>> That is very interesting and sounds like a fun Friday project!
>>>> Could you please elaborate on how you mapped the original format of
>>>>
>>>> ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>
>>>> into Accumulo key/values? Could you also briefly explain which feature
>>>> in Accumulo is responsible for this improvement in storage
>>>> efficiency? This could be a helpful illustration for users of
>>>> how key/value design can take advantage of these Accumulo
>>>> features. Thanks a lot!
>>>> Jared
On Fri, May 3, 2013 at 1:24 PM, Eric Newton wrote:
>>>>
>>>> I think David Medinets suggested some publicly available
>>>> data sources that could be used to compare the storage
>>>> requirements of different key/value stores.
>>>> Today I tried it out.
>>>>
>>>> I took the google 1gram word lists and ingested them into
>>>> accumulo.
>>>>
>>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>
>>>> It took about 15 minutes to ingest on a 10 node cluster (4
>>>> drives each).
>>>>
>>>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1grams
>>>> running...
>>>> 5.2 G /data/googlebooks/ngrams/1grams
>>>> $ hadoop fs -du -s -h /accumulo/tables/4
>>>> running...
>>>> 4.1 G /accumulo/tables/4
>>>>
>>>> The storage format in accumulo is about 20% more efficient
>>>> than gzip'd csv files.
>>>> I'll post the 2gram results sometime next month when it's
>>>> done downloading. :)
>>>> Eric, which occurred 221K times in 34K books in 2008.
