accumulo-user mailing list archives

From Jared Winick <jaredwin...@gmail.com>
Subject Re: interesting
Date Fri, 03 May 2013 20:09:26 GMT
That is very interesting and sounds like a fun Friday project! Could you
please elaborate on how you mapped the original format of

ngram TAB year TAB match_count TAB volume_count NEWLINE

into Accumulo key/values? Could you also briefly explain which feature in
Accumulo is responsible for this improvement in storage efficiency? It
would be a helpful illustration for users of how key/value design can
take advantage of these Accumulo features. Thanks a lot!
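
To make the question concrete, here is a minimal sketch of one mapping I
could imagine. It is purely a guess on my part and not necessarily what you
did: the ngram becomes the row, the year becomes the column family, and the
two counts are packed into the value. The KeyValue type and the function
name are hypothetical, and the sketch does not use the Accumulo client API.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    """Hypothetical stand-in for an Accumulo key/value pair (not the real API)."""
    row: str
    column_family: str
    column_qualifier: str
    value: str


def ngram_line_to_key_value(line: str) -> KeyValue:
    """Map one 'ngram TAB year TAB match_count TAB volume_count' line to a key/value.

    Assumed layout (a guess, not the layout used in the original ingest):
      row = ngram, column family = year, empty qualifier,
      value = "match_count,volume_count".
    """
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return KeyValue(
        row=ngram,
        column_family=year,
        column_qualifier="",
        value=f"{match_count},{volume_count}",
    )


if __name__ == "__main__":
    # Made-up sample line, not taken from the real data set.
    print(ngram_line_to_key_value("example\t2008\t100\t20\n"))
```

With a layout like this, all years for a given ngram would sort next to each
other under the same row, which is the kind of repetition I am guessing the
storage format benefits from.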

Jared


On Fri, May 3, 2013 at 1:24 PM, Eric Newton <eric.newton@gmail.com> wrote:

> I think David Medinets suggested some publicly available data sources that
> could be used to compare the storage requirements of different key/value
> stores.
>
> Today I tried it out.
>
> I took the Google 1-gram word lists and ingested them into Accumulo.
>
> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>
> It took about 15 minutes to ingest on a 10 node cluster (4 drives each).
>
> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
> running...
> 5.2 G  /data/googlebooks/ngrams/1-grams
>
> $ hadoop fs -du -s -h /accumulo/tables/4
> running...
> 4.1 G  /accumulo/tables/4
>
> The storage format in Accumulo is about 20% more efficient than the gzip'd
> CSV files (4.1 G versus 5.2 G).
>
> I'll post the 2-gram results sometime next month when it's done
> downloading. :-)
>
> -Eric, which occurred 221K times in 34K books in 2008.
>
