accumulo-user mailing list archives

From Eric Newton <>
Subject interesting
Date Fri, 03 May 2013 19:24:28 GMT
I think David Medinets suggested some publicly available data sources that
could be used to compare the storage requirements of different key/value stores.

Today I tried it out.

I took the Google 1-gram word lists and ingested them into Accumulo.

It took about 15 minutes to ingest on a 10-node cluster (4 drives each).

$ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
5.2 G  /data/googlebooks/ngrams/1-grams

$ hadoop fs -du -s -h /accumulo/tables/4
4.1 G  /accumulo/tables/4

The storage format in Accumulo is about 20% more efficient than gzip'd CSV.
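
For anyone who wants to check the "about 20%" figure, here is a quick sketch in Python. The variable and function names are made up for illustration; the two sizes are just the rounded numbers from the du output above, so the result is only approximate.

```python
# Sizes as reported by `hadoop fs -du -s -h`, in gigabytes.
# These are the rounded, human-readable values, so the result is approximate.
gzip_csv_gb = 5.2   # /data/googlebooks/ngrams/1-grams
accumulo_gb = 4.1   # /accumulo/tables/4


def percent_saved(before_gb, after_gb):
    """Return how much smaller after_gb is than before_gb, as a percentage."""
    return (before_gb - after_gb) / before_gb * 100.0


savings = percent_saved(gzip_csv_gb, accumulo_gb)
print(f"Accumulo table is about {savings:.0f}% smaller than the gzip'd CSV")
```

With the sizes above this works out to roughly 21%, which matches the "about 20%" estimate.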

I'll post the 2-gram results sometime next month when it's done downloading.

-Eric, which occurred 221K times in 34K books in 2008.
