accumulo-user mailing list archives

From Jared winick <jaredwin...@gmail.com>
Subject Re: Trendulo - A Twitter Analytics Demo on Accumulo
Date Mon, 30 Apr 2012 13:33:46 GMT
Here is an up-to-date estimate. I had naively reported disk usage from the
"Disk Used" field under the Accumulo Master section of the monitor. It
appears I am actually using only ~26 GB of storage for my Accumulo tables.
This is based on the "% Used" * "Unreplicated Capacity" fields in the
NameNode section of the monitor, and is corroborated by looking at the
file system usage for the HDFS data directories. I have no other data
in HDFS.

Dec 24 - Apr 30 = 128 days
3.0 billion entries / 128 days = 23.4 million entries/day
23.4 million entries/day / 1.2 million tweets/day ~ 20 entries/tweet
(I may have misstated the number of tweets per day as 3 million before;
it is about 1.2 million)

26 GB / (128 days * 1.2e6 tweets/day) ~ 182 bytes/tweet (taking 1 GB as 1024 ** 3 bytes)
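
For anyone who wants to re-run the arithmetic above, here is a minimal Python sketch. The variable names are mine, the year on Dec 24 is assumed to be 2011 (the message does not state it), and 1 GB is taken as 1024 ** 3 bytes, which is what reproduces the ~182 figure.

```python
from datetime import date

# Assumption: the collection started Dec 24, 2011 (year not stated in the message).
days = (date(2012, 4, 30) - date(2011, 12, 24)).days  # 128

entries = 3.0e9            # total Accumulo entries reported
tweets_per_day = 1.2e6     # corrected tweet rate
disk_bytes = 26 * 1024**3  # ~26 GB, taking 1 GB as 1024**3 bytes

entries_per_day = entries / days
entries_per_tweet = entries_per_day / tweets_per_day
bytes_per_tweet = disk_bytes / (days * tweets_per_day)

print(f"{days} days")
print(f"{entries_per_day / 1e6:.1f} million entries/day")
print(f"{entries_per_tweet:.1f} entries/tweet")
print(f"{bytes_per_tweet:.0f} bytes/tweet")
```

With decimal gigabytes (1e9 bytes) the last figure would come out closer to 169 bytes/tweet.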

I am using the VARLEN encoding for the SummingCombiner, which probably
saves a lot of space: I would imagine there are a lot of entries with a
very small count, as the language used on Twitter is far from normal.
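
To illustrate why a variable-length encoding helps with small counts, here is a rough Python sketch. It assumes VARLEN stores counts the way Hadoop's WritableUtils.writeVLong does (my understanding of Accumulo's LongCombiner, not something stated in this thread), and compares that with a fixed 8-byte long. The names vlong_size and FIXED_LEN_SIZE are made up for this example.

```python
FIXED_LEN_SIZE = 8  # bytes for a fixed-width 64-bit long


def vlong_size(value: int) -> int:
    """Bytes needed for a signed 64-bit value under a Hadoop-style
    variable-length long encoding (assumed to match VARLEN)."""
    # Values in [-112, 127] fit in a single byte.
    if -112 <= value <= 127:
        return 1
    # Negative values are bit-flipped before their magnitude is written.
    if value < 0:
        value = ~value
    data_bytes = 0
    while value:
        value >>= 8
        data_bytes += 1
    # One leading byte records the sign and the number of data bytes.
    return 1 + data_bytes


for count in (1, 20, 127, 128, 100_000):
    print(count, vlong_size(count), FIXED_LEN_SIZE)
```

Under this assumption, any counter with a value up to 127 takes 1 byte rather than 8, which is where the savings would come from if most n-grams are rare.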

On Fri, Apr 27, 2012 at 1:09 PM, Eric Newton <eric.newton@gmail.com> wrote:

>
> On Wed, Apr 25, 2012 at 3:10 PM, Jared winick <jaredwinick@gmail.com> wrote:
>
>> I am not exactly sure how to answer the question about storage size per
>> tweet as I am not actually storing the original tweet and if a counter
>> already exists for an n-gram/time period, then incrementing that counter
>> doesn't increase the storage size. I can follow up with the current storage
>> I am using though.
>>
>
> I see I can make some estimates based on the information in your talk. The
> slides are awesome, btw.
>
> Using the information you provided: Dec 24 - March 12... that's 88 days.
>  2.6e9 entries, 3 million-ish tweets per day:
>
> 2.6e9 / (3e6 * 88)
>
> ~10 entries per tweet.
>
> Also, you report disk usage of 72G,  which I will interpret as 72 * (1024
> ** 3) bytes.
>
> So, each tweet, on average occupies: 72G / (88 * 3e6) Or, ~300 bytes.
>
> -Eric
>
