incubator-couchdb-user mailing list archives

From Brian Candler <B.Cand...@pobox.com>
Subject Re: Best way to store 2^32 IPs in CouchDB
Date Mon, 01 Feb 2010 20:43:59 GMT
On Mon, Feb 01, 2010 at 07:50:00PM +0100, Santi Saez wrote:
> On 01/02/10 17:56, Paul Davis wrote:
> 
> Dear Paul,
> 
> >Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
> >bytes and you're at 16GiB. Even with just 1KiB overhead you're at
> >4TiB.
> >
> >I'm left wondering why you would want to store a list of numbers in
> >the first place.
> 
> Imagine a service like Netcraft.

Then what you want is HTTP virtual hosts, not IP addresses?

Remember that one IP address can serve tens of thousands of virtual hosts.
(A CouchDB document for one IP address could list multiple HTTP hosts within
the JSON, of course.)

But according to Netcraft there are around 200M hosts, which is only about
5% of the 2^32 (roughly 4.3 billion) addresses you were looking at before.
In other words, this is a "sparse" dataset; there is no value in storing IP
addresses which don't have any information of interest to you.
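
To put rough numbers on that, here is a back-of-envelope sketch in Python.
The 4-byte and 1 KiB per-record sizes are the ones from Paul's estimate; the
names are made up for illustration, and real CouchDB overhead per document
will differ.

```python
# Back-of-envelope storage estimates. Per-record sizes are illustrative
# assumptions taken from the thread, not measured CouchDB figures.
TOTAL_IPV4 = 2 ** 32          # every possible IPv4 address
ACTIVE_HOSTS = 200_000_000    # approximate Netcraft host count quoted above

GIB = 2 ** 30


def storage_bytes(records: int, bytes_per_record: int) -> int:
    """Total bytes needed for `records` entries of a fixed size."""
    return records * bytes_per_record


# Fraction of the address space that is actually of interest.
fraction = ACTIVE_HOSTS / TOTAL_IPV4

if __name__ == "__main__":
    print(f"dense, 4 B each:    {storage_bytes(TOTAL_IPV4, 4) / GIB:.0f} GiB")
    print(f"dense, 1 KiB each:  {storage_bytes(TOTAL_IPV4, 1024) / GIB:.0f} GiB")
    print(f"sparse, 1 KiB each: {storage_bytes(ACTIVE_HOSTS, 1024) / GIB:.0f} GiB")
    print(f"fraction populated: {fraction:.1%}")
```

Storing only the populated entries cuts the 1 KiB-per-record case from 4 TiB
to roughly 190 GiB.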

Another trick which may compact your data is to group it into /24's.  That
is, one JSON document for all of 0.0.0.0-0.0.0.255, another for
0.0.1.0-0.0.1.255 etc.  As well as reducing per-document overhead, there are
other obvious savings (e.g. if you're sweeping network blocks then you can
store a single timestamp to say when the sweep of that /24 was performed).
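
A minimal sketch of that grouping in Python, using only the standard
library. The document layout (the "swept_at" and "hosts" fields, and using
the /24 network address as the _id) is my own assumption for illustration,
not anything CouchDB requires; the dict stands in for the database.

```python
import ipaddress


def block_id(ip: str) -> str:
    """Network address of the /24 containing `ip`, e.g. '192.0.2.0'.

    Used here as a hypothetical document _id.
    """
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net.network_address)


def add_host(docs: dict, ip: str, hostnames: list, swept_at: str) -> dict:
    """Record `hostnames` for `ip` inside the document for its /24.

    `docs` is a plain dict keyed by block_id, standing in for the database.
    """
    key = block_id(ip)
    doc = docs.setdefault(key, {"_id": key, "hosts": {}})
    doc["swept_at"] = swept_at  # one timestamp for the whole /24
    last_octet = str(int(ipaddress.ip_address(ip)) & 0xFF)
    doc["hosts"].setdefault(last_octet, []).extend(hostnames)
    return doc


if __name__ == "__main__":
    docs = {}
    add_host(docs, "192.0.2.17", ["a.example"], "2010-02-01T20:00:00Z")
    add_host(docs, "192.0.2.200", ["b.example", "c.example"], "2010-02-01T20:00:00Z")
    add_host(docs, "192.0.3.1", ["d.example"], "2010-02-01T20:05:00Z")
    for doc in docs.values():
        print(doc)
```

Addresses with nothing of interest simply never appear in "hosts", so the
sparse data costs nothing beyond the /24 documents that are actually used.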

HTH,

Brian.
