couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Day <sja...@gmail.com>
Subject Re: Best way to store 2^32 IPs in CouchDB
Date Mon, 01 Feb 2010 23:16:02 GMT
I thought I'd weigh in on this to illustrate the differences in the use
cases between heterogeneous document based data vs homogeneous data, such as
IP address adjacencies. I have a bit of a networking background, so if I am
way off here in your intent, this may at least be an interesting set of
commentary regarding a hypothetical couchdb "router".

My assumption here is that your problem seems similar to a very specialized
problem that is "solved" in routers. Typically, in a network routing table,
a highly specialized mtrie structure is used, with a depth of 4 and 255
leaves per node (obviously not all populated), that stores adjacency
information at the "leaves". It allows one to look up the adjacency entries
for any ip address with 4 array lookups.

Couchdb is much more generalized, so I wouldn't expect it to perform as well
in this case when compared to the mtrie. The idea behind couch is to
store heterogeneous data, then provide indexes on this. And even though
couch isn't designed to hold adjacency data, its flexibility allows it to do
something very similar to the mtrie. Lets say we have a database filled with
adjacency documents that are the basis of a simplified routing system on
couch. Entries might look like this:

"3.0.0.1" -> {"networks": ["1.0.0.0/24", "2.0.0.0/24"]}

Let's say the semantics here are "Networks 1.0.0.0/24 and 2.0.0.0/24 are
available via 3.0.0.1". This would basically be an adjacency entry. Your
view code would produce routing pairs from the adjacency information above
(assume it can generate keys from netmasks):

"1.0.0.1" -> "3.0.0.1"
"1.0.0.2" -> "3.0.0.1"
...
"1.0.0.254" -> "3.0.0.1"

Then, again semantically, you might ask "how do i get to 1.0.0.16?". Your
view would respond with "3.0.0.1". Despite this "working", the storage
required here is orders of magnitude larger than that required for
the homogeneous mtrie, especially because ip addresses now take up to 16
bytes, instead of 4, not to mention the storage for the metadata of each
adjacency (rev and id) and the size of b-tree to store and index it. This is
the cost of flexibility. If your data is very homogeneous in that every
single key can be represented as the same type more efficiently that the
string representation, such an ip address, then couchdb may not be the right
tool.

I hope this helps.

Stephen J Day

On Mon, Feb 1, 2010 at 12:43 PM, Brian Candler <B.Candler@pobox.com> wrote:

> On Mon, Feb 01, 2010 at 07:50:00PM +0100, Santi Saez wrote:
> > El 01/02/10 17:56, Paul Davis escribió:
> >
> > Dear Paul,
> >
> > >Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
> > >bytes and you're at 16GiB. Even with just 1KiB overhead you're at
> > >4TiB.
> > >
> > >I'm left wondering why you would want to store a list of numbers in
> > >the first place.
> >
> > Imagine a service like Netcraft.
>
> Then what you want is HTTP virtual hosts, not IP addresses?
>
> Remember that one IP address can serve tens of thousands of virtual hosts.
> (A couchdb document for one IP address could list multiple HTTP hosts
> within
> the JSON, of course)
>
> But according to Netcraft there are around 200M hosts, which is only about
> 5% of what you were looking at before.  In other words, this is a "sparse"
> dataset; there is no value in storing IP addresses which don't have any
> information of interest to you.
>
> Another trick which may compact your data is to group it into /24's.  That
> is, one JSON document for all of 0.0.0.0-0.0.0.255, another for
> 0.0.1.0-0.0.1.255 etc.  As well as reducing overhead, there are other
> obvious savings (e.g. if you're sweeping network blocks then you can store
> a
> single timestamp to say when the sweep of that /24 was performed)
>
> HTH,
>
> Brian.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message