On Tue, Nov 1, 2011 at 7:28 PM, Aaron Turner <synfinatic@gmail.com> wrote:
On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki <tmarki@gmail.com> wrote:
> Hello,
> I'm new to the list and also to Cassandra. I found it when I was searching
> for something to replace our busy mysql server.
> One of the things we use the server for is filtering IPs based on a list of
> IP ranges. These ranges can be small and big, and there are about 50k of
> them in the database.
> In mysql this is pretty quick: they are stored as integers, and the query
> basically looks like (say ip is the ip we want to find the all the ranges
> for):
> select range from rangelist where ip_start<=ip and ip_end>=ip;
> I tried to move this schema to Cassandra, but it turned out to be very slow,
> even with indexes on both columns. Since I also had to have an EQ expression
> in the query, I added an indexed text field which was the same for all rows,
> so the query in cassandra was something like this:
> select range from rangelist where type='ip' and ip_start<=ip and ip_end>=ip;
> This was very slow, and I imagine it is because it has to scan through all
> the rows, making the index useless.
> The second thing I tried was to just expand the ranges and store individual
> IPs as the keys to a column family. This is very fast to query, but the
> problem is that I now have over 2.7 million rows, because some of the ranges
> are quite large.
> As the number of ranges could change, this method could be a problem -
> imagine we add a whole A-class range, it would explode into millions of
> rows.
> My question is, is there a more sane way to store this information, while
> still being able to find all the IP ranges that have the given IP in them?
> I've been only dealing with Cassandra for a week or two, so I don't know
> about the inner details of what can be done, but I do have programming
> experience and am not afraid to get my hands dirty, in case it can be solved
> by writing some extension to Cassandra.

So, this is just off the top of my head, and I'm not an expert, but
perhaps this will give you some ideas:

I'm assuming you're talking IPv4.  If it's IPv6, you'll need to do some tweaking

Create a CF with RowKey of AsciiType and Column Names of LongType.
Values are BytesType.  Basically you create one Row for each /8 and
name the row with the network address of that row.  So you'd have:

etc as row keys

Then for every /16 under that, store a single name/value pair where
the name is the inet_aton encoded value of the IP address of the /16
network and the value is a bitmask representing the IP's of the /16.
Set 1 = filter, 0 = don't filter.  Basically the first bit would be
0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc.

So basically you'll have:
255 rows
each row with 255 columns
each column would store 8K bytes (since 2^16/8 = 8K)

Alternatively, you could store 16K columns per row (each column is a
/24) and each column would have 8 bytes.  Off the top of my head I'm
not sure which would be faster, but the first solution would be more
disk space efficient.  If you need to update your bitmasks regularly,
you're probably better off with the second solution.

Wrap a little API around this and you have fast and direct access to
know if a given IP should be filtered or not.

Thanks, these are also good suggestions, but I'll go with Brandon's suggestion first (as it seems to be more suitable for my scenario - I not only need to know if the IP needs to be filtered, but also what range was it found in.

Tamas Marki