incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tamas Marki <tma...@gmail.com>
Subject Re: Storing and querying IP ranges in Cassandra
Date Wed, 02 Nov 2011 08:14:34 GMT
On Tue, Nov 1, 2011 at 7:28 PM, Aaron Turner <synfinatic@gmail.com> wrote:

> On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki <tmarki@gmail.com> wrote:
> > Hello,
> >
> > I'm new to the list and also to Cassandra. I found it when I was
> searching
> > for something to replace our busy mysql server.
> >
> > One of the things we use the server for is filtering IPs based on a list
> of
> > IP ranges. These ranges can be small and big, and there are about 50k of
> > them in the database.
> >
> > In mysql this is pretty quick: they are stored as integers, and the query
> > basically looks like (say ip is the ip we want to find the all the ranges
> > for):
> >
> > select range from rangelist where ip_start<=ip and ip_end>=ip;
> >
> > I tried to move this schema to Cassandra, but it turned out to be very
> slow,
> > even with indexes on both columns. Since I also had to have an EQ
> expression
> > in the query, I added an indexed text field which was the same for all
> rows,
> > so the query in cassandra was something like this:
> >
> > select range from rangelist where type='ip' and ip_start<=ip and
> ip_end>=ip;
> >
> > This was very slow, and I imagine it is because it has to scan through
> all
> > the rows, making the index useless.
> >
> > The second thing I tried was to just expand the ranges and store
> individual
> > IPs as the keys to a column family. This is very fast to query, but the
> > problem is that I now have over 2.7 million rows, because some of the
> ranges
> > are quite large.
> >
> > As the number of ranges could change, this method could be a problem -
> > imagine we add a whole A-class range, it would explode into millions of
> > rows.
> >
> > My question is, is there a more sane way to store this information, while
> > still being able to find all the IP ranges that have the given IP in
> them?
> >
> > I've been only dealing with Cassandra for a week or two, so I don't know
> > about the inner details of what can be done, but I do have programming
> > experience and am not afraid to get my hands dirty, in case it can be
> solved
> > by writing some extension to Cassandra.
>
> So, this is just off the top of my head, and I'm not an expert, but
> perhaps this will give you some ideas:
>
> I'm assuming you're talking IPv4.  If it's IPv6, you'll need to do some
> tweaking
>
> Create a CF with RowKey of AsciiType and Column Names of LongType.
> Values are BytesType.  Basically you create one Row for each /8 and
> name the row with the network address of that row.  So you'd have:
>
> 1.0.0.0
> 2.0.0.0
> 3.0.0.0
>
> etc as row keys
>
> Then for every /16 under that, store a single name/value pair where
> the name is the inet_aton encoded value of the IP address of the /16
> network and the value is a bitmask representing the IP's of the /16.
> Set 1 = filter, 0 = don't filter.  Basically the first bit would be
> 0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc.
>
> So basically you'll have:
> 255 rows
> each row with 255 columns
> each column would store 8K bytes (since 2^16/8 = 8K)
>
> Alternatively, you could store 16K columns per row (each column is a
> /24) and each column would have 8 bytes.  Off the top of my head I'm
> not sure which would be faster, but the first solution would be more
> disk space efficient.  If you need to update your bitmasks regularly,
> you're probably better off with the second solution.
>
> Wrap a little API around this and you have fast and direct access to
> know if a given IP should be filtered or not.
>
>
Thanks, these are also good suggestions, but I'll go with Brandon's
suggestion first (as it seems to be more suitable for my scenario - I not
only need to know if the IP needs to be filtered, but also what range was
it found in.


-- 
Tamas Marki

Mime
View raw message