perl-modperl mailing list archives

From Perrin Harkins <per...@elem.com>
Subject Re: changing global data strategy
Date Wed, 08 Mar 2006 19:22:02 GMT
On Tue, 2006-03-07 at 21:05 -0800, Will Fould wrote:

> we have a tool that loads a huge store of data (25-50MB+) from a
> database into many Perl hashes at startup: each session needs
> access to all of this data, but it would be prohibitive to use MySQL
> or another database for multiple, large lookups (and builds) at each
> session.  There are quite a few structures, and each is very big.

This gets asked about a lot, so I'm going to just dump all my ideas
about it here and then I'll have something to refer back to later.

I can think of four possible approaches for this kind of situation (in
order of difficulty):

1) Find a way to organize it into a database.  I know, everyone thinks
their data is special and it can't be done, but a database can be used
for most data storage if you think creatively about it.  Databases can
be very flexible if you are willing to denormalize things a bit.
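
For example, if one of those hashes were pushed into a denormalized
key/value table, the lookups could go through DBI.  Just a rough
sketch -- the table, DSN, and column names here are made up:

    use DBI;

    # Hypothetical table standing in for one big hash:
    #   CREATE TABLE lookups (name VARCHAR(32), lookup_key VARCHAR(64),
    #                         value TEXT, PRIMARY KEY (name, lookup_key));
    my $dbh = DBI->connect('dbi:mysql:database=myapp', 'user', 'password',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare_cached(
        'SELECT value FROM lookups WHERE name = ? AND lookup_key = ?');

    # Instead of $big_hash{$key}:
    my $key = 'some_key';
    my ($value) = $dbh->selectrow_array($sth, {}, 'big_hash', $key);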

2) Use an external cache server like memcached.  This will require you
to figure out how to split your data access into hash-like patterns.  It
will not be anywhere near as fast as the in-memory lookups you use now,
though.  That's the price you pay for scaling across machines.  You also
need to be aware that memcached is a cache, not a database, so it can't
be the final destination for data changes.  It can also drop data when
it gets full or when a server goes down.
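
A minimal sketch of what that looks like with the Cache::Memcached
client (the server addresses and key names are made up, and the
database stays the authoritative copy):

    use Cache::Memcached;

    my $memd = Cache::Memcached->new({
        servers => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
    });

    # Instead of $big_hash{$key}:
    my $key   = 'some_key';
    my $value = $memd->get("big_hash:$key");
    unless (defined $value) {
        # cache miss: go back to the real source of the data
        $value = load_from_database($key);             # hypothetical helper
        $memd->set("big_hash:$key", $value, 60 * 60);  # expire in an hour
    }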

3) Use local caches with networked updates.  You can use something like
BerkeleyDB, which performs really well on a local machine (significantly
better than any networked daemon), to store the data.  If you have
enough RAM to give it a big cache, the data will all be in memory
anyway.  You would still need to organize your data access into hashes.
The other part of this is handling updates, which you can do by running
a simple daemon on each machine that listens for updates, and sending
changes to a master daemon that tells all the others.  Alternatively,
you could use something like Spread, which does reliable multicast
messaging, to distribute the updates.
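
For the local store, a tied BerkeleyDB hash with a large memory cache
ends up looking a lot like the hashes you have now.  A sketch (the
paths, sizes, and key names are just placeholders):

    use BerkeleyDB;

    # Concurrent Data Store: many readers, one writer (the update daemon)
    my $env = BerkeleyDB::Env->new(
        -Home      => '/var/cache/myapp',
        -Flags     => DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB,
        -Cachesize => 256 * 1024 * 1024,
    ) or die "env: $BerkeleyDB::Error";

    tie my %big_hash, 'BerkeleyDB::Hash',
        -Filename => 'big_hash.db',
        -Flags    => DB_CREATE,
        -Env      => $env
      or die "db: $BerkeleyDB::Error";

    # Reads look just like the in-memory version:
    my $value = $big_hash{'some_key'};

    # The update listener on each machine applies changes it receives:
    $big_hash{'some_key'} = 'new_value';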

4) Write your own custom query daemon.  Most likely this would be a
multi-threaded server written in C/C++ that loads all the data and has
a protocol for querying and changing it.  This will be a lot of work,
and you'll have to do a very good job of it to beat existing fast
database servers like MySQL.  You do get to create the exact data
structures and indexes that you need, though.  I have seen a search
engine written this way, with great success, but it took a lot of work
by an expert C++ programmer to do it.  You might be able to simplify it
a little by writing it as a custom data type for PostgreSQL, but I have
no idea how hard that is or how it performs.
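
Even though the daemon itself would be C or C++, the mod_perl side
only needs a thin client for whatever protocol you invent.  A sketch,
with a completely made-up line-based protocol and host name:

    use IO::Socket::INET;

    # Imaginary wire protocol: "GET <hash> <key>\n"  ->  "<value>\n"
    sub daemon_lookup {
        my ($hash_name, $key) = @_;
        my $sock = IO::Socket::INET->new(
            PeerAddr => 'lookup-daemon.internal',
            PeerPort => 9999,
            Proto    => 'tcp',
        ) or die "connect: $!";
        print $sock "GET $hash_name $key\n";
        my $value = <$sock>;
        defined $value or die "daemon closed the connection";
        chomp $value;
        return $value;
    }

    my $value = daemon_lookup('big_hash', 'some_key');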

One thing that won't solve this is rewriting in a language with better
thread support, like Java: threads only let you share the data within a
single machine, so you would still be stuck as soon as you try to run
it on multiple servers.

Hope that helps outline your options a little.

- Perrin

