Date: Tue, 15 Apr 2008 13:03:46 -0700
From: Rick Hillegas <Richard.Hillegas@Sun.COM>
Subject: Re: Managing many databases
In-reply-to: <9A1A5FFB-3B42-4663-BE3C-DD8DF4CBA623@sixfriedrice.com>
Sender: Richard.Hillegas@Sun.COM
To: Derby Discussion <derby-user@db.apache.org>

Hi Geoff,

You have asked a lot of interesting questions. I will try to give you 
feedback on some of them. Hopefully others can provide more 
information. Please see my responses inline...


Six Fried Rice wrote:
> I'm a first-time poster so I hope I'm following protocol here. I 
> searched the MarkMail archive and I don't think this is a FAQ.
>
> We're considering using derby in an atypical situation, and I'm 
> looking for some general feedback on how best to proceed. The 
> application processes very large XML reports (100MB to 2GB) for our 
> customers, and then presents the data in an explorable fashion through 
> the browser. A typical report might produce around 500,000 records, 
> with up to maybe 2 million records or so at the (rare) top end. To 
> keep this under control, we are using this model:
>
> 1: The user interacts with our web site to set up an account and 
> prepare to process a report.
> 2: When they opt to process a report, a Java WebStart application 
> launches and processes the report with an embedded derby.
> 3: When the processing is complete, the derby database is jarred and 
> uploaded to the server.
> 4: At that point, all the data is completely read-only.
>
> All of this is largely working (less a few bugs) and we're very happy 
> with the performance and the notion that the heavy lifting happens on 
> the client side.
>
> Now I'm trying to decide how we will handle the server side database 
> interaction if we continue with this model. In the simplest case, I'd 
> like to interact directly with the user's individual derby databases 
> (one per report). This has several advantages:
>
> 1: We don't have any time-consuming import process to put all that 
> data into a centralized database
> 2: We get built-in partitioning of the data on the server side which 
> is good news for scalability
> 3: The data model is somewhat complex and join-heavy, and I suspect 
> several smaller databases will, in general, perform better than one 
> very large database with hundreds of millions of records
> 4: Cleanup is a breeze: to remove a report we just whack a directory 
> on the file system
>
> But I'm not sure how best to actually manage all these databases. I 
> suspect we will have on the order of 1000 databases in play, with 
> maybe 20 of those being actively used at a single busy time. It is 
> conceivable that we will have more than this, depending on the success 
> of the system. So I guess I'm looking for any general insights, plus 
> answers to a few concrete questions:
>
> 1: What are the performance characteristics of using zipped or jarred 
> DBs? It doesn't bother me to unzip them, but I saw this option in the 
> documentation and I was curious. Can these jars be in arbitrary 
> locations on the file system, and be connected to ad-hoc? Can a derby 
> server provide access to a jarred database at an arbitrary filesystem 
> location?
Please take a look at the section titled "Accessing a read-only database 
in a zip/jar file" in the Derby Developer's Guide: 
http://db.apache.org/derby/docs/10.3/devguide/
The jars can live anywhere in the file system or on the classpath.
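
As a rough illustration of the two cases, here is a minimal sketch of how 
the connection URLs could be built. The class name, method names, and the 
example path /srv/reports/report42.jar are made up for this example; the 
"jar:" and "classpath:" subprotocols are the ones described in that 
section of the Developer's Guide. Treat it as a starting point rather 
than a tested recipe.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper: builds Derby URLs for read-only databases in archives.
public class JarDatabaseUrls {

    // Database inside a zip/jar at an arbitrary file-system location:
    // jdbc:derby:jar:(pathToArchive)databasePathWithinArchive
    public static String jarUrl(String archivePath, String dbPathInArchive) {
        return "jdbc:derby:jar:(" + archivePath + ")" + dbPathInArchive;
    }

    // Database inside a jar that is already on the classpath.
    public static String classpathUrl(String dbPathInArchive) {
        return "jdbc:derby:classpath:" + dbPathInArchive;
    }

    // Opens an embedded connection. Needs derby.jar on the classpath; on
    // pre-JDBC 4 JVMs the embedded driver class must be loaded first.
    public static Connection open(String url) throws SQLException {
        return DriverManager.getConnection(url);
    }

    public static void main(String[] args) {
        // Only prints the URLs, so it runs without Derby installed.
        System.out.println(jarUrl("/srv/reports/report42.jar", "report42"));
        System.out.println(classpathUrl("/report42"));
    }
}
```

Keeping the URL construction separate from the connection call makes it 
easy to check the strings before pointing them at real archives.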
>
> 2: Are there any performance concerns with having many databases in a 
> single derby install? Would it be better to run one derby server, with 
> 1000 databases, or run multiple derby servers on the same hardware and 
> partition the databases across them? I'm not looking for exact 
> numbers, since they obviously depend on a lot of factors. But in 
> general, can I load a ton of databases into derby server and be OK? 
> (We have no problem throwing additional hardware at this system as 
> needed.)
Hard to say. Most of our performance work has measured the performance 
of many clients hammering a single database. I don't know where Derby 
maxes out in its ability to saturate multiple processors when you are 
running an application against many databases. I think that against a 
single database, there is a limit (4?) to the number of processors which 
a Derby server can keep busy. That limit may or may not rise if your 
server is managing more than one database.
>
> 3: Can derby server discover new databases if I simply copy (or 
> symlink?) a derby database directory to its DERBY_HOME? Or do the 
> databases need to be *created* programmatically through JDBC?
Derby has no heuristic for knowing where to look for databases.  I think 
that database discovery has to be done by your application.  Basically, 
you need to locate the database via a JDBC connection URL.
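
To make that concrete, here is one way application-side discovery might 
look: scan a directory of uploaded reports and build a connection URL for 
each subdirectory that looks like a Derby database. The class and method 
names are invented for this sketch, and it assumes that a database 
directory can be recognized by the service.properties file Derby writes 
at its top level. It only produces URLs; it does not connect.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds Derby database directories under a root folder.
public class DerbyDatabaseFinder {

    // Returns one embedded JDBC URL per immediate subdirectory of root that
    // contains a service.properties file (assumed marker of a Derby database).
    public static List<String> findDatabaseUrls(Path root) throws IOException {
        try (Stream<Path> children = Files.list(root)) {
            return children
                .filter(Files::isDirectory)
                .filter(dir -> Files.isRegularFile(dir.resolve("service.properties")))
                .map(dir -> "jdbc:derby:" + dir.toAbsolutePath())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Scans the directory given as the first argument, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String url : findDatabaseUrls(root)) {
            System.out.println(url);
        }
    }
}
```

Your application would call something like this whenever a new report is 
copied in, then hand the resulting URL to whatever connection code it 
already uses.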
>
> 4: Anybody have any experience with rails and derby? I see a few hints 
> online that people are doing it but I'm not too certain of the 
> stability and details on what is supported. I'll have to write my own 
> connection pooling and switching code in rails, which I don't think 
> will be too tough. But an alternative would be to build a JEE-based 
> web service to manage the derby interaction, and then have my rails 
> application interact with that data server, if Rails/Derby is not a 
> reliable or well-performing option.
Sorry, I'm out of my league here.

Hope this is a little helpful,
-Rick
>
> I know this is an open-ended question. I appreciate any time and 
> insight any of you may offer :)
>
> Thanks,
>
> Geoff