Date: Wed, 23 May 2007 10:32:59 -0400
From: "Cris Daniluk" <cris.daniluk@gmail.com>
To: users@jackrabbit.apache.org
Subject: Re: workspace / repository scalability

I'm expecting that we will have about 10-15 nodes per "document" in most
cases, though some could have 35-50. When you say an adequate hierarchical
structure, does this imply that we should try to keep our tree "bushy"?
Because we rely on an external search engine for locating content, the only
direct queries we run are lookups by sequential ID against the database.
Should we use a partitioning strategy? If so, what sort of depth should we
aim for?

Also, what sort of persistence store did you use in these tests? I would
assume, among other things, that XML is a bad choice, for example :)

I have found some evidence that people are using Jackrabbit successfully in
these situations, but not a lot of information on how they are handling
partitioning, backups, etc.
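To make the partitioning question concrete, here is roughly the kind of
bucketing I have in mind -- just a sketch against the plain JCR API, with the
path layout and all names being my own invention:

import java.security.MessageDigest;

import javax.jcr.Node;
import javax.jcr.Session;

public class DocPartitioner {

    // Spread documents across two levels of 256 hash buckets each,
    // e.g. /docs/3a/7f/123456, so that no single node ends up with
    // millions of children. Two hex levels give 65,536 leaf buckets,
    // so a few million documents works out to under ~100 children
    // per bucket.
    static Node nodeFor(Session session, String docId) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(docId.getBytes("UTF-8"));
        String level1 = String.format("%02x", digest[0] & 0xff);
        String level2 = String.format("%02x", digest[1] & 0xff);

        Node node = session.getRootNode();
        for (String name : new String[] { "docs", level1, level2, docId }) {
            node = node.hasNode(name) ? node.getNode(name) : node.addNode(name);
        }
        return node;
    }
}

I'd presumably batch session.save() across many documents rather than saving
per document, but the depth/fan-out trade-off is the part I'm unsure about --
whether two levels is enough at the 2-3TB scale or whether we should go deeper.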
Thanks!

Cris

On 5/23/07, David Nuescheler wrote:
>
> hi cris,
>
> thanks for your email.
>
> we ran a couple of tests that are in the lower terabyte range from an
> overall data perspective, but we noticed that the number of nodes and an
> adequate hierarchical structure is much more relevant than the overall
> size of the data. in our tests we went beyond 50m files (100m nodes) per
> workspace without running into substantial issues.
>
> so generally, i think that storing single-digit millions of records
> should not be an issue at all, and storing all the data in a single
> workspace should also be feasible. however, since jackrabbit scales on a
> per-workspace basis, you can always split up your data into multiple
> workspaces if you feel you might reach certain per-workspace limitations.
>
> regards,
> david
>
> On 5/22/07, Cris Daniluk wrote:
> > Hello,
> >
> > I've been considering Jackrabbit as a potential replacement for a
> > traditional RDBMS content repository we've been using. Currently, the
> > metadata is very "static" from document to document. However, we want
> > to start bringing in arbitrary types of documents. Each document will
> > specify its own metadata, and may map "core" metadata back to a set of
> > common fields. It really seems like a natural fit for JCR.
> >
> > I don't really need search (search services will be provided by a
> > separately synchronized, already existing index), but I do need content
> > scalability. We have about 500GB of binary data and 1GB of associated
> > text metadata right now (about 200k records). Ideally, the repository
> > would contain the binary data as the primary node, rather than merely
> > referencing it. However, this already large data set will probably grow
> > to 2-3TB in the next year and potentially well beyond that, with
> > millions of records.
> >
> > From browsing the archives, it seems like this would be well above and
> > beyond the typical repository size. Has anybody used Jackrabbit with
> > this volume of data? It is pretty difficult to set up a test, so I'm
> > left to rely on similar user experience. Would clustering, workspace
> > partitioning, etc. handle the volume we'd be expected to produce?
> >
> > Thanks for the help,
> >
> > Cris
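PS: On the multiple-workspace suggestion -- I assume the routing would look
something like the sketch below? The workspace names and credentials are made
up, and I'm assuming the workspaces have already been created out of band
(JCR 1.0 doesn't seem to define workspace creation, so presumably that happens
through Jackrabbit's own configuration):

import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

public class WorkspaceRouter {

    private final Repository repository;
    private final int workspaceCount;

    public WorkspaceRouter(Repository repository, int workspaceCount) {
        this.repository = repository;
        this.workspaceCount = workspaceCount;
    }

    // Pin each document to one of N workspaces by hashing its ID, so
    // every workspace carries a roughly fixed share of the total nodes.
    public Session sessionFor(String docId) throws Exception {
        int bucket = Math.abs(docId.hashCode() % workspaceCount);
        String workspaceName = "docs-" + bucket; // assumed to exist already
        return repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()),
                workspaceName);
    }
}

The catch, I'd guess, is that the bucket count is fixed for life once
documents are distributed, so we'd have to size the workspace count for the
multi-terabyte case up front.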