From: "Peter Herndon"
To: couchdb-user@incubator.apache.org
Date: Mon, 24 Nov 2008 12:24:49 -0500
Subject: Evaluating CouchDB
Message-ID: <3a48a11f0811240924r38e0f5aeo73f1fc28449a420@mail.gmail.com>

Hi all,

I'm in the process of looking at various technologies to implement a "digital object repository". The concept, and our current implementation, come from http://www.fedora-commons.org/. A digital object repository is a store for managing objects, each made up of an XML file that describes the object's structure and includes object metadata, plus one or more binary files as various datastreams.

As an example, take an image object: the FOXML (Fedora Object XML) file details the location and kind of the datastreams, includes Dublin Core and MODS metadata in namespaces, and includes some RDF/XML that describes the object's relationship to other objects (e.g. isMemberOf collection). The datastreams for the image object include a thumbnail-sized image, a screen-sized image (roughly 300 x 400), and the original image in its full resolution. Images are not the only content type handled by the software; pretty much anything can be managed by the repository: PDFs, audio, video, XML, text, MS Office documents, whatever you want.
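
To make the object model concrete, here is a minimal sketch of how one such image object might look if it were expressed as a single JSON document, which is the format CouchDB stores. This is only an illustration and not how Fedora or CouchDB defines anything: apart from `_id` (CouchDB's document ID field), every field name, identifier, and URL below is hypothetical.

```python
import json

# Hypothetical image object. Every field name, ID, and URL here is
# illustrative only; none of it comes from Fedora or CouchDB.
image_object = {
    "_id": "image-0001",
    "type": "image",
    "metadata": {
        # Stand-ins for the Dublin Core and MODS metadata in the FOXML file.
        "dc": {"title": "Example image", "creator": "Example lab"},
        "mods": {},
    },
    # Stand-in for the RDF/XML relationship statements.
    "relationships": [
        {"predicate": "isMemberOf", "object": "collection-0001"},
    ],
    # One entry per datastream; "location" is either "local" or an HTTP URL.
    "datastreams": [
        {"name": "thumbnail", "mime": "image/jpeg", "location": "local"},
        {"name": "screen", "mime": "image/jpeg", "location": "local"},
        {"name": "original", "mime": "image/tiff",
         "location": "http://files.example.org/image-0001/original.tif"},
    ],
}


def datastream_names(obj):
    """Return the names of all datastreams attached to an object."""
    return [ds["name"] for ds in obj["datastreams"]]


def external_datastreams(obj):
    """Return the datastreams that are referenced by HTTP rather than stored locally."""
    return [ds for ds in obj["datastreams"] if ds["location"].startswith("http")]


if __name__ == "__main__":
    print(json.dumps(image_object, indent=2))
```

The sketch keeps the metadata, relationships, and datastream descriptions together in one document, while leaving open whether the binary data itself lives inside or outside the database.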
The repository software provides access control and exposes APIs (both SOAP and, to a limited extent, REST) to manage objects, their metadata, and their binary datastreams. The XML is stored locally on the file system, and the datastreams can be either stored locally or referenced by HTTP.

The problem with the software is that it has a great architectural vision, but the implementation is of variable quality. There are lots of different little pieces, and many of them were not written with best practices in mind, or have had zero exposure to real-world environments, and the code reflects that. Plus, my days of slinging Java and enjoying it are long since past.

Our current implementation consists of a Java front end, plus the repository on the back end. We have approximately 40 GB of images stored in the repository at the moment, from our pilot project. We have four other departments wanting to use the software, either in a group repository or in a dedicated repository of their own. The most intimidating project is one that currently has 20+ TB of images, and anticipates creating and ingesting 240+ GB more per day when in full swing. We don't really expect to ingest that much data directly into the repository, as our network would be a major bottleneck -- the lab that creates the data is physically located a good distance away from our data center, and those images are already being transferred once to a file share at the data center. If we continue with our current back end, we'll likely stick a web server in front of the file share and use the HTTP reference, rather than transferring the images again to the repository's storage.

Anyway, that's my current use case, and my next use case. I know that CouchDB isn't finished yet, and hasn't been optimized yet, but does anyone have any opinions on whether CouchDB would be a reasonable fit for managing the metadata associated with each object?
And, likewise, would CouchDB be a reasonable fit for managing the binary datastreams? Would it be practical to store the datastreams in CouchDB itself, and if so, up to what size or throughput limit? Would it be better to store the datastreams externally and use CouchDB to manage the metadata and access control?

Also, looking down the road, are there plans for CouchDB's development that would improve its fitness for this purpose in the future?

Thanks very much for any insight you can share,

---Peter Herndon
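
For a sense of scale, the figures quoted above (240+ GB per day of new images, 20+ TB already on hand) can be turned into sustained transfer rates. The following is a rough back-of-the-envelope sketch: it assumes decimal units (1 GB = 10**9 bytes) and a perfectly even load across the day, and the 1000 Mbit/s link speed in the usage lines is a hypothetical figure, not a measurement of our network.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(gb_per_day):
    """Return (megabytes/s, megabits/s) needed to move gb_per_day evenly over 24 hours.

    Uses decimal units: 1 GB = 10**9 bytes, 1 MB = 10**6 bytes.
    """
    bytes_per_second = gb_per_day * 10**9 / SECONDS_PER_DAY
    return bytes_per_second / 10**6, bytes_per_second * 8 / 10**6


def days_to_transfer(terabytes, megabits_per_second):
    """Return the days needed to move `terabytes` over a link of the given speed.

    Assumes the link is fully and continuously used, which is optimistic.
    """
    bits = terabytes * 10**12 * 8
    seconds = bits / (megabits_per_second * 10**6)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    mb_s, mbit_s = sustained_rate(240)
    print(f"240 GB/day is about {mb_s:.2f} MB/s ({mbit_s:.1f} Mbit/s) sustained")
    # The 1000 Mbit/s link speed is a hypothetical value for illustration.
    print(f"20 TB over 1000 Mbit/s takes about {days_to_transfer(20, 1000):.2f} days")
```

Real ingest would be burstier than this average suggests, so the numbers are a lower bound on the capacity needed rather than a sizing recommendation.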