incubator-couchdb-user mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: Not-even-yet-newbie question
Date Fri, 17 Apr 2009 00:27:45 GMT
On Thu, Apr 16, 2009 at 5:01 PM, André Warnier <aw@ice-sa.com> wrote:
> Hi good people on this list.
>
> I was recently at ApacheCON Europe, where I followed the spirited and
> spiritual Introduction to CouchDB by J. Chris Anderson and Jan Lehnardt.  I
> also browsed the CouchDB section on the ASF website. I don't know Erlang,
> although I followed the brief tutorial linked to from the website.  It
> looked simple, which makes me suspect I missed quite a lot.

Glad you enjoyed the talk!

>
> In fact, I have the impression that I missed a whole lot more than Erlang,
> so I thank in advance whoever has the patience to read this and provide
> some answers to my questions.
>
> I very much like the "Relax" motto.
>
> What I am still trying to figure out mainly, is if CouchDB would be an
> appropriate tool for the following.
>
> We basically manage information and documents for other people, as an ASP
> service.  We provide various easy ways for companies to upload their
> electronic documents of all kinds to a dedicated Internet server; we then
> process these documents à la Tika (but not with Tika) (extract meta-data
> and content), automatically index them, and store on the one side the
> meta-data and text content in a search engine à la Lucene (but not Lucene),
> and on the other side we store the original electronic document into a
> special passive file structure that we developed, and which has proven
> capable of storing reliably a few million documents so far.  In that file
> structure, each document is identified by a unique "logical number", which
> we store along with the meta-data in the search engine.  (So far in our
> case, once a document is stored, it never changes).
> Then we provide means for the customer to search and find their documents
> through a web interface to the search engine, and to retrieve the
> corresponding original documents.
>
> It works well and is very reliable, but slowly we are getting into a
> management issue due to the volumes of original electronic documents,
> which always increase. That is because our customers never throw away old
> documents, and they give us ever more varied data to handle.
> So we are concerned about increasing volumes to back up, and even more about
> volumes to restore in case something goes seriously wrong.
>
> All the above to indicate that when we ourselves talk about "documents", we
> talk about on the one hand a searchable index (which works very well, takes
> comparatively very little space and which we do not want to change for now),
> and on the other hand, stored corresponding electronic documents (blobs)
> identified and accessible via one single "key".
>
> I would be interested to understand if CouchDB would provide a reliable and
> efficient replacement for our self-developed and self-maintained storage
> structure.
>

This sounds like pretty much exactly CouchDB's sweet spot.

The interface to CouchDB for 99% of users is not Erlang, but plain old
HTTP. See the Futon Ajax administration application for a powerful
example of using CouchDB's REST API.

> The first question is whether the notion of "document" in CouchDB is
> compatible with our own notion of document.  I mean, could I define in
> CouchDB a document as consisting of a single text "key" (a globally unique
> document-id), plus a "blob" of indeterminate size (e.g. a MS-Word document,
> or a PDF, or an image, or a CAD drawing, or an email or whatever). And would
> I then be able to generate for example a search result webpage, where next
> to a document summary I can display a PDF icon, which when clicked retrieves
> the corresponding electronic document from CouchDB and sends it to the
> browser ?

In CouchDB, documents are JSON objects which can consist of arbitrary
key/value pairs as well as deeper nested data.

CouchDB documents also have attachments, which are binary blobs of any
size. Each document can have many attachments. The JSON metadata can
be indexed using CouchDB MapReduce views, while attachment data is
not sent to the view server.
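
To make the view idea concrete, here is a minimal sketch of a design
document, built in Python so it can be serialised and PUT to the database.
The design document name "_design/files", the view name "by_customer", and
the "customer" and "title" fields are all made up for illustration; the map
function itself is JavaScript, which the view server runs.

```python
import json

# Hypothetical design document: "_design/files", "by_customer" and the
# "customer"/"title" fields are illustrative names, not part of CouchDB.
design_doc = {
    "_id": "_design/files",
    "language": "javascript",
    "views": {
        "by_customer": {
            # The map function receives only the JSON document;
            # attachment bodies are not sent to the view server.
            "map": "function(doc) { if (doc.customer) { emit(doc.customer, doc.title); } }"
        }
    },
}


def design_doc_json(doc=design_doc):
    """Serialise the design document to the JSON text that would be stored."""
    return json.dumps(doc, indent=2, sort_keys=True)
```

Once stored, querying the view returns rows keyed by customer, computed
from the metadata alone.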

For your application you could store each file in CouchDB as an
attachment to a JSON document. The JSON document would hold the
searchable metadata. If you had derived versions (PDF from .doc, etc.)
you could also store those as more attachments to the same JSON
document.
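
As a rough sketch of what an upload looks like over HTTP, using only the
Python standard library: an attachment is written with a single PUT to its
own URL under the document. The database name "docs", the document id and
the file name below are assumptions for illustration, and
http://localhost:5984 assumes a default local install.

```python
import urllib.parse
import urllib.request


def attachment_url(base, db, doc_id, name):
    """Build the URL of one attachment: <base>/<db>/<doc_id>/<name>."""
    parts = [urllib.parse.quote(p, safe="") for p in (db, doc_id, name)]
    return base.rstrip("/") + "/" + "/".join(parts)


def build_attachment_put(base, db, doc_id, name, data, content_type, rev=None):
    """Prepare (but do not send) a PUT request that stores `data` as an attachment.

    `rev` is the document's current revision; it is needed when the
    document already exists.
    """
    url = attachment_url(base, db, doc_id, name)
    if rev is not None:
        url += "?" + urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )
```

Passing the returned request to urllib.request.urlopen sends it. The same
URL, fetched with a plain GET, serves the file back with the stored content
type.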

The attachments themselves can be served directly from CouchDB. Here's
an example:

http://jchrisa.net/drl/ApacheCon-Talk-Slides/CouchDB-Intro-EU.pdf

This is why I think it's such a great fit for your app.

>
> Another aspect that seems particularly interesting - if I got this right -
> is the self-replicating nature of CouchDB, which would allow us to define
> say 3 "repositories" located in different places, and which would
> automatically synchronise themselves. Yes ?

Almost. CouchDB provides triggered replication, which is the primitive
necessary to accomplish this. Operators are responsible for using
replication as their application sees fit. You could use a cron job to
keep the various databases in sync over time.
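
A script run from cron could trigger replication with one POST to the
_replicate resource. The sketch below only builds the request; the host
names and the database name "docs" are placeholders, not real servers.

```python
import json
import urllib.request


def build_replicate_request(couch_url, source, target):
    """Prepare a POST to /_replicate asking `couch_url` to copy source to target.

    `source` and `target` are either local database names or full URLs
    of databases on other CouchDB nodes.
    """
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Example with placeholder hosts: push the local "docs" database to a second site.
request = build_replicate_request(
    "http://localhost:5984",
    "docs",
    "http://site-b.example.com:5984/docs",
)
```

Sending it with urllib.request.urlopen(request) starts one replication
pass. Running the script in both directions between each pair of sites
keeps them converging.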

>
> I also seem to have understood that if one of these repositories suddenly
> became unavailable because the big one just hit, a document request would
> automatically be satisfied by the next available one in line. Yes ?

You can accomplish this yourself with a simple Apache httpd
configuration directive. CouchDB currently acts as a single node, so
any additional multi-node smarts are up to the operator. We're working
on giving it multi-node smarts, but for now it's up to you (though it's
pretty easy).
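
If you would rather not do this in httpd, the same idea can be sketched on
the client side: try each replica in order and return the first answer.
This is an alternative to the httpd approach, not what CouchDB does for
you, and the replica URLs are placeholders.

```python
import urllib.request


def fetch_with_failover(replicas, path, opener=urllib.request.urlopen, timeout=5):
    """Return the body of `path` from the first replica that answers.

    `replicas` is an ordered list of base URLs. Any OSError (which covers
    urllib's URLError and HTTPError, so also a 404) moves on to the next
    replica.
    """
    last_error = None
    for base in replicas:
        url = base.rstrip("/") + path
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:
            last_error = exc
    raise RuntimeError("no replica could serve " + path) from last_error
```

The `opener` parameter exists only so the function can be exercised
without a network; in normal use the default is fine.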

>
> Would there be some way in CouchDB to store one such document, in some
> logical group containing the original version (say OpenOffice text), along
> with its PDF/A version (which we generate when the document is originally
> stored) and with an image of the first page (ditto), in such a way that by
> using the "main key" plus some additional parameter, I can retrieve
> whichever version I need now ?

See above about derived versions...
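
One way to get "main key plus a parameter" is to map the parameter to an
attachment name on the same document. A small sketch, where the version
labels and attachment file names are invented for illustration:

```python
from urllib.parse import quote

# Hypothetical mapping from a "version" parameter to attachment names.
VERSIONS = {
    "original": "original.odt",
    "pdfa": "archive.pdf",
    "preview": "page1.png",
}


def version_url(base, db, key, version):
    """Return the attachment URL for one stored version of document `key`."""
    try:
        name = VERSIONS[version]
    except KeyError:
        raise ValueError(f"unknown version: {version!r}") from None
    return "/".join(
        [base.rstrip("/"), quote(db, safe=""), quote(key, safe=""), quote(name, safe="")]
    )
```

A link on the search results page can then point at the URL for whichever
version the icon represents.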

>
> Would I need to become proficient in Erlang before I can store a new
> document or retrieve a stored one, or can this be done using some simple
> call from some interface routine in any programming language ?
> (For example, a click on a PDF icon generates a call to a mod_perl add-on
> Apache module, which then retrieves the document from CouchDB and returns it
> to the browser) (Perl can "do JSON" or "do XML", e.g.).
>
> To generalise the above question, for what kind of action would I
> necessarily need to know Erlang ?

You'd need Erlang to modify CouchDB itself. 99% of users never see Erlang.
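
For everything else, any language with an HTTP client and a JSON parser
will do. A minimal sketch in Python (mod_perl would look much the same);
the database name and document id in the usage line are placeholders:

```python
import json
import urllib.request
from urllib.parse import quote


def get_document(base, db, doc_id, opener=urllib.request.urlopen):
    """GET <base>/<db>/<doc_id> and return the document as a dict."""
    url = "/".join([base.rstrip("/"), quote(db, safe=""), quote(doc_id, safe="")])
    with opener(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, get_document("http://localhost:5984", "docs", "doc-0001")
would return the metadata stored for that key, assuming such a document
exists.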

>
> I'll no doubt have more questions if the answers to the above do not
> discourage me, but I promise they will be shorter.
>
> Thanks in advance.
>

No problem, glad you are interested!

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
