incubator-couchdb-user mailing list archives

From Nicolas Peeters <nicoli...@gmail.com>
Subject Large document design question (updated)
Date Wed, 23 Feb 2011 17:25:30 GMT
Hi CouchDB community,

Sorry, the previous email was sent too quickly...
I have what is basically a design "best practices" question. We are using
CouchDB to store crawled web content. The document is pretty
self-explanatory: the id is the URL, and there's a "pages" array that
contains all the text from the web pages.
Potentially, this document can grow very quickly to a large size (> 20 MB).
It seems that we run into issues (
https://issues.apache.org/jira/browse/COUCHDB-893) when creating a view with
objects that are larger than 9 MB (in our case).

{
   "_id": "http://www.website.com/",
   "_rev": "1-33c75795126ff81b0125156b88593cc0",
   "metadata1": "blabla",
   "metadata2": "blabla",
   "pages": [
       {
           "description": "",
           "text": "A lot of text comes here....:",
           "url": "http://www.website.com/",
           "title": "The title of this website /",
           "keywords": ""
       },
       {
           "description": "",
           "text": "A lot of text comes here....:",
           "url": "http://www.website.com/contact/",
           "title": "Contact Page",
           "keywords": ""
       }

       // MANY other pages here
   ],
   "crawlDate": "2011-02-10T12:30:07.416+01:00"
}

This document structure is not working very well for us. We are thinking
about the following alternatives, and we would really appreciate it if you
could give us expert modelling advice.

- Alternative 1) Create a "page" document type with one page per document
(description, text, parent_url (which would be the _id of the original
doc), url, ...). The rest of the data contained in the original doc would
be duplicated/denormalized. We could then create a view that "assembles"
all the pages for a given parent_url (which in essence would have the same
effect as the original implementation).

- Alternative 2) Model in a one-to-many fashion, as described here:
http://wiki.apache.org/couchdb/EntityRelationship

- Alternative 3) Keep the design as is, but store the "page" content as an
attachment when we store the object. (Subquestion: would that influence the
size of the doc?)

- Alternative 4) Keep the design as is and change some settings in the
configuration that I don't know about.
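
To make Alternative 1 concrete, here is a minimal Python sketch of the split.
It is only an illustration under assumptions and has not been run against
CouchDB. The "type" field, the "page:" id prefix and both function names are
made up for this example. The prefix is there only because the first page's
URL is identical to the _id of the original doc. The second function merely
emulates, in memory, what a view keyed on parent_url would return.

```python
def split_site_document(site_doc):
    """Split one large crawl doc into a small site doc plus one doc per page."""
    # Fields shared by the site and all of its pages (duplicated/denormalized).
    shared = {k: v for k, v in site_doc.items()
              if k not in ("_id", "_rev", "pages")}

    # _rev is left out here; a real update of the existing doc would need it.
    site = {"_id": site_doc["_id"], "type": "site", **shared}

    page_docs = []
    for page in site_doc.get("pages", []):
        page_doc = {**shared, **page}
        # Hypothetical id scheme: prefix the page URL so it cannot clash
        # with the site doc, whose _id is also a URL.
        page_doc["_id"] = "page:" + page["url"]
        page_doc["type"] = "page"
        page_doc["parent_url"] = site_doc["_id"]
        page_docs.append(page_doc)
    return site, page_docs


def pages_by_parent(page_docs):
    """In-memory stand-in for a view keyed on parent_url."""
    view = {}
    for doc in page_docs:
        if doc.get("type") == "page":
            view.setdefault(doc["parent_url"], []).append(doc)
    return view


example = {
    "_id": "http://www.website.com/",
    "metadata1": "blabla",
    "pages": [
        {"url": "http://www.website.com/", "text": "A lot of text"},
        {"url": "http://www.website.com/contact/", "text": "A lot of text"},
    ],
    "crawlDate": "2011-02-10T12:30:07.416+01:00",
}
site, pages = split_site_document(example)
assembled = pages_by_parent(pages)[site["_id"]]
print(len(assembled), "pages assembled for", site["_id"])
```

With this layout no single document carries the whole "pages" array, so each
document stays small regardless of how many pages a site has.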

Subquestion: is there any particular design reason why this issue (
https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Is there a
good workaround (apart from recompilation!)? Any estimate of when this will
be fixed in a release version?

Thank you for your help and advice.

Nicolas

PS: The reason we need a view is that we are using a Document Update
handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
incremental updates, which requires some kind of view. The incremental
updates work fine for normal-sized documents.
