couchdb-user mailing list archives

From Javier Julio <jjfut...@gmail.com>
Subject Re: Large document design question (updated)
Date Wed, 23 Feb 2011 18:14:34 GMT
Nicolas,

Great question. From what I've learned reading the guide and the wiki, I think what you want
here is a combination of Alternatives 1 and 2. Keeping everything in one document, as you have
done, is the usual suggestion, but it has limits, and since you are hitting those limits the
alternative approaches are probably the better choice. You also can't always know up front
whether what you're storing will later get too big, or whether multiple users will work with
it (think comments on a blog post). So Alternatives 1 and 2 would be a great place to start.

So basically you can break it down into two document "types": one document with a type of,
say, "website" that just contains the general site info, and a second document with a type
of "page" that holds the page content as well as the website id, whether that id is a URL
or one of the ids CouchDB generates.
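
To make that concrete, here is a rough sketch of the split, not anything taken from the
guide. The names splitCrawlDoc and mapPagesByWebsite and the "website_id" field are made up
for illustration, and emit is passed in as a parameter only so the sketch runs outside
CouchDB (inside a real view it is a global and the map function takes just doc).

```javascript
// Hypothetical helper: split one large crawl document into a small "website"
// document plus one "page" document per entry in its "pages" array.
function splitCrawlDoc(doc) {
  // Everything except "pages" stays on the website document (including _id/_rev).
  const { pages = [], ...siteFields } = doc;
  const website = { ...siteFields, type: "website" };
  const pageDocs = pages.map((page) => ({
    ...page,
    type: "page",
    website_id: doc._id, // assumed field name: back-reference to the website doc
  }));
  // Page docs carry no _id here, so CouchDB would generate one on insert.
  return [website, ...pageDocs];
}

// Map function in the style of a CouchDB view. Keys are arrays so that a
// website row sorts directly before the rows of its pages.
function mapPagesByWebsite(doc, emit) {
  if (doc.type === "website") {
    emit([doc._id, 0], null);
  } else if (doc.type === "page") {
    emit([doc.website_id, 1, doc.url], null);
  }
}
```

With keys shaped like that, querying the view with startkey=["http://www.website.com/"] and
endkey=["http://www.website.com/", {}] should return the website row followed by its page
rows, which is roughly the "assembled" result the single big document gave you.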

Storing the pages as attachments (Alternative 3) is an interesting idea. I have no idea whether
it would benefit you in any way, so I will let others comment on that.

Hope this helps.

Ciao!
Javi

On Feb 23, 2011, at 12:25 PM, Nicolas Peeters wrote:

> Hi CouchDB community,
> 
> (Sorry, the previous email was sent too quickly...)
> 
> I basically have a design "best practices" question. We are using CouchDB to
> store crawled web content. The document is pretty self-explanatory: the id
> is the URL and there's a "pages" array that contains all the text from the
> web pages.
> Potentially, this document can grow very quickly to a large size (> 20 MB).
> It seems that we run into issues
> (https://issues.apache.org/jira/browse/COUCHDB-893) when creating a view with
> objects that are larger than 9 MB (in our case).
> 
> {
>   "_id": "http://www.website.com/",
>   "_rev": "1-33c75795126ff81b0125156b88593cc0",
>   "metadata1": "blabla",
>   "metadata2": "blabla",
>   "pages": [
>     {
>       "description": "",
>       "text": "A lot of text comes here....:",
>       "url": "http://www.website.com/",
>       "title": "The title of this website /",
>       "keywords": ""
>     },
>     {
>       "description": "",
>       "text": "A lot of text comes here....:",
>       "url": "http://www.website.com/contact/",
>       "title": "Contact Page",
>       "keywords": ""
>     }
>     // MANY other pages here
>   ],
>   "crawlDate": "2011-02-10T12:30:07.416+01:00"
> }
> 
> This document structure is not working very well for us. We are thinking
> about the following alternatives. We would really appreciate it if you could
> give expert modelling advice.
> 
> - Alternative 1) Create a "page" document where we would have 1 page
> (description, text, parent_url (which would be the _id of the original
> doc), url, ...) per document. The rest of the data contained in the
> original doc would be duplicated/denormalized. We could then create a view
> that "assembles" all the pages for a given parent_url (which in essence
> would have the same effect as the original implementation).
> 
> - Alternative 2) Model in one-to-many fashion as described here:
> http://wiki.apache.org/couchdb/EntityRelationship
> 
> - Alternative 3) Keep the design as is, but store the "page" content as an
> attachment when we store the object. (Subquestion: would that influence the
> size of the doc?)
> 
> - Alternative 4) Keep the design as is and change some settings in the
> configuration that I don't know about.
> 
> Subquestion: is there any particular design reason why this issue
> (https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Any good
> workaround (apart from recompilation!)? Any estimate of when this will be
> fixed in a release version?
> 
> Thank you for your help and advice.
> 
> Nicolas
> 
> PS: The reason that we need a view is that we are using a Document Update
> handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
> incremental updates, and that requires some kind of view. The incremental
> updates work fine for normal-sized documents.

