couchdb-user mailing list archives

From Nicolas Peeters <nicoli...@gmail.com>
Subject Re: Large document design question (updated)
Date Thu, 24 Feb 2011 08:27:25 GMT
Thanks for your reply. Actually, it's either Alt. 1 or Alt. 2, I guess. I
don't see why I should be combining them. I'm really wondering what the best
practice is (I'm leaning toward Alt. 1, by the way). It seems like Alt. 2 is
more like hacking the document model to make it look and behave like a
relational model!

Hoping to get some more advice from the experts out there!

Cheers,

Nicolas

On Wed, Feb 23, 2011 at 7:14 PM, Javier Julio <jjfutbol@gmail.com> wrote:

> Nicolas,
>
> Great question. From what I've learned from reading the guide and wiki, I
> think what you want here is a combination of Alternative 1 and 2. While it
> is suggested to do what you have done, there are limits, and since you are
> hitting those limits, that's when the alternative approaches come in and
> are usually best, I would think. You might not know whether, at a later
> point, what you're storing will get too big or whether multiple users will
> work with it (think comments for a blog post). So Alternative 1 and 2 would
> be great to start with.
>
> So basically you can break it down into 2 different document "types": one
> document with a type of, say, "website" that just contains the general site
> info, and a second document with a type of "page" that has the page
> content as well as the website id, whether that's a URL or one of the
> generated ids that CouchDB creates.
>
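As a rough illustration of the two document "types" described above, the sketch below splits one large crawl document into a "website" document and one "page" document per entry in "pages". The function name splitSiteDocument, the "type" field values, and the "page:" id prefix are assumptions made for this example; parent_url is the name used for the website id in Alternative 1 of the original question. The prefix is there because the first page in the sample has the same URL as the site itself, so using the bare URL as the page id would collide with the website document.

```javascript
// Hypothetical helper: split one crawl document into a "website" document
// plus one "page" document per entry in its "pages" array.
function splitSiteDocument(siteDoc) {
  // Everything except the pages array stays on the website document.
  const { pages = [], ...siteFields } = siteDoc;
  const website = { ...siteFields, type: "website" };

  const pageDocs = pages.map((page) => ({
    ...page,
    // Assumed id scheme: prefix the page URL so that a page whose URL equals
    // the site URL does not collide with the website document's _id.
    _id: "page:" + page.url,
    type: "page",
    parent_url: siteDoc._id, // back-reference to the website document
  }));

  return [website, ...pageDocs];
}

const docs = splitSiteDocument({
  _id: "http://www.website.com/",
  metadata1: "blabla",
  pages: [
    { url: "http://www.website.com/", title: "The title of this website /", text: "..." },
    { url: "http://www.website.com/contact/", title: "Contact Page", text: "..." },
  ],
  crawlDate: "2011-02-10T12:30:07.416+01:00",
});

console.log(docs.map((d) => d._id));
```

The resulting array could then be sent in one request to the database's _bulk_docs endpoint, wrapped as {"docs": [...]}.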
> Storing the pages as attachments (Alternative 3) is an interesting idea.
> I have no idea whether this is beneficial to you in any way, so I will let
> others comment on that.
>
> Hope this helps.
>
> Ciao!
> Javi
>
> On Feb 23, 2011, at 12:25 PM, Nicolas Peeters wrote:
>
> > Hi CouchDB community,
> >
> > // Sorry, the previous email was sent too quickly...
> > I basically have a design "best practices" question. We are using CouchDB
> > to store crawled web content. The document is pretty self-explanatory: the
> > id is the URL and there's a "pages" array that contains all the text from
> > the web pages.
> > Potentially, this document can grow very quickly to a large size (> 20 MB).
> > It seems that we run into issues
> > (https://issues.apache.org/jira/browse/COUCHDB-893) when creating a view
> > with objects that are larger than 9 MB (in our case).
> >
> > {
> >   "_id": "http://www.website.com/",
> >   "_rev": "1-33c75795126ff81b0125156b88593cc0",
> >   "metadata1": "blabla",
> >   "metadata2": "blabla",
> >   "pages": [
> >       {
> >           "description": "",
> >           "text": "A lot of text comes here....:",
> >           "url": "http://www.website.com/",
> >           "title": "The title of this website /",
> >           "keywords": ""
> >       },
> >       {
> >           "description": "",
> >           "text": "A lot of text comes here....:",
> >           "url": "http://www.website.com/contact/",
> >           "title": "Contact Page",
> >           "keywords": ""
> >       }
> >
> >       // MANY other pages here
> >   ],
> >   "crawlDate": "2011-02-10T12:30:07.416+01:00"
> > }
> >
> > This document structure is not working very well for us. We are thinking
> > about the following alternatives. We would really appreciate it if you
> > could give expert modelling advice.
> >
> > - Alternative 1) Create a "page" document where we would have 1 page
> > (description, text, parent_url (which would be the _id of the original
> > doc), url, ...) per document. The rest of the data contained in the
> > original doc would be duplicated/denormalized. We could then create a view
> > that "assembles" all the pages for a given parent_url (which in essence
> > would have the same effect as the original implementation).
> >
> > - Alternative 2) Model in a one-to-many fashion as described here:
> > http://wiki.apache.org/couchdb/EntityRelationship
> >
> > - Alternative 3) Keep the design as is, but store the "page" content as an
> > attachment when we store the object. (Subquestion: would that influence the
> > size of the doc?)
> >
> > - Alternative 4) Keep the design as is and change some settings in the
> > configuration that I don't know about.
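The view mentioned in Alternative 1 could look something like the sketch below. It assumes each page document carries type "page" and a parent_url field; the type field, the sample documents, and their "page:" ids are made up for this example. The small emit stub exists only so the map function can be run outside CouchDB; in a design document only the map function itself would be stored.

```javascript
// Stand-in for CouchDB's emit(), so the map function can run under plain Node.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function as it might appear in a design document's "views" section.
function map(doc) {
  if (doc.type === "page" && doc.parent_url) {
    // Compound key: all pages of one site sort next to each other.
    emit([doc.parent_url, doc.url], { title: doc.title });
  }
}

// Hypothetical sample documents: one website doc and two page docs.
[
  { _id: "http://www.website.com/", type: "website" },
  {
    _id: "page:http://www.website.com/contact/",
    type: "page",
    parent_url: "http://www.website.com/",
    url: "http://www.website.com/contact/",
    title: "Contact Page",
  },
  {
    _id: "page:http://www.other.com/",
    type: "page",
    parent_url: "http://www.other.com/",
    url: "http://www.other.com/",
    title: "Other",
  },
].forEach(map);

console.log(rows);
```

Querying such a view with startkey=["http://www.website.com/"] and endkey=["http://www.website.com/", {}] (plus include_docs=true if the full page text is needed) should return only the pages of that one site, since an object collates after any string in CouchDB's view ordering.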
> >
> > Subquestion: is there any particular design reason why this issue
> > (https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Is there
> > a good workaround (apart from recompilation)? Any ETC when this will be
> > fixed in a release version?
> >
> > Thank you for your help and advice.
> >
> > Nicolas
> >
> > PS: The reason that we need a view is that we are using a Document Update
> > handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
> > incremental updates, which requires some kind of view. The incremental
> > updates work fine for normal-sized documents.
>
>
