couchdb-user mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Large document design question (updated)
Date Thu, 24 Feb 2011 15:51:47 GMT
Nicolas,

Storing that much text in your documents will add a lot of overhead to
your view functions—or any of the other JavaScript design doc
functions you may want to use.

Therefore, if you don't need to access the raw text of each page to
create your views, you may want to try storing the pages as attachments
to your web site document. That way, smaller JSON strings have to be
marshalled over to the JavaScript view server and parsed there.
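
As a rough illustration of the attachment approach (a sketch, not
something from the thread): the Python below moves each page's "text"
out of the JSON body and into CouchDB's inline `_attachments` field,
base64-encoded. The helper name, the "page-N.txt" attachment names and
the "attachment" back-reference field are hypothetical choices made for
this example.

```python
import base64
import copy


def move_page_text_to_attachments(doc):
    """Return a copy of doc with each page's "text" moved to an inline attachment."""
    slim = copy.deepcopy(doc)  # leave the caller's document untouched
    attachments = slim.setdefault("_attachments", {})
    for index, page in enumerate(slim.get("pages", [])):
        text = page.pop("text", "")
        name = "page-%d.txt" % index  # hypothetical naming scheme
        attachments[name] = {
            "content_type": "text/plain; charset=utf-8",
            # Inline attachments carry their body as base64 in "data".
            "data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        }
        page["attachment"] = name  # hypothetical pointer back to the text
    return slim
```

The returned dict can be saved like any other document. The page
metadata stays in the JSON body, so views can still index the url and
title fields without the page text being part of the document body.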

As for what the "best practice" is for modelling a one-to-many
relationship, it depends entirely on the update scenarios and methods
of access your application requires.
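
For comparison, here is a minimal sketch of the one-document-per-page
layout discussed further down the thread (the "website" and "page"
types and the parent_url field come from the quoted messages). The
function name and the "page:" id prefix are assumptions for this
example. The prefix is there because, in the sample document, the home
page URL is identical to the site's _id and would otherwise collide.

```python
def split_site_document(site_doc):
    """Split one large site document into a site doc plus one doc per page."""
    # Site-level document: everything except the bulky "pages" array.
    site = {key: value for key, value in site_doc.items() if key != "pages"}
    site["type"] = "website"

    page_docs = []
    for page in site_doc.get("pages", []):
        page_doc = dict(page)
        page_doc["_id"] = "page:" + page["url"]  # hypothetical id scheme
        page_doc["type"] = "page"
        page_doc["parent_url"] = site_doc["_id"]  # link back to the site doc
        page_docs.append(page_doc)
    return site, page_docs
```

With this layout each page can be updated on its own, and a view keyed
on parent_url can collect all pages of a site again.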


Cheers,

Zach

On Thu, Feb 24, 2011 at 2:27 AM, Nicolas Peeters <nicolists@gmail.com> wrote:
> Thanks for your reply. Actually it's either Alt 1. or Alt 2., I guess. I
> don't see why I should be combining. I'm really wondering what the best
> practice is (I'm leaning toward Alt. 1, by the way). It seems like Alt 2. is
> more like hacking the document model to make it look and behave like a
> relational model!
>
> Hoping to get some more advice from the experts out there!
>
> Cheers,
>
> Nicolas
>
> On Wed, Feb 23, 2011 at 7:14 PM, Javier Julio <jjfutbol@gmail.com> wrote:
>
>> Nicolas,
>>
>> Great question. I think what you want here and from what I've learned from
>> reading the guide and wiki is a combination of Alternative 1 and 2. While it
>> is suggested to do what you have done there are limits and since you are
>> hitting those limits that's when the alternative approaches come in and are
>> usually best I would think. You might not know whether what you're
>> storing will later get too big, or whether multiple users will work with
>> it (think comments for a blog post). So Alternative 1 and 2 would be
>> great to start with.
>>
>> So basically you can break it down into 2 different document "types". One
>> document with a type of say "website" that just contains the general site
>> info and then a second document with a type of "page" that has the page
>> content as well as the website id, whether that's a URL or you just use the
>> generated id's CouchDB creates.
>>
>> Interesting that you're considering storing the pages as attachments
>> (Alternative 3). No idea if this is beneficial to you in any way, so I
>> will let others comment on that.
>>
>> Hope this helps.
>>
>> Ciao!
>> Javi
>>
>> On Feb 23, 2011, at 12:25 PM, Nicolas Peeters wrote:
>>
>> > Hi CouchDB community,
>> >
>> > //Sorry, the previous email was sent too quickly...
>> >
>> > I have basically a design "best practices" question. We are using
>> > CouchDB to store crawled web content. The document is pretty self
>> > explanatory, the id is the URL and there's a "pages" array that
>> > contains all the text from the web pages.
>> > Potentially, this document can grow very quickly to a large size
>> > (> 20 MB). It seems that we run into issues
>> > (https://issues.apache.org/jira/browse/COUCHDB-893) when creating a
>> > view with objects that are larger than 9 MB (in our case).
>> >
>> > {
>> >   "_id": "http://www.website.com/",
>> >   "_rev": "1-33c75795126ff81b0125156b88593cc0",
>> >   "metadata1": "blabla",
>> >   "metadata2": "blabla",
>> >   "pages": [
>> >       {
>> >           "description": "",
>> >           "text": "A lot of text comes here....:",
>> >           "url": "http://www.website.com/",
>> >           "title": "The title of this website /",
>> >           "keywords": ""
>> >       },
>> >       {
>> >           "description": "",
>> >           "text": "A lot of text comes here....:",
>> >           "url": "http://www.website.com/contact/",
>> >           "title": "Contact Page",
>> >           "keywords": ""
>> >       }
>> >
>> >       // MANY other pages here
>> >   ],
>> >   "crawlDate": "2011-02-10T12:30:07.416+01:00"
>> > }
>> >
>> > This document structure  is not working very well for us. We are thinking
>> > about the following alternatives. We would really appreciate if you could
>> > give expert modelling advice.
>> >
>> > - Alternative 1) Create a "page" document where we would have 1 page
>> > (description, text, parent_url (which would be the _id of the original
>> > doc), url,...) per document. The rest of the data contained in the
>> > original doc would be duplicated/denormalized. We could then create a
>> > view that "assembles" all the pages for a given parent_url (which in
>> > essence would have the same effect as the original implementation).
>> >
>> > - Alternative 2) Model in One to Many fashion as described here:
>> > http://wiki.apache.org/couchdb/EntityRelationship
>> >
>> > - Alternative 3) Keep the design as is, but store the "page" content as
>> > an attachment when we store the object. (Subquestion: would that
>> > influence the size of the doc?)
>> >
>> > - Alternative 4) Keep the design as is and change some settings in the
>> > configuration that I don't know about.
>> >
>> > Subquestion: any particular design reason why this issue
>> > (https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Any
>> > good workaround (apart from recompilation!)? Any ETC when this will be
>> > fixed in a release version?
>> >
>> > Thank you for your help and advice.
>> >
>> > Nicolas
>> >
>> > PS: The reason that we need a view is that we are using a Document Update
>> > handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
>> > incremental updates, which requires some kind of view. The incremental
>> > updates work fine for normal-sized documents.
>>
>>
>
