From: Keith Gable
To: user@couchdb.apache.org
Subject: RE: Am I doing something fundamentally wrong?
Date: Fri, 25 May 2012 10:32:06 -0500

All of our documents are at most 500 KB. Like others have said,
serialization and deserialization are going to take a while with 10MB
documents. Definitely don't emit them in a map :-)

On May 25, 2012 10:09 AM, "Mike Kimber" wrote:
> Keith,
>
> Thanks for the reference point. From what I can tell the issue is not
> the number of documents, it's the size: we only have 90K of them, but
> they are on average 0.5MB in size, with 600+ of them being over 10MB.
> How big is one of your documents?
>
> Mike
>
> -----Original Message-----
> From: ziggythehamster@gmail.com [mailto:ziggythehamster@gmail.com] On
> Behalf Of Keith Gable
> Sent: 25 May 2012 16:00
> To: user@couchdb.apache.org
> Subject: RE: Am I doing something fundamentally wrong?
>
> View generation only takes a second for me... my data set is half a GB
> or so and it takes up around 800MB including about a dozen views.
>
> One thing to remember is never to emit whole documents or doc IDs.
> Those are included for free, so emitting a doc or doc ID is a waste of
> CPU cycles and space.
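Keith's advice about not emitting whole documents can be sketched roughly as below. This is only an illustration, not a view from the thread: the `type` and `created_at` fields and the sample documents are hypothetical, and `emit` is stubbed so the map function can run outside CouchDB's view server.

```javascript
// Stub of CouchDB's emit() so the map functions can run outside the view server.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Wasteful: copies every (possibly multi-megabyte) document into the index.
function mapWholeDoc(doc) {
  emit(doc._id, doc);
}

// Leaner: index only the field you query by and emit null as the value.
// Each view row already carries the document id, and the full document
// can be pulled in at query time with ?include_docs=true.
function mapLean(doc) {
  if (doc.type === "invoice") {   // "type" is a hypothetical field
    emit(doc.created_at, null);   // "created_at" is a hypothetical field
  }
}

// Run the lean map over two made-up sample documents.
const docs = [
  { _id: "a1", type: "invoice", created_at: "2012-05-24", lines: [1, 2, 3] },
  { _id: "b2", type: "note", created_at: "2012-05-25" }
];
docs.forEach(mapLean);
```

With the lean version the index holds one small key per matching document rather than a second copy of the data, which is the saving in CPU and disk space Keith describes.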
> On May 25, 2012 9:44 AM, "Mike Kimber" wrote:
>
> > :-) E-mail is not a very good form of communication; apologies. As I
> > said in my original post, we live in a world of "clever" workarounds,
> > and the maps you are referring to are my "clever" workaround for the
> > time it takes to build views against our data set, so that I can do
> > ad-hoc analytics/information discovery on our data set.
> >
> > I place the doc._id in the emit KEY as it's guaranteed unique to each
> > document. I create a map with header information in it, and then I
> > create a detailed map out of an array that's in my document; this has
> > the same doc._id as its KEY. I then use the Luciddb couchdb connector
> > (https://github.com/dynamobi/conn-couchdb) to pull the two maps/views
> > into luciddb tables (the other NVPs become columns), at which point I
> > can join them on the doc._id KEY and then start changing the grouping
> > of data in real time (seconds vs a 16 hour view rebuild). Now when I
> > have found what I'm looking for (data and logic) and am happy that
> > it's going to remain static (i.e. I don't have to change the
> > grouping/index members and/or order), I/we create a proper map
> > reduce, of which we now have 13; an example can be found at:
> >
> > https://gist.github.com/2788255
> >
> > Oddly enough, whilst these have far more complex JS than the map I
> > provided before (https://gist.github.com/2774485), they take a
> > similar time to build (although the design doc takes 38 GB (post
> > compaction) of space vs the 8GB of raw documents!! i.e. they have
> > less data in them, so why are they so much bigger?), which suggests
> > to me it's the size of the documents, i.e. that iterating through
> > large documents is an issue.
> > Even an incremental view update against 95 new documents takes 3.5
> > minutes (100% CPU burn), so I don't get instant data access unless I
> > use ?stale=update_after
> >
> > So can I use a list function with all_docs and avoid the view build
> > for data discovery?
> >
> > I know a couchdb view is equivalent to a DBMS index, which is again
> > why I'm questioning why it takes so long to build them and why they
> > use so much space.
> >
> > I have couchdb lucene installed and it's excellent, but it only
> > compounds my questions re view sizes and view generation, as I can
> > index the whole of all my documents in far less time than it takes to
> > run the map reduces and it only uses 2.4GB of disk space!!
> >
> > Clearly I seem to be a bit of a lone voice on this, as everyone
> > skirts around why views take so long to build, why they only run on
> > one CPU and why they take up so much space, but to my thinking view
> > optimization would save a lot of CPU cycles and disk space, which
> > would cut Cloudant's operational costs, allow some of that to be
> > passed on to the customer and also benefit the wider community. The
> > cloud is a utility model and cost management is key, plus it helps
> > the environment, which is a finite resource!
> >
> > If I could just get an answer to the all_docs question I'll leave you
> > alone then!
> >
> > Thanks
> >
> > Mike
> >
> > PS: if it's any consolation, one of the chaps who works for me had a
> > similar experience with Mongodb, although the use case was different
> > :-). In that instance they moved to hadoop, which to me is a bit
> > overkill for my paltry 50GB of data (8.5 GB compressed)!
> >
> > -----Original Message-----
> > From: Robert Newson [mailto:rnewson@apache.org]
> > Sent: 25 May 2012 11:29
> > To: user@couchdb.apache.org
> > Subject: Re: Am I doing something fundamentally wrong?
> >
> > Hi Mike,
> >
> > Several posters have been trying to tell you that you didn't need to
> > build either of the views you posted. A view is to allow you to
> > retrieve data efficiently by things other than the document id (or,
> > with a reduce, to efficiently access aggregated data, sum, count and
> > the like). In both of the views you posted you key by id. Instead of
> > having either view you can use the _all_docs view with
> > include_docs=true. This view is built in lock-step with your updates,
> > so it's never stale.
> >
> > These views would be worth having if you needed super-low latency
> > access to those document fragments. If that's the case, then the cost
> > is the view build time, but it doesn't sound like you need it.
> >
> > You say you don't know the ids you want to query, but your views are
> > keyed on doc._id (same as _all_docs). I don't understand that. From
> > this distance, it seems you've built views you don't need to build
> > and you have to read them in their entirety looking for the data you
> > wanted. If that's true, or even half true, then it would explain your
> > bad experience so far.
> >
> > Finally, I will close by saying that couchdb views are akin to SQL's
> > 'CREATE INDEX'. Careful database design includes choosing which
> > indexes to build (and which type) ahead of time. It's rare, and
> > painful, to add indexes after the fact. For ad-hoc analysis, I wrote
> > https://github.com/rnewson/couchdb-lucene.
> >
> > B.
> >
> >
> > On 25 May 2012 10:52, Mike Kimber wrote:
> > > I have done this; one view per design doc. I then query each one
> > > and wait. Currently there are 2 design docs and so 2 CPUs burn at
> > > 100%. The other 2 CPUs/cores do nothing.
> > >
> > > On the all_docs option, that's the point: I don't know the ids. I
> > > want to analyse the attributes in my 80K set of documents to find
> > > the documents that are relevant.
> > > Matthieu Rakotojaona summed it up in response to the other half of
> > > this post, which you can see at:
> > >
> > > http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3CCAMiZLn1eLE9nZt3hOyxeMuOxgRfh2DWYaFEVCYO-iT3MpSsTAQ@mail.gmail.com%3E
> > >
> > > Thanks
> > >
> > > Mike
> > >
> > > -----Original Message-----
> > > From: Sean Copenhaver [mailto:sean.copenhaver@gmail.com]
> > > Sent: 24 May 2012 15:20
> > > To: user@couchdb.apache.org
> > > Subject: Re: Am I doing something fundamentally wrong?
> > >
> > > I believe multiple design documents will build views concurrently,
> > > but one design document is basically done sequentially by the
> > > change sequence... not positive.
> > >
> > > So you could try splitting out your views into multiple design
> > > documents and hit them to see if that helps spread out the CPU
> > > usage. I want to say a lot of the CPU usage is the serialization
> > > process that is happening communicating from CouchDB's core to the
> > > view engine process.
> > >
> > > Anyway, with the list you specify any view, and all_docs is a view
> > > with all documents in a database. So if you know the ids you want
> > > to work with you can do a normal view query with a list function.
> > > http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
> > >
> > > That's what Robert was trying to get at.
> > >
> > > On Thu, May 24, 2012 at 9:55 AM, Mike Kimber wrote:
> > >
> > >> Robert,
> > >>
> > >> Couchdb Lists work on top of views (and look great by the way),
> > >> however that brings me back to my initial post (causes an error on
> > >> this mailing list for some reason but you can find a copy here
> > >> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3CA7D50E04F38FD44D9D914F2ABCA592BF2E6E690685@BE259.mail.lan%3E
> > >> ) :-).
> > >> Namely, generating a view (well, a design document with views in
> > >> it) on our data set takes between 6 (simple view) and 16 hours,
> > >> takes up a lot of disk space for what seems a small amount of data
> > >> and burns a CPU at 100% for the full time it runs, i.e. no IO
> > >> contention and can't use multiple cores/cpus. So again, am I doing
> > >> something fundamentally wrong, or is this just the way Couch works
> > >> and most people don't have a data set like ours so it does not
> > >> take that long to create views, or does Big Couch solve the issue
> > >> (although it would seem 10 big couch nodes would still take an
> > >> hour)?
> > >>
> > >> Looks like you work at Cloudant, so hopefully you might be able to
> > >> provide some answers based on real world experience?
> > >>
> > >> Mike
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Robert Newson [mailto:rnewson@apache.org]
> > >> Sent: 24 May 2012 12:08
> > >> To: user@couchdb.apache.org
> > >> Subject: Re: Am I doing something fundamentally wrong?
> > >>
> > >> Or use a list function;
> > >>
> > >> http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> > >>
> > >> You can use one with _all_docs and you can POST an array of ids too.
> > >>
> > >> http://wiki.apache.org/couchdb/HTTP_view_API
> > >>
> > >> > Since 0.9 you can also issue POST requests to views where you can
> > >> > send the following JSON structure in the body:
> > >> > {"keys": ["key1", "key2", ...]}
> > >>
> > >> B.
> > >>
> > >> On 24 May 2012 11:58, Mike Kimber wrote:
> > >> > Looking at the Show documentation and running a quick test, I
> > >> > don't think this helps, as Show has to be referenced by a doc._id
> > >> > or view key. If these aren't provided it returns null. This makes
> > >> > sense as it's for generation of an HTML, XML page/doc etc.
> > >> >
> > >> > So I'd have to get a list of all doc IDs I want and then call the
> > >> > show function for each, and to get a filtered list I need a view.
> > >> >
> > >> > Mike
> > >> >
> > >> > -----Original Message-----
> > >> > From: Mike Kimber [mailto:mkimber@kana.com]
> > >> > Sent: 24 May 2012 10:47
> > >> > To: user@couchdb.apache.org
> > >> > Subject: RE: Am I doing something fundamentally wrong?
> > >> >
> > >> > Aurélien,
> > >> >
> > >> > Thanks for the response, and apologies: I didn't get a
> > >> > notification (e-mail) of my original post (or the 2nd one) or
> > >> > your response. When I look at my original post in Google Reader
> > >> > it has "An error occurred while fetching this message, sorry !",
> > >> > so there must be something in the e-mail that the mailing list
> > >> > system does not like.
> > >> >
> > >> > In response to your original response "I'm a bit puzzled by the
> > >> > fact that your map functions use the document ID": I do this
> > >> > because I load the data into Luciddb and this allows me to join
> > >> > between tables. This is not my end game; this is just a
> > >> > compromise due to the time it takes to generate a view and my
> > >> > need to play/discover with the data.
> > >> >
> > >> > I will look at show to see if it helps, however it does not
> > >> > really answer my original questions and it does not remove the
> > >> > more general issue that view build takes a very long time, only
> > >> > uses a single CPU and uses a bucket load of space even with
> > >> > compression on (no idea why, when it has a lot less data than
> > >> > the original)
> > >> >
> > >> > Thanks
> > >> >
> > >> > Mike
> > >> >
> > >> > -----Original Message-----
> > >> > From: Aurélien Bénel [mailto:aurelien.benel@utt.fr]
> > >> > Sent: 24 May 2012 07:40
> > >> > To: user@couchdb.apache.org
> > >> > Subject: Re: Am I doing something fundamentally wrong?
> > >> >
> > >> > Hi Mike,
> > >> >
> > >> >> Didn't seem to get there first time so having another go
> > >> >
> > >> > As I wrote in my earlier post, the use of 'map' functions in both
> > >> > of your examples is overkill.
> > >> > Use 'show' functions instead. They won't require an index to be
> > >> > built.
> > >> >
> > >> >
> > >> > Regards,
> > >> >
> > >> > Aurélien
> > >> >
> > >
> > >
> > >
> > > --
> > > "The limits of language are the limits of one's world." - Ludwig von
> > > Wittgenstein
> > >
> > > "Water is fluid, soft and yielding. But water will wear away rock,
> > > which is rigid and cannot yield. As a rule, whatever is fluid, soft
> > > and yielding will overcome whatever is rigid and hard. This is
> > > another paradox: what is soft is strong." - Lao-Tzu
> > >
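Robert's suggestion in the thread (query `_all_docs` with `include_docs=true`, optionally POSTing a `{"keys": [...]}` body) could be put together along these lines. This is a sketch under assumptions: the helper name `allDocsRequest`, the server address and the database name `mydb` are hypothetical, and the code only builds a request description without contacting any server.

```javascript
// Build the pieces of an _all_docs request without sending it.
// baseUrl and dbName are hypothetical placeholders.
function allDocsRequest(baseUrl, dbName, ids) {
  const url = baseUrl + "/" + encodeURIComponent(dbName) +
    "/_all_docs?include_docs=true";
  if (!ids || ids.length === 0) {
    // No ids given: a plain GET walks every document in id order.
    return { method: "GET", url: url, body: null };
  }
  // Ids given: POST a {"keys": [...]} body to fetch just those documents.
  // The request would also need a Content-Type: application/json header.
  return { method: "POST", url: url, body: JSON.stringify({ keys: ids }) };
}

const everything = allDocsRequest("http://localhost:5984", "mydb", []);
const some = allDocsRequest("http://localhost:5984", "mydb", ["key1", "key2"]);
```

Because `_all_docs` is maintained as documents are written, neither request waits on a view build; the trade-off raised in the thread is that rows come back keyed by document id only, so any filtering on other attributes has to happen on the client or in a list function.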