Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: domain of ziggythehamster@gmail.com
 designates 209.85.214.52 as permitted sender)
MIME-Version: 1.0
Sender: ziggythehamster@gmail.com
In-Reply-To: <A7D50E04F38FD44D9D914F2ABCA592BF2E6E69078B@BE259.mail.lan>
References: <A7D50E04F38FD44D9D914F2ABCA592BF2E6E690685@BE259.mail.lan>
	<08E52809-C962-4E9C-AFB8-397EA201580E@utt.fr>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E6906AD@BE259.mail.lan>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E6906BB@BE259.mail.lan>
	<CABvT1DGGZ0344rwYwbS=n=ctF2YKjLWVqg-fq4zX8YhyAE8QEQ@mail.gmail.com>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E6906DE@BE259.mail.lan>
	<CAPmACT7BMOQHF0Ny==qPqrP4kinAa9eCPy9bfF=CV2ucas_WjQ@mail.gmail.com>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E690751@BE259.mail.lan>
	<CABvT1DHd67vuXnO+5aC4QOEGFW56x0tDBJEiKCx8anbrWxWs+Q@mail.gmail.com>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E690788@BE259.mail.lan>
	<CAC7vo1CqDrZbk7BLwAxHv8QVTQcAf6qZp733+PayMOkiK+RiNA@mail.gmail.com>
	<A7D50E04F38FD44D9D914F2ABCA592BF2E6E69078B@BE259.mail.lan>
Date: Fri, 25 May 2012 10:32:06 -0500
Message-ID: 
 <CAC7vo1AEybzvGkq6RAe6h2y8NxqexHOF2VugXKTXa7KtPu0syw@mail.gmail.com>
Subject: RE: Am I doing something fundamentally wrong?
From: Keith Gable <ziggy@ignition-project.com>
To: user@couchdb.apache.org
Content-Type: multipart/alternative; boundary=0015175df0522144e304c0de1298

--0015175df0522144e304c0de1298
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

All of our documents are at most 500 KB. Like others have said,
serialization and deserialization are going to take a while with 10MB
documents. Definitely don't emit them in a map :-)
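
A rough sketch of what I mean, not taken from your actual views. The field
names (type, customer, lines) are made up, and the emit() stub is only there
so the snippet runs standalone under Node; inside CouchDB the view server
provides the real emit().

```javascript
// Sketch only: the field names (type, customer, lines) are hypothetical.
// CouchDB supplies emit() inside the view server; this stub just collects
// rows so the map functions can be run standalone under Node.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Wasteful: keys by doc._id (already the key of _all_docs) and copies the
// whole document into the index, so a 10MB doc becomes a 10MB view row.
function wastefulMap(doc) {
  emit(doc._id, doc);
}

// Leaner: key by the attribute you want to look up and emit a small value.
// Every view row already carries the document id, and include_docs=true
// can fetch the document body at query time if it is really needed.
function leanMap(doc) {
  if (doc.type === "order" && doc.customer) {
    emit(doc.customer, doc.lines ? doc.lines.length : 0);
  }
}

const doc = { _id: "o1", type: "order", customer: "acme", lines: [1, 2, 3] };
leanMap(doc);
console.log(JSON.stringify(rows)); // [{"key":"acme","value":3}]
```

The lean version keeps the index small because only the key and a number
are serialized per document, rather than the document itself.
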
On May 25, 2012 10:09 AM, "Mike Kimber" <mkimber@kana.com> wrote:

> Keith,
>
> Thanks for the reference point. From what I can tell, the issue is not the
> number of documents, it's the size: we only have 90K of them, but they are
> on average 0.5MB in size, with 600+ of them being over 10MB. How big is one
> of your documents?
>
> Mike
>
> -----Original Message-----
> From: ziggythehamster@gmail.com [mailto:ziggythehamster@gmail.com] On
> Behalf Of Keith Gable
> Sent: 25 May 2012 16:00
> To: user@couchdb.apache.org
> Subject: RE: Am I doing something fundamentally wrong?
>
> View generation only takes a second for me... my data set is half a GB or
> so and it takes up around 800MB including about a dozen views.
>
> One thing to remember is never to emit whole documents or doc IDs. Those
> are included for free, so emitting a doc or doc id is a waste of CPU cycles
> and space.
> On May 25, 2012 9:44 AM, "Mike Kimber" <mkimber@kana.com> wrote:
>
> > :-) e-mail is not a very good form of communication; apologies. As I said
> > in my original post, we live in a world of "clever" workarounds, and the
> > maps you are referring to are my "clever" workaround for the time it
> > takes to build views against our data set, so that I can do ad-hoc
> > analytics/information discovery on our data set.
> >
> > I place the doc._id in the emit KEY as it's guaranteed unique to each
> > document. I create a map with header information in it, and then I create
> > a detailed map out of an array that's in my document; this has the same
> > doc._id as its KEY. I then use the Luciddb couchdb connector (
> > https://github.com/dynamobi/conn-couchdb ) to pull the two maps/views into
> > luciddb tables (the other NVPs become columns), at which point I can join
> > them on the doc._id KEY and then start changing the grouping of data in
> > real time (seconds vs a 16 hour view rebuild). Once I have found what
> > I'm looking for (data and logic) and am happy that it's going to remain
> > static (i.e. I don't have to change the grouping/index members and/or
> > order), we create a proper map reduce, of which we now have 13. An
> > example can be found at:
> >
> > https://gist.github.com/2788255
> >
> > Oddly enough, whilst these have far more complex JS than the map I
> > provided before (https://gist.github.com/2774485), they take a similar
> > time to build, although the design doc takes 38 GB (post compaction) of
> > space vs the 8GB of raw documents!! (i.e. they have less data in them, so
> > why are they so much bigger?) This suggests to me that iterating through
> > large documents is the issue. Even an incremental view update against 95
> > new documents takes 3.5 minutes (100% CPU burn), so I don't get instant
> > data access unless I use ?stale=update_after
> >
> > So can I use a list function with all_docs and avoid the view build for
> > data discovery?
> >
> > I know a couchdb view is equivalent to a DBMS index, which is again why I
> > am questioning why it takes so long to build them and why they use so much
> > space
> >
> > I have couchdb lucene installed and it's excellent, but it only compounds
> > my questions re view sizes and view generation, as I can index the whole of
> > all my documents in far less time than it takes to run the map reduces, and
> > it only uses 2.4GB of disk space!!
> >
> > Clearly I seem to be a bit of a lone voice on this, as everyone skirts
> > around why views take so long to build, why they only run on one CPU
> > and why they take up so much space. To my thinking, view optimization
> > would save a lot of CPU cycles and disk space, which would cut
> > Cloudant's operational costs, allow some of that to be passed on to the
> > customer and benefit the wider community. The cloud is a utility model
> > and cost management is key, plus it helps the environment, which is a
> > finite resource!
> >
> > If I could just get an answer to the all_docs question I'll leave you
> > alone then!
> >
> > Thanks
> >
> > Mike
> >
> > PS: if it's any consolation, one of the chaps who works for me had a
> > similar experience with Mongodb, although the use case was different :-).
> > In that instance they moved to hadoop, which to me is a bit overkill for my
> > paltry 50GB of data (8.5 GB compressed)!
> >
> > -----Original Message-----
> > From: Robert Newson [mailto:rnewson@apache.org]
> > Sent: 25 May 2012 11:29
> > To: user@couchdb.apache.org
> > Subject: Re: Am I doing something fundamentally wrong?
> >
> > Hi Mike,
> >
> > Several posters have been trying to tell you that you didn't need to
> > build either of the views you posted. A view is to allow you to
> > retrieve data efficiently by things other than the document id (or,
> > with a reduce, to efficiently access aggregated data, sum, count and
> > the like). In both of the views you posted you key by id. Instead of
> > having either view you can use the _all_docs view with
> > include_docs=true. This view is built in lock-step with your updates,
> > so it's never stale.
> >
> > These views would be worth having if you needed super-low latency access
> > to those document fragments. If that's the case, then the cost is the view
> > build time, but it doesn't sound like you need it.
> >
> > You say you don't know the ids you want to query, but your views are
> > keyed on doc._id (same as _all_docs). I don't understand that. From
> > this distance, it seems you've built views you don't need to build
> > and you have to read them in their entirety looking for the data you
> > wanted. If that's true, or even half true, then it would explain your
> > bad experience so far.
> >
> > Finally, I will close by saying that couchdb views are akin to SQL's
> > 'CREATE INDEX'. Careful database design includes choosing which
> > indexes to build (and which type) ahead of time. It's rare, and
> > painful, to add indexes after the fact. For ad-hoc analysis, I wrote
> > https://github.com/rnewson/couchdb-lucene.
> >
> > B.
> >
> >
> > On 25 May 2012 10:52, Mike Kimber <mkimber@kana.com> wrote:
> > > > I have done this; one view per design doc. I then query each one and
> > > > wait. Currently there are 2 design docs and so 2 CPUs burn at 100%. The
> > > > other 2 CPUs/cores do nothing.
> > >
> > > > On the all_docs option, that's the point: I don't know the ids. I want
> > > > to analyse the attributes in my 80K set of documents to find the
> > > > documents that are relevant. Matthieu Rakotojaona summed it up in
> > > > response to the other half of this post, which you can see at:
> > > >
> > > > http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3CCAMiZLn1eLE9nZt3hOyxeMuOxgRfh2DWYaFEVCYO-iT3MpSsTAQ@mail.gmail.com%3E
> > >
> > > Thanks
> > >
> > > Mike
> > >
> > > -----Original Message-----
> > > From: Sean Copenhaver [mailto:sean.copenhaver@gmail.com]
> > > Sent: 24 May 2012 15:20
> > > To: user@couchdb.apache.org
> > > Subject: Re: Am I doing something fundamentally wrong?
> > >
> > > > I believe multiple design documents will build views concurrently but
> > > > one design document is basically done sequentially by the change
> > > > sequence... not positive.
> > >
> > > > So you could try splitting out your views into multiple design
> > > > documents and hit them to see if that helps spread out the CPU usage.
> > > > I want to say a lot of the CPU usage is the serialization process that
> > > > is happening communicating from CouchDB's core to the view engine
> > > > process.
> > >
> > > > Anyway, with the list you specify any view, and all_docs is a view with
> > > > all documents in a database. So if you know the ids you want to work
> > > > with you can do a normal view query with a list function.
> > > > http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
> > >
> > > That's what Robert was trying to get at.
> > >
> > > > On Thu, May 24, 2012 at 9:55 AM, Mike Kimber <mkimber@kana.com> wrote:
> > >
> > >> Robert,
> > >>
> > > >> Couchdb Lists work on top of views (and look great by the way),
> > > >> however that brings me back to my initial post (causes an error on
> > > >> this mailing list for some reason but you can find a copy here
> > > >> http://mail-archives.apache.org/mod_mbox/couchdb-user/201205.mbox/%3CA7D50E04F38FD44D9D914F2ABCA592BF2E6E690685@BE259.mail.lan%3E
> > > >> ) :-). Namely, generating a view (well, a design document with views
> > > >> in it) on our data set takes between 6 (simple view) and 16 hours,
> > > >> takes up a lot of disk space for what seems a small amount of data,
> > > >> and burns a CPU at 100% for the full time it runs, i.e. no IO
> > > >> contention and can't use multiple cores/cpus. So again, am I doing
> > > >> something fundamentally wrong, or is this just the way Couch works and
> > > >> most people don't have a data set like ours, so it does not take that
> > > >> long to create views? Or does Big Couch solve the issue (although it
> > > >> would seem 10 big couch nodes would still take an hour)?
> > > >>
> > > >> Looks like you work at Cloudant, so hopefully you might be able to
> > > >> provide some answers based on real world experience?
> > >>
> > >> Mike
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Robert Newson [mailto:rnewson@apache.org]
> > >> Sent: 24 May 2012 12:08
> > >> To: user@couchdb.apache.org
> > >> Subject: Re: Am I doing something fundamentally wrong?
> > >>
> > >> Or use a list function;
> > >>
> > >> http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
> > >>
> > >> You can use one with _all_docs and you can POST an array of ids too.
> > >>
> > >> http://wiki.apache.org/couchdb/HTTP_view_API
> > >>
> > > >> > Since 0.9 you can also issue POST requests to views where you can
> > > >> > send the following JSON structure in the body:
> > >> > {"keys": ["key1", "key2", ...]}
> > >>
> > >> B.
> > >>
> > >> On 24 May 2012 11:58, Mike Kimber <mkimber@kana.com> wrote:
> > > >> > Looking at the Show documentation and running a quick test, I don't
> > > >> > think this helps, as Show has to be referenced by a doc._id or view
> > > >> > key. If these aren't provided it returns null. This makes sense as
> > > >> > it's for generation of an HTML or XML page/doc etc.
> > >> >
> > > >> > So I'd have to get a list of all doc IDs I want and then call the
> > > >> > show function for each, and to get a filtered list I need a view.
> > >> >
> > >> > Mike
> > >> >
> > >> > -----Original Message-----
> > >> > From: Mike Kimber [mailto:mkimber@kana.com]
> > >> > Sent: 24 May 2012 10:47
> > >> > To: user@couchdb.apache.org
> > >> > Subject: RE: Am I doing something fundamentally wrong?
> > >> >
> > > >> > Aurélien,
> > > >> >
> > > >> > Thanks for the response, and apologies: I didn't get a notification
> > > >> > (e-mail) of my original post (or the 2nd one) or your response. When
> > > >> > I look at my original post in Google Reader it has "An error
> > > >> > occurred while fetching this message, sorry !", so there must be
> > > >> > something in the e-mail that the mailing list system does not like.
> > >> >
> > > >> > In response to your original response, "I'm a bit puzzled by the
> > > >> > fact that your map functions use the document ID": I do this because
> > > >> > I load the data into Luciddb and this allows me to join between
> > > >> > tables. This is not my end game; this is just a compromise due to
> > > >> > the time it takes to generate a view and my need to play/discover
> > > >> > with the data.
> > >> >
> > > >> > I will look at show to see if it helps; however, it does not really
> > > >> > answer my original questions and it does not remove the more general
> > > >> > issue that view build takes a very long time, only uses a single
> > > >> > CPU, and uses a bucket load of space even with compression on (no
> > > >> > idea why, when it has a lot less data than the original).
> > >> >
> > >> > Thanks
> > >> >
> > >> > Mike
> > >> >
> > >> > -----Original Message-----
> > > >> > From: Aurélien Bénel [mailto:aurelien.benel@utt.fr]
> > >> > Sent: 24 May 2012 07:40
> > >> > To: user@couchdb.apache.org
> > >> > Subject: Re: Am I doing something fundamentally wrong?
> > >> >
> > >> > Hi Mike,
> > >> >
> > >> >> Didn't seem to get there first time so having another go
> > >> >
> > > >> > As I wrote in my earlier post, the use of 'map' functions in both of
> > > >> > your examples is overkill.
> > > >> > Use 'show' functions instead. They won't require an index to be
> > > >> > built.
> > >> >
> > >> >
> > >> > Regards,
> > >> >
> > > >> > Aurélien
> > >>
> > >
> > >
> > >
> > > --
> > > "The limits of language are the limits of one's world. " - Ludwig von
> > > Wittgenstein
> > >
> > > "Water is fluid, soft and yielding. But water will wear away rock,
> which
> > is
> > > rigid and cannot yield. As a rule, whatever is fluid, soft and yieldi=
ng
> > > will overcome whatever is rigid and hard. This is another paradox: wh=
at
> > is
> > > soft is strong." - Lao-Tzu
> >
>

--0015175df0522144e304c0de1298--