incubator-couchdb-user mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: 3 Newbie questions
Date Thu, 25 Sep 2008 14:21:05 GMT

On Sep 25, 2008, at 15:17 , Will Schenk wrote:

> Hey all
>
> I went to jchris' talk at columbia a week or so ago, and have been  
> playing around with couchdb ever since.  I'm not sure that I like  
> all of the high level design decisions (i for one like types, and I  
> think I'm running in this below)

We like types as well, we are just not that strict about them :)


> but I wanted to actually use it because there certainly are some  
> neat things.  I have a couple of newbie usage questions.
>
> The scenario is that I'm building a spider-type thing and going to  
> be processing the "remote resource" into a specific localized  
> "document".  So I'm going to be pulling in say 3 pages and from that  
> going to produce both a document that describes the restaurant, has  
> its menu and reviews, knows its lat and long and going to serve  
> them up on a map.  I'm following the architecture that I described  
> in http://benchcoach.com/papers/scraping and am basically  
> reimplementing menumaps as a proof of concept with couchdb.
>
> I'm using merb and relaxdb at the moment, but I think I may need to  
> get a little lower level.
>
> Question 1:  Where do I store the original documents?
>
> Right now I have a "RemoteUrl" document which contains the last- 
> modified, etag, encoding, and the content itself.  (It's very  
> important in the design that I keep the original content and all  
> previous versions around.)  For some reason, I can't store the  
> content directly -- I need to base64 encode it, which seems like a  
> problem with the ruby json library.

Yes, there is an API that doesn't require base64. See "Standalone  
Attachments" on http://wiki.apache.org/couchdb/HttpDocumentApi

You could try using either attachments to a single document for the  
original data or use a document + one attachment per revision of your  
remote resource and treat the entity "remote object & history" as a  
view.
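
To make the attachment route concrete, here is a minimal sketch of the request a client would send: a plain PUT of the raw bytes to the attachment's own URL, with the content type in a header and no base64 step. The server URL, database name, document id and attachment name are all made up for illustration, and attachmentRequest is a hypothetical helper, not part of any library.

```javascript
// Build the HTTP request for a standalone attachment upload.
// Hypothetical helper: every name passed in below is an assumption.
function attachmentRequest(server, db, docId, attName, rev, body, contentType) {
  var url = server + "/" + encodeURIComponent(db) +
            "/" + encodeURIComponent(docId) +
            "/" + encodeURIComponent(attName);
  // The rev parameter is only needed when the document already exists.
  if (rev) {
    url += "?rev=" + encodeURIComponent(rev);
  }
  return {
    url: url,
    options: {
      method: "PUT",
      headers: { "Content-Type": contentType },
      body: body // raw text or bytes, no base64 wrapping
    }
  };
}

// One attachment per fetched revision of the remote resource.
var req = attachmentRequest(
  "http://localhost:5984", "spider", "remote-url-1", "rev-0001.html",
  null, "<html>original page</html>", "text/html"
);
console.log(req.options.method, req.url);
```

Any HTTP client can send the result; with a fetch-capable runtime that would be fetch(req.url, req.options).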


> But couchdb is slow when it has all of these 200K documents sitting  
> around in it.  Is this not the right sort of usage?  I've created a  
> map/reduce view for the "latest revision" like this:
>
>
> function(doc) {
>   if( doc.class == "RemoteUrl" && doc.content ) {
>     emit(doc.normalized_url, doc );
>   }
>  }
>
> function(key,values,rereduce) {
>  if(rereduce) {
>    return values;
>  } else {
>    var max = values[0].created_at;
>    var doc = values[0];
>    for( i = 0; i < values.length; i++ ) {
>     if( values[i].created_at > max ) {
>      doc = values[i]; max=values[i].created_at;
>      }
>    }
>  }
>  return doc;
> }
>
> but it takes a long time to run and see.  But this is probably  
> because it actually needs to load up all the data, send it over a  
> pipe to the javascript process, have it do it, send it back, and  
> repeat the process with the reduce step.  It's pretty slow.  Is there  
> a better way to do this?  My guess is that if I make this into two  
> queries, and don't emit the doc.content (the actual content of the  
> file) that it would be a lot faster, but that seems pretty ugly.   
> I.e. just map [doc._id,doc.created_at] and then make another trip to  
> pull back the id.  But you still have the problem of view creation  
> taking forever.  Does anyone have any suggestions?

a) "The value returned from a reduce should grow at a rate no bigger  
than log(N) of values processed". Your reduce returns whole documents,  
content included (and the entire values array on rereduce), which is  
why you see your view being slow.

b) the back-and-forth of data between CouchDB and Spidermonkey  
certainly takes some time, but it is not the limiting factor here. A  
document is indexed only once. Results are cached. Your first query  
will go through all your data, all subsequent queries will be lightning  
fast.
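
To illustrate point a), here is one possible rewrite of the "latest revision" view, as a sketch only. It emits a small pointer (id and timestamp) instead of the whole document, and the reduce returns a single pointer, so the reduce value stays constant in size. It assumes created_at is a number or an ISO 8601 string that sorts chronologically. The emit stand-in, the sample documents and their ids are made up so the snippet runs outside CouchDB.

```javascript
// Local stand-in for the emit() that CouchDB provides to map functions.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: emit a small pointer per revision instead of the whole document.
var map = function (doc) {
  if (doc.class === "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, { id: doc._id, created_at: doc.created_at });
  }
};

// Reduce: keep only the newest pointer. Input and output have the same
// shape, so the same code also handles the rereduce pass.
var reduce = function (keys, values, rereduce) {
  var latest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > latest.created_at) {
      latest = values[i];
    }
  }
  return latest;
};

// Tiny simulation with made-up documents.
var docs = [
  { _id: "a1", class: "RemoteUrl", normalized_url: "http://example.com/",
    created_at: "2008-09-01T10:00:00Z", content: "v1" },
  { _id: "a2", class: "RemoteUrl", normalized_url: "http://example.com/",
    created_at: "2008-09-20T10:00:00Z", content: "v2" },
  { _id: "b1", class: "Restaurant", name: "not a RemoteUrl" }
];
docs.forEach(function (doc) { map(doc); });
var latest = reduce(null, rows.map(function (r) { return r.value; }), false);
console.log(latest.id); // prints "a2"
```

Queried with group=true, a view like this yields one small row per normalized_url; the content itself is then fetched by id in a second request.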


> Question 2: I'm using RelaxDB right now, and it only really wants to  
> work for one database per environment.  Seeing as how slow couchdb  
> is at processing these documents, I was thinking that I'd want to keep  
> the bulk-data stuff in its own database so that the other views won't  
> need to process over the whole data set.

See above, wrong conclusion. Treating databases as tables is a  
terrible idea. Sorry RelaxDB folks. See  
http://upstream-berlin.com/2008/09/25/a-couchdb-primer-for-an-activerecord-mindset/  
for a discussion on different Ruby libs.


>  The views really work by document "type", so there's no need to  
> pump the huge amount of data from the erlang process to the  
> javascript process when all it's doing is seeing that doc.class !=  
> "RemoteUrl".  (Which is why I'd want to have types, but no matter, I  
> guess we can hack them on the side like this!)  I'm guessing that's  
> why it's falling down.  So if I could spit out the "web-cache"  
> database from "parsed" database I think it would be a little  
> faster.  I'm wondering what people think about this sort of design  
> decision, and how they would suggest implementing it.
>
> Question 3: Say I eventually get something which has 2 floating  
> point attributes.  (e.g. lat and long.)  How would I get all of the  
> documents where those were in a square?  i.e.
> select * from places where places.lat >= left_lat and places.lat <=  
> right_lat and places.longitude >= top_longitude and places.longitude <=  
> bottom_longitude;
>
> I can see how you'd do this with one dimension, but I'm not sure  
> how you'd do it with the second.  Especially since you need to make  
> all these data round trips...

Again, no big data-roundtrip problems. See above. With views, you  
could do each dimension in a separate query and then intersect the two  
results in your application. Or employ a GIS indexer / searcher over  
the external indexing interface.
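
The two-query approach could look roughly like this on the application side. This is a sketch under assumptions: the view names by_lat and by_lng, the lat/lng field names, the document ids and the coordinates are all hypothetical. Each row is assumed to carry the id of the document that emitted it, as view rows do.

```javascript
// Hypothetical map functions for two views, one per dimension:
//   by_lat: function (doc) { if (doc.lat !== undefined) emit(doc.lat, null); }
//   by_lng: function (doc) { if (doc.lng !== undefined) emit(doc.lng, null); }
// Each view is queried once, with startkey/endkey set to the edges of the
// box in that dimension.

// Intersect the two result sets by document id.
function intersectById(latRows, lngRows) {
  var inLatRange = new Set(latRows.map(function (row) { return row.id; }));
  return lngRows
    .filter(function (row) { return inLatRange.has(row.id); })
    .map(function (row) { return row.id; });
}

// Made-up rows, shaped like view results.
var latRows = [
  { id: "p1", key: 40.71, value: null },
  { id: "p2", key: 40.75, value: null }
];
var lngRows = [
  { id: "p2", key: -73.99, value: null },
  { id: "p3", key: -73.95, value: null }
];
var inBox = intersectById(latRows, lngRows);
console.log(inBox); // only "p2" falls inside both ranges
```

The cost is that each query can return many rows that lie outside the box in the other dimension, which is where a dedicated GIS index becomes attractive.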


> Thanks in advance.  I think there are a lot of very interesting  
> ideas in Couchdb. It seems like a lot of this stuff can't be done  
> nearly as well as with a sql database, but I'm hoping that it's just  
> me being ignorant.

CouchDB is no silver bullet. Some things an RDBMS is designed for are  
not elegantly solvable in CouchDB. I find it hard though, to find  
these special cases :) In general, I can only advise to try and forget  
everything you know about databases from the RDBMS-point-of-view.  
It'll take some time (took me months) to get into the CouchDB mindset.  
A lot of the "hard-to-impossible"-scenarios start to look easier.

Cheers
Jan