incubator-couchdb-user mailing list archives

From Will Schenk <wsch...@gmail.com>
Subject 3 Newbie questions
Date Thu, 25 Sep 2008 13:17:28 GMT
Hey all,

I went to jchris' talk at Columbia a week or so ago, and have been 
playing around with couchdb ever since.  I'm not sure that I like all 
of the high-level design decisions (I, for one, like types, and I think 
I'm running into this below), but I wanted to actually use it because 
there certainly are some neat things.  I have a couple of newbie usage 
questions.

The scenario is that I'm building a spider-type thing that will 
process a "remote resource" into a specific localized "document". 
 So I'm going to pull in, say, 3 pages, and from those produce a 
document that describes a restaurant, has its menu and reviews, and 
knows its lat and long, and then serve them up on a map.  I'm 
following the architecture that I described 
in http://benchcoach.com/papers/scraping and am basically 
reimplementing menumaps as a proof of concept with couchdb.

I'm using merb and relaxdb at the moment, but I think I may need to get 
a little lower level.

Question 1:  Where do I store the original documents?

Right now I have a "RemoteUrl" document which contains the 
last-modified, etag, encoding, and the content itself.  (It's very 
important in the design that I keep the original content and all 
previous versions around.)  For some reason, I can't store the content 
directly -- I need to base64 encode it, which seems like a problem with 
the ruby json library.  But couchdb is slow when it has all of these 
200K documents sitting around in it.  Is this not the right sort of 
usage?  I've created a map/reduce view for the "latest revision" like 
this:

function(doc) {
  if (doc.class == "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, doc);
  }
}

function(key, values, rereduce) {
  // Both first-pass values and rereduce values are documents
  // (or previously reduced documents), so the same
  // "latest created_at wins" scan works for both passes.
  var doc = values[0];
  var max = values[0].created_at;
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > max) {
      doc = values[i];
      max = values[i].created_at;
    }
  }
  return doc;
}

but it takes a long time to run.  This is probably because it 
actually needs to load up all the data, send it over a pipe to the 
javascript process, do the work there, send it back, and repeat the 
process for the reduce step.  It's pretty slow.  Is there a better way 
to do this?  My guess is that if I make this into two queries, and 
don't emit the doc.content (the actual content of the file), it would 
be a lot faster, but that seems pretty ugly.  I.e. just map 
[doc._id, doc.created_at] and then make another trip to pull the 
document back by id.  But you still have the problem of view creation 
taking forever.  Does anyone have any suggestions?
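
For what it's worth, here's a rough sketch of that two-query version 
(untested, and it assumes my created_at values sort correctly as 
strings): the map emits only metadata, the reduce picks the newest 
_id, and the actual document comes back in a second request.

function(doc) {
  // Emit only the metadata needed to pick a winner; the 200K
  // doc.content never has to cross the pipe to javascript.
  if (doc.class == "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, { id: doc._id, created_at: doc.created_at });
  }
}

function(key, values, rereduce) {
  // Keep whichever {id, created_at} pair is newest; this works
  // the same on rereduce since the value shape doesn't change.
  var latest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > latest.created_at) {
      latest = values[i];
    }
  }
  return latest;
}

Then the second trip is just a GET on the winning id.  Less data over 
the pipe, but two round trips per lookup.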

Question 2: I'm using RelaxDB right now, and it only really wants to 
work with one database per environment.  Seeing how slow couchdb is at 
processing these documents, I was thinking that I'd want to keep the 
bulk-data stuff in its own database so that the other views won't need 
to process the whole data set.  The views really work by document 
"type", so there's no need to pump the huge amount of data from the 
erlang process to the javascript process when all the view is doing is 
seeing that doc.class != "RemoteUrl".  (Which is why I'd want to have 
types, but no matter, I guess we can hack them on the side like this!) 
 I'm guessing that's why it's falling down.  So if I could split the 
"web-cache" database out from the "parsed" database, I think it would 
be a little faster.  I'm wondering what people think about this sort 
of design decision, and how they would suggest implementing it.
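
To be concrete about why the type check doesn't help: even a view that 
bails out immediately still pays the serialization cost, because the 
whole document has to reach javascript before the check can run.  (The 
emit here is just a placeholder for whatever the view actually does.)

function(doc) {
  // The early return is cheap, but by this point the entire
  // 200K document has already been serialized and shipped
  // from erlang to the javascript view server, just to be
  // rejected here.
  if (doc.class == "RemoteUrl") return;
  emit(doc.class, doc._id);
}

Putting the bulky documents in a separate database is the only way I 
can see to avoid that cost entirely.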

Question 3: Say I eventually get something which has 2 floating point 
attributes (e.g. lat and long).  How would I get all of the documents 
where those fall inside a square (a bounding box)?  i.e.
select * from places where places.lat >= left_lat and places.lat <= 
right_lat and places.longitude >= top_longitude and places.longitude 
<= bottom_longitude;

I can see how you'd do this with one dimension, but I'm not sure how 
you'd do it with the second, especially since you need to make all 
these data round trips...
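
The only half-answer I've come up with (happy to be corrected) is to 
range-query one dimension in the view and filter the other on the 
client:

function(doc) {
  // Index on latitude only; a view key can be range-queried
  // in one dimension, so longitude has to be filtered after
  // the rows come back.
  if (doc.lat && doc.longitude) {
    emit(doc.lat, { id: doc._id, longitude: doc.longitude });
  }
}

Query it with startkey=left_lat and endkey=right_lat, then throw away 
the rows whose longitude falls outside the box.  That's one round trip 
plus client-side filtering, but it touches every document in the lat 
range, so I'm hoping there's something smarter.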

Thanks in advance.  I think there are a lot of very interesting ideas 
in couchdb.  It seems like a lot of this stuff can't be done nearly as 
well as with a sql database, but I'm hoping that it's just me being 
ignorant.

-w
http://sublimeguile.com


