couchdb-user mailing list archives

From Adam Kocoloski <adam.kocolo...@gmail.com>
Subject Re: 3 Newbie questions
Date Thu, 25 Sep 2008 14:17:16 GMT
Hi Will,

On Sep 25, 2008, at 9:17 AM, Will Schenk wrote:

> Hey all
>
> I went to jchris' talk at Columbia a week or so ago, and have been
> playing around with couchdb ever since.  I'm not sure that I like
> all of the high level design decisions (I for one like types, and I
> think I'm running into this below) but I wanted to actually use it
> because there certainly are some neat things.  I have a couple of
> newbie usage questions.
>
> The scenario is that I'm building a spider-type thing and going to
> be processing the "remote resource" into a specific localized
> "document".  So I'm going to be pulling in say 3 pages and from that
> going to produce a document that describes the restaurant, has
> its menu and reviews, knows its lat and long, and going to serve
> them up on a map.  I'm following the architecture that I described
> in http://benchcoach.com/papers/scraping and am basically
> reimplementing menumaps as a proof of concept with couchdb.
>
> I'm using merb and relaxdb at the moment, but I think I may need to  
> get a little lower level.
>
> Question 1:  Where do I store the original documents?
>
> Right now I have a "RemoteUrl" document which contains the last-
> modified, etag, encoding, and the content itself.  (It's very
> important in the design that I keep the original content and all
> previous versions around.)  For some reason, I can't store the
> content directly -- I need to base64 encode it, which seems like a
> problem with the ruby json library.  But couchdb is slow when it has
> all of these 200K documents sitting around in it.  Is this not the
> right sort of usage?  I've created a map/reduce view for the "latest
> revision" like this:
>
> function(doc) {
>   if( doc.class == "RemoteUrl" && doc.content ) {
>     emit(doc.normalized_url, doc);
>   }
> }
>
> function(key, values, rereduce) {
>   if(rereduce) {
>     return values;
>   } else {
>     var max = values[0].created_at;
>     var doc = values[0];
>     for( var i = 0; i < values.length; i++ ) {
>       if( values[i].created_at > max ) {
>         doc = values[i];
>         max = values[i].created_at;
>       }
>     }
>     return doc;
>   }
> }
>
> but it takes a long time to run and see the results.  This is
> probably because it actually needs to load up all the data, send it
> over a pipe to the javascript process, do the work there, send it
> back, and repeat the process for the reduce step.  It's pretty slow.
> Is there a better way to do this?  My guess is that if I split this
> into two queries, and don't emit doc.content (the actual content of
> the file), it would be a lot faster, but that seems pretty ugly.
> I.e. just map [doc._id, doc.created_at] and then make another trip
> to pull back the id.  But you still have the problem of view
> creation taking forever.  Does anyone have any suggestions?

You should be able to get the latest revision without a reduce  
(generally a good thing to avoid if you can).  Something like

function(doc) {
   if( doc.class == "RemoteUrl" && doc.content ) {
     emit([doc.normalized_url, doc.created_at], doc);
   }
}

will give you all your documents sorted first by URL and then by  
revision time.  Then you can query the view with some combination of  
startkey, count, and maybe descending=true (depending on how your  
revision dates sort) to get the latest revision of a particular doc.   
Alternatively, if you wanted to suppress all old revisions in the view  
you could add a simpler reduce function which takes advantage of the  
map sorting the results for you:

function(keys, values) {
   return values[0]; // or maybe values.pop();
}

By the way, you're using design documents and not _temp_views, right?   
View indexing may currently be slow, but with a design doc you only
have to pay that cost once.

You'll have to test and see whether it's better to emit the doc in the  
view code or do a second trip to the DB to retrieve it.  Both are  
valid, and I think there's a patch in the works to add an  
"include_docs" or similar parameter so that you can optionally  
retrieve the associated document for any row of any view.  Reduce  
generally works best with small amounts of data.

> Question 2: I'm using RelaxDB right now, and it only really wants to
> work with one database per environment.  Seeing how slow couchdb is
> at processing these documents, I was thinking that I'd want to keep
> the bulk-data stuff in its own database so that the other views
> won't need to process the whole data set.  The views really work by
> document "type", so there's no need to pump the huge amount of data
> from the erlang process to the javascript process when all it's
> doing is seeing that doc.class != "RemoteUrl".  (Which is why I'd
> want to have types, but no matter, I guess we can hack them on the
> side like this!)  I'm guessing that's why it's falling down.  So if
> I could split out the "web-cache" database from the "parsed"
> database I think it would be a little faster.  I'm wondering what
> people think about this sort of design decision, and how they would
> suggest implementing it.

If you don't need to analyze the bulk-data in any view you could  
consider storing it as an attachment to a doc.  Details are at the  
bottom of this page:

http://wiki.apache.org/couchdb/HttpDocumentApi
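As a sketch (the doc fields here are just Will's from Question 1, and
the attachment name and content are made up -- check that wiki page for
the exact shape your couchdb version expects), a RemoteUrl doc carrying
its raw page content as a base64-encoded inline attachment could look
like:

```json
{
  "_id": "some-doc-id",
  "class": "RemoteUrl",
  "normalized_url": "http://example.com/menu",
  "etag": "\"abc123\"",
  "last_modified": "Thu, 25 Sep 2008 12:00:00 GMT",
  "_attachments": {
    "content.html": {
      "content_type": "text/html",
      "data": "PGh0bWw+Li4uPC9odG1sPg=="
    }
  }
}
```

As I understand it, attachments aren't passed to the view server, so
the bulk content never gets shipped over the pipe to javascript during
view builds.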

> Question 3: Say I eventually get something which has 2 floating  
> point attributes.  (e.g. lat and long.)  How would I get all of the  
> documents where those were in a square?  i.e.
> select * from places where places.lat >= left_lat and places.lat <=
> right_lat and places.longitude >= top_longitude and
> places.longitude <= bottom_longitude;
>
> I can see how'd you'd do this with one dimension, but I'm not sure  
> how you'd do it with the second.  Especially since you need to make  
> all these data round trips...

Yeah, I guess that's a bit tricky.  If the data volume doesn't get in  
your way you could emit [places.lat, places.long] as the key, query  
the view with your latitude range as startkey and endkey, and then  
pick out documents in your longitude range client-side.  Others may  
well have more clever suggestions.  Best,

Adam

> Thanks in advance.  I think there are a lot of very interesting
> ideas in CouchDB.  It seems like a lot of this stuff can't be done
> nearly as well as with a SQL database, but I'm hoping that it's just
> me being ignorant.
>
> -w
> http://sublimeguile.com
>
>

