Hi Will,
On Sep 25, 2008, at 9:17 AM, Will Schenk wrote:
> Hey all
>
> I went to jchris' talk at columbia a week or so ago, and have been
> playing around with couchdb ever since. I'm not sure that I like
> all of the high level design decisions (I for one like types, and I
> think I'm running into this below) but I wanted to actually use it
> because there certainly are some neat things. I have a couple of
> newbie usage questions.
>
> The scenario is that I'm building a spider-type thing and going to
> be processing the "remote resource" into a specific localized
> "document". So I'm going to be pulling in say 3 pages and from that
> going to produce both a document that describes the restaurant, has
> its menu and reviews, knows its lat and long and going to serve
> them up on a map. I'm following the architecture that I described
> in http://benchcoach.com/papers/scraping and am basically
> reimplementing menumaps as a proof of concept with couchdb.
>
> I'm using merb and relaxdb at the moment, but I think I may need to
> get a little lower level.
>
> Question 1: Where do I store the original documents?
>
> Right now I have a "RemoteUrl" document which contains the last-
> modified, etag, encoding, and the content itself. (It's very
> important in the design that I keep the original content and all
> previous versions around.) For some reason, I can't store the
> content directly -- I need to base64 encode it, which seems like a
> problem with the ruby json library. But couchdb is slow when it has
> all of these 200K documents sitting around in it. Is this not the
> right sort of usage? I've created a map/reduce view for the "latest
> revision" like this:
>
> function(doc) {
>   if( doc.class == "RemoteUrl" && doc.content ) {
>     emit(doc.normalized_url, doc );
>   }
> }
>
> function(key,values,rereduce) {
>   if(rereduce) {
>     return values;
>   } else {
>     var max = values[0].created_at;
>     var doc = values[0];
>     for( i = 0; i < values.length; i++ ) {
>       if( values[i].created_at > max ) {
>         doc = values[i]; max=values[i].created_at;
>       }
>     }
>   }
>   return doc;
> }
>
> but it takes a long time to run and see. But this is probably
> because it actually needs to load up all the data, send it over a
> pipe to the javascript process, have it do it, send it back, and
> repeat the process with the reduce step. It's pretty slow. Is there
> a better way to do this? My guess is that if I make this into two
> queries, and don't emit the doc.content (the actual content of the
> file) that it would be a lot faster, but that seems pretty ugly.
> I.e. just map [doc._id,doc.created_at] and then make another trip to
> pull back the id. But you still have the problem of view creation
> taking forever. Does anyone have any suggestions?

You should be able to get the latest revision without a reduce
(generally a good thing to avoid if you can). Something like

    function(doc) {
      if (doc.class == "RemoteUrl" && doc.content) {
        emit([doc.normalized_url, doc.created_at], doc);
      }
    }

will give you all your documents sorted first by URL and then by
revision time. Then you can query the view with some combination of
startkey, count, and maybe descending=true (depending on how your
revision dates sort) to get the latest revision of a particular doc.
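
To make the query side concrete, here is a rough sketch of the parameters for "latest revision of one URL". The helper name is made up, and it assumes the view key is [normalized_url, created_at] as in the map function above and that created_at values sort chronologically. With descending=true the scan starts at the high end of the range, so startkey and endkey swap roles.

```javascript
// Hypothetical helper (not part of CouchDB): builds the query string for
// fetching the newest row for a single normalized URL.
function latestRevisionQuery(normalizedUrl) {
  var params = {
    // Objects collate after numbers and strings, so [url, {}] sorts after
    // every [url, created_at] key. With descending=true that is where we start.
    startkey: JSON.stringify([normalizedUrl, {}]),
    // [url] sorts before every [url, created_at] key, so the scan stops there.
    endkey: JSON.stringify([normalizedUrl]),
    descending: "true",
    // "count" is the parameter named in this thread; newer releases call it "limit".
    count: "1"
  };
  return Object.keys(params).map(function (name) {
    return name + "=" + encodeURIComponent(params[name]);
  }).join("&");
}
```

Append the result after a "?" to the URL of your view; the exact view path depends on your CouchDB version and design document name.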

Alternatively, if you wanted to suppress all old revisions in the view
you could add a simpler reduce function which takes advantage of the
map sorting the results for you:

    function(keys, values) {
      return values[0]; // or maybe values.pop();
    }

By the way, you're using design documents and not _temp_views, right?
View indexing may currently be slow, but with a design doc you only
have to build the index once.

You'll have to test and see whether it's better to emit the doc in the
view code or do a second trip to the DB to retrieve it. Both are
valid, and I think there's a patch in the works to add an
"include_docs" or similar parameter so that you can optionally
retrieve the associated document for any row of any view. Reduce
generally works best with small amounts of data.

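If you do go the "second trip" route, the view can stay small. The following is only a sketch of that variant, not something from the code above: the names mapLatest and reduceLatest are made up (in a design doc the functions would be anonymous), and it assumes created_at values compare correctly with ">". The reduce returns the same shape it receives, so it behaves the same on the rereduce pass and does not depend on the order of the values.

```javascript
// Map: emit only the id and timestamp, never the 200K content.
// emit() is supplied by the CouchDB view server.
function mapLatest(doc) {
  if (doc.class == "RemoteUrl" && doc.content) {
    emit(doc.normalized_url, { id: doc._id, created_at: doc.created_at });
  }
}

// Reduce: keep the entry with the largest created_at.
// Input and output have the same shape, so rereduce needs no special case.
function reduceLatest(keys, values, rereduce) {
  var newest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].created_at > newest.created_at) {
      newest = values[i];
    }
  }
  return newest;
}
```

Queried with grouping by key (group=true, as far as I know), this should give one row per normalized URL, and the id in the row's value tells you which document to fetch.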
> Question 2: I'm using RelaxDB right now, and it only really wants to
> work for one database per environment. Seeing how slow couchdb is at
> processing these documents, I was thinking that I'd want to keep the
> bulk-data stuff in its own database so that the other views won't
> need to process over the whole data set. The views really work by
> document "type", so there's no need to pump the huge amount of data
> from the erlang process to the javascript process when all it's
> doing is seeing that doc.class != "RemoteUrl". (Which is why I'd
> want to have types, but no matter, I guess we can hack them on the
> side like this!) I'm guessing that's why it's falling down. So if
> I could split out the "web-cache" database from the "parsed" database I
> think it would be a little faster. I'm wondering what people think
> about this sort of design decision, and how they would suggest
> implementing it.

If you don't need to analyze the bulk-data in any view you could
consider storing it as an attachment to a doc. Details are at the
bottom of this page:

    http://wiki.apache.org/couchdb/HttpDocumentApi

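As a rough illustration of the attachment idea, here is a sketch of a document that carries the fetched page as an inline attachment instead of a content field. It assumes Node.js for the Buffer call and the inline _attachments layout (content_type plus base64 data) as I understand it from that wiki page; the function name, the attachment name "original", and the field values are made up.

```javascript
// Hypothetical helper: build a RemoteUrl document whose page body lives in
// an inline attachment rather than in a regular field.
function remoteUrlDoc(normalizedUrl, createdAt, contentType, body) {
  return {
    class: "RemoteUrl",
    normalized_url: normalizedUrl,
    created_at: createdAt,
    _attachments: {
      // "original" is an arbitrary attachment name chosen for this sketch.
      original: {
        content_type: contentType,
        // Inline attachment data is base64 encoded.
        data: Buffer.from(body, "utf8").toString("base64")
      }
    }
  };
}
```

Note that a map function testing doc.content would need some other marker once the content no longer lives in a field.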
> Question 3: Say I eventually get something which has 2 floating
> point attributes. (e.g. lat and long.) How would I get all of the
> documents where those were in a square? i.e.
> select * from places where places.lat >= left_lat and places.lat <=
> right_lat and places.longitude >= top_longitude and places.longitude <=
> bottom_longitude;
>
> I can see how you'd do this with one dimension, but I'm not sure
> how you'd do it with the second. Especially since you need to make
> all these data round trips...

Yeah, I guess that's a bit tricky. If the data volume doesn't get in
your way you could emit [places.lat, places.long] as the key, query
the view with your latitude range as startkey and endkey, and then
pick out documents in your longitude range client-side. Others may
well have more clever suggestions.

Best,
Adam

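For what it's worth, the bounding-box idea above could be sketched roughly like this. All three function names are made up, the field names lat and long are taken from the question, and the row shape ({id, key, value}) is what I'd expect a view query to return.

```javascript
// Map: key is [lat, long], so rows sort by latitude first.
// emit() is supplied by the CouchDB view server.
function mapPlace(doc) {
  if (typeof doc.lat === "number" && typeof doc.long === "number") {
    emit([doc.lat, doc.long], null);
  }
}

// Query string selecting every row whose latitude is in [minLat, maxLat].
// [minLat] sorts before any [minLat, x]; [maxLat, {}] sorts after any [maxLat, x].
function latitudeBandQuery(minLat, maxLat) {
  return "startkey=" + encodeURIComponent(JSON.stringify([minLat])) +
         "&endkey=" + encodeURIComponent(JSON.stringify([maxLat, {}]));
}

// Client-side step: keep only rows whose longitude falls in the box.
function withinLongitude(rows, minLong, maxLong) {
  return rows.filter(function (row) {
    var lng = row.key[1];
    return lng >= minLong && lng <= maxLong;
  });
}
```

The cost is that every row in the latitude band crosses the wire, which is why this only works if the data volume stays manageable.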
> Thanks in advance. I think there are a lot of very interesting
> ideas in Couchdb. It seems like a lot of this stuff can't be done
> nearly as well as with a sql database, but I'm hoping that it's just
> me being ignorant.
>
> -w
> http://sublimeguile.com
>
>