incubator-couchdb-user mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: Is it possible to produce counts with a reduce function, then order by those counts?
Date Wed, 02 Dec 2009 18:25:36 GMT
On Wed, Dec 2, 2009 at 9:18 AM, Simon Willison
<simon.willison@guardian.co.uk> wrote:
> Hello,
>
> I've just started learning CouchDB, so apologies if this is covered in
> an FAQ (I've looked around a bit and haven't found it though).
>
> I'd like to write a view which counts the number of occurrences of a
> value across my whole document set, then returns those occurrences
> ordered by their frequency.
>
> Essentially this:
>
> http://barkingiguana.com/2009/01/28/counting-tags-with-couchdb-and-map-reduce
>
> But with an "order by count" at the end.
>
> Is this possible, or am I asking the wrong kind of question?
>

The challenge here is that CouchDB's indexes are sorted only by the
key emitted from the map function, never by the reduced value. To do
what you are requesting you have three main options:

1) Sort the rows by value in your application. This is the simplest
option until you have so many distinct rows that you can't fit them
all in memory.

2) Pipe the group-reduce query into a process that saves each row as a
document in another CouchDB database. Then use a map view to sort
those documents by the group value. This is the best option if you
have lots and lots of rows in the group-reduce output. It's probably
the closest to Hadoop/Google-style chained map reduce that you'll see
with CouchDB. Of course the derived index won't be incremental with
updates to the source database.

3) My favorite: You can do something like (1) but on the server in
CouchDB's JavaScript application environment. The _list function is
fed each row of a view in turn, and can do whatever it likes. In your
case you could accumulate the rows there and sort by value. This has
the same memory limits as (1), but since it's already set up to stream
rows, and since it already runs on the server, it's a little cleaner
and faster than what most application servers would do. (3) is ideal
if what you really want is the top N tags.
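
For option (1), a minimal sketch of the client-side sort might look
like the following. It assumes the rows have already been fetched from
a group-reduce query (group=true) and parsed from JSON; the sample
data and the helper name topByValue are made up for illustration.

```javascript
// Rows shaped like the "rows" array of a group-reduce view response.
// The tags and counts here are invented sample data.
const rows = [
  { key: "couchdb", value: 12 },
  { key: "erlang", value: 3 },
  { key: "javascript", value: 7 },
];

// Hypothetical helper: return the n rows with the largest counts,
// highest first. The array is copied so the original response is
// left untouched.
function topByValue(rows, n) {
  return rows
    .slice()
    .sort((a, b) => b.value - a.value)
    .slice(0, n);
}

console.log(topByValue(rows, 2));
```

Everything is held in memory at once, which is exactly the limit
described in (1).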

Regardless of which you choose, you'll want to cache the output somehow.

We've had discussions about having better support for sort-by-value.
It'd be nice to have built-in support for (2) so that you can
trigger/query it from a browser instead of needing your own small
program to do the transfer.
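
The small transfer program for (2) could be sketched roughly like
this. Only the data shaping is shown; the HTTP calls are left out.
The field names "tag" and "count", the "tag:" _id prefix, and the
function names are all assumptions chosen for illustration. The body
is meant for a POST to the derived database's _bulk_docs endpoint.

```javascript
// Hypothetical helper: turn group-reduce rows into documents for a
// second, derived database. Assumes each row key is a string.
function rowsToDocs(rows) {
  return rows.map((row) => ({
    _id: "tag:" + row.key, // illustrative id scheme, one doc per tag
    tag: row.key,
    count: row.value,
  }));
}

// JSON body for POST /<derived_db>/_bulk_docs
function bulkDocsBody(rows) {
  return JSON.stringify({ docs: rowsToDocs(rows) });
}

// Map function for a view in the derived database. Keying on the
// count lets CouchDB do the sorting; emit() is supplied by CouchDB's
// view server, so this function is only defined here, not called.
const mapByCount = function (doc) {
  emit(doc.count, doc.tag);
};
```

Querying that view with descending=true would then return the most
frequent tags first. Because the ids are fixed, re-running the
transfer against the same database would need the current _rev of
each document, or a fresh database each time.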

Most of the documentation for _list assumes you'll be using it to
output HTML, but it should be clear enough how you could use it for
sort-by-value with JSON output. This tweet might be a good start as
well: http://twitter.com/ianschenck/status/6257521024

Some list docs: http://books.couchdb.org/relax/example-app/view-recent-posts
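
As a rough sketch of (3), a _list function that buffers the rows,
sorts them by value and sends JSON might look like this. getRow(),
send() and start() are provided by CouchDB when the function runs on
the server. The query parameter "top" and the variable name
sortByValueList are hypothetical; in a design document the function
source would be stored as a string under the "lists" field.

```javascript
// Sketch of a list function for sort-by-value. It is assigned to a
// variable here only so that the snippet parses on its own.
const sortByValueList = function (head, req) {
  start({ headers: { "Content-Type": "application/json" } });

  // Accumulate every row of the view; same memory limits as (1).
  var row, rows = [];
  while ((row = getRow())) {
    rows.push(row);
  }

  // Highest count first.
  rows.sort(function (a, b) {
    return b.value - a.value;
  });

  // "top" is an assumed parameter name; falls back to 10 rows.
  var top = parseInt(req.query.top, 10) || 10;
  send(JSON.stringify({ rows: rows.slice(0, top) }));
};
```

It would be queried through the design document's _list path for the
view, with group=true so that the list sees one row per tag.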

Hope that helps,
Chris

> Thanks,
>
> Simon Willison



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
