incubator-couchdb-user mailing list archives

From Andrey Kuprianov <andrey.koupria...@gmail.com>
Subject Re: Distinct values with range
Date Tue, 16 Apr 2013 11:50:00 GMT
We've solved the problem using Jim's approach, but at a small cost: we had
to round dates to the beginning of each month (not day, as he
suggested). So when we ran a reduce with grouping, the output of the view
shrank to a much smaller number of rows, which we then fed into a list
function, which in turn collated the countries and returned them.

Basically we are now limited to making queries on a per-month basis, but
that's fine in our case. As for benchmarks, this way of doing it proved to
be very fast. Thanks everyone!
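To make that concrete, here is a minimal, self-contained sketch of the kind of view in question; the emit() harness and the sample documents are illustrative assumptions, not our production code:

```javascript
// Sketch of a month-floored map function, with a stub emit() so the
// example runs outside CouchDB (where emit is provided by the server).
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: key on [year, month, country] so that a grouped reduce collapses
// each country to at most one row per month.
function map(doc) {
  var d = new Date(doc.timestamp * 1000); // doc.timestamp is in seconds
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, doc.country_name], 1);
}

// Two sample transactions in the same month and country.
map({ timestamp: 1332806400, country_name: "Australia" }); // 2012-03-27
map({ timestamp: 1332892800, country_name: "Australia" }); // 2012-03-28

// Simulate reduce=_count with group=true: identical keys collapse to a
// single row carrying a count. A _list function can then collect the
// distinct country names (the last key element) across the months hit.
var counts = {};
rows.forEach(function (r) {
  var k = JSON.stringify(r.key);
  counts[k] = (counts[k] || 0) + 1;
});
// counts is { '[2012,3,"Australia"]': 2 }
```

In CouchDB itself the map function above would sit in a design document with the reduce set to _count, and the month window would be selected with startkey/endkey on the [year, month] key prefix.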



On Tue, Apr 16, 2013 at 7:23 PM, muji <freeformsystems@gmail.com> wrote:

> That depends upon your requirements and data. If the requirement is to find
> data across *any* date range then it will potentially be slow; however, if
> you only ever need to query with a maximum date range of, say, a
> year (i.e., your date ranges do not span more than a year), then you could use
> filtered replication to create databases containing only the entries for a
> specific year.
>
> Still not sure if that helps you.
>
> I am working on the same problem(s) with an analytics application that uses
> couchdb, and luckily for me the client reports for any date range only need
> to be generated once. Realtime analysis is for *now* (today); otherwise, run
> a separate process to generate reports for the given date range and then
> subsequently return the (cached) generated report.
>
> I am not sure I completely understand your use case, but you may want to
> consider caching result sets for dates in the past, maybe in redis? After
> all, once the date has expired the data is fixed, right? Or not?
>
>
> On 16 April 2013 11:50, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
>
> > Muji, what happens if you have several hundred transactions per day in a
> > variety of different countries over several years? Then your view
> > processing is going to be very slow. We are looking for a near real-time
> > solution.
> >
> >
> > On Tue, Apr 16, 2013 at 5:42 PM, muji <freeformsystems@gmail.com> wrote:
> >
> > > I believe you need to query with startkey and endkey as complex keys
> > > (assuming YYYY-MM-DD):
> > >
> > > startkey=[startyear,startmonth,startday]
> > > endkey=[endyear,endmonth,endday,{}]
> > >
> > > Then you can extract the countries from the key returned with each row
> > > (it will be the last element in the array). You will also need to set
> > > the group view parameter (group_level=4?) for distinct values.
> > >
> > > Then you should not need to write a custom reduce function.
> > >
> > > The startkey and endkey must be proper JSON (and URL) encoded values.
> > >
> > > My understanding is that this is the correct approach.
> > >
> > > Cheers!
> > >
> > >
> > > On 16 April 2013 05:46, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
> > >
> > > > Nope, I need distinct values over a period of time. Not per day.
> > > >
> > > >
> > > > On Tue, Apr 16, 2013 at 11:30 AM, Keith Gable <ziggy@ignition-project.com> wrote:
> > > >
> > > > > It gives you distinct countries per day. Is that not what you want?
> > > > > With reduce, it should be really fast once the view is built.
> > > > > On Apr 15, 2013 9:05 PM, "Andrey Kuprianov" <andrey.kouprianov@gmail.com> wrote:
> > > > >
> > > > > > @Keith, your method will not give me distinct countries, and even
> > > > > > with reduce, and after being fed to a list function, it's still slow.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 16, 2013 at 2:27 AM, Wendall Cada <wendallc@apache.org> wrote:
> > > > > >
> > > > > > > I agree with this approach. I do something similar using _sum:
> > > > > > >
> > > > > > > emit([doc.country_name, toDay(doc.timestamp)], 1);
> > > > > > >
> > > > > > > The toDay() method is basically a floor of the day value. Since I
> > > > > > > don't store ts in UTC (because of an idiotic error some years back),
> > > > > > > I also do a tz offset to correct the day value in my toDay() method.
> > > > > > >
> > > > > > > Using reduce is by far the fastest method for this. I don't see any
> > > > > > > issue with getting this to scale.
> > > > > > >
> > > > > > > Overall, I think I rather prefer the method Keith shows, as it
> > > > > > > would depend on the values returned in the date object versus other
> > > > > > > possibly inaccurate means using math.
> > > > > > >
> > > > > > > Wendall
> > > > > > >
> > > > > > > On 04/15/2013 07:18 AM, Keith Gable wrote:
> > > > > > >
> > > > > > >> Output keys like so:
> > > > > > >>
> > > > > > >> [2010, 7, 10, "Australia"]
> > > > > > >>
> > > > > > >> Reduce function would be _count.
> > > > > > >>
> > > > > > >> startkey=[year,month,day,null]
> > > > > > >> endkey=[year,month,day,{}]
> > > > > > >>
> > > > > > >> ---
> > > > > > >> Keith Gable
> > > > > > >> A+, Network+, and Storage+ Certified Professional
> > > > > > >> Apple Certified Technical Coordinator
> > > > > > >> Mobile Application Developer / Web Developer
> > > > > > >>
> > > > > > >>
> > > > > > >> On Sun, Apr 14, 2013 at 8:37 PM, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
> > > > > > >>
> > > > > > >>> Hi guys,
> > > > > > >>>
> > > > > > >>> Just for the sake of a debate, here's the question. There are
> > > > > > >>> transactions. Among all other attributes there's a timestamp (when
> > > > > > >>> the transaction was made; in seconds) and a country name (from
> > > > > > >>> where the transaction was made). So, for instance:
> > > > > > >>>
> > > > > > >>> {
> > > > > > >>>      . . . .
> > > > > > >>>      "timestamp": 1332806400,
> > > > > > >>>      "country_name": "Australia",
> > > > > > >>>      . . . .
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> Question is: how does one get unique / distinct country names in
> > > > > > >>> between dates? For example, give me all country names in between
> > > > > > >>> 10-Jul-2010 and 21-Jan-2013.
> > > > > > >>>
> > > > > > >>> My solution was to write a custom reduce function and set
> > > > > > >>> reduce_limit=false, so that I can enumerate all countries without
> > > > > > >>> hitting the overflow exception. It works great! However, such
> > > > > > >>> solutions are frowned upon by everyone around. Has anyone a better
> > > > > > >>> idea on how to tackle this efficiently?
> > > > > > >>>
> > > > > > >>>      Andrey
> > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > mischa (aka muji).
> > >
> >
>
>
>
> --
> mischa (aka muji).
>
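For completeness, the day-level complex-key approach suggested up-thread (startkey/endkey plus group_level) can be sketched the same way; the range filter and grouping below are plain-JavaScript stand-ins for what CouchDB does server-side, and all names and sample values are illustrative:

```javascript
// Sketch of the [year, month, day, country] keying that muji and Keith
// describe, with a stub emit() so it runs outside CouchDB. The range
// check and grouping mimic startkey/endkey + group_level=4; this is an
// illustration, not CouchDB internals.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

function map(doc) {
  var d = new Date(doc.timestamp * 1000); // timestamp in seconds
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(),
        doc.country_name], 1);
}

map({ timestamp: 1332806400, country_name: "Australia" }); // 2012-03-27
map({ timestamp: 1332806400, country_name: "Vietnam" });   // 2012-03-27
map({ timestamp: 1363651200, country_name: "Australia" }); // 2013-03-19

// Stand-in for startkey=[2012,1,1,null] / endkey=[2012,12,31,{}]: keep
// only 2012 rows (CouchDB compares array keys element by element).
function inRange(key) { return key[0] === 2012; }

// group_level=4 collapses identical keys; the distinct countries fall
// out of the last key element.
var distinct = {};
rows.filter(function (r) { return inRange(r.key); })
    .forEach(function (r) { distinct[r.key[3]] = true; });
var countries = Object.keys(distinct).sort(); // ["Australia", "Vietnam"]
```

Against a real database, the equivalent request would pass startkey, endkey, and group_level as query parameters on the view URL, with both keys JSON- and URL-encoded as muji notes.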
