incubator-couchdb-user mailing list archives

From muji <freeformsyst...@gmail.com>
Subject Re: Distinct values with range
Date Tue, 16 Apr 2013 11:23:22 GMT
That depends upon your requirements and data. If the requirement is to find
data across *any* date range then it will potentially be slow. However, if
you only ever need to query with a maximum date range of, say, a year (i.e.
your date ranges never span more than a year), then you could use filtered
replication to create databases containing only the entries for a specific
year.
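
For example, something roughly like this (the database names, design doc
and year bounds are all made up):

A filter function in a design doc on the source database:

{
  "_id": "_design/txns",
  "filters": {
    "year_2012": "function (doc, req) { return doc.timestamp >= 1325376000 && doc.timestamp < 1356998400; }"
  }
}

Then trigger the replication:

POST /_replicate
{
  "source": "transactions",
  "target": "transactions_2012",
  "create_target": true,
  "filter": "txns/year_2012"
}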

Still not sure if that helps you.

I am working on the same problem(s) with an analytics application that uses
CouchDB, and luckily for me the client reports for any date range only need
to be generated once. Real-time analysis is only for *now* (today);
otherwise a separate process generates the report for the given date range,
and subsequent requests return the (cached) generated report.

I am not sure I completely understand your use case, but you may want to
consider caching result sets for dates in the past, maybe in Redis? After
all, once the date has expired, the data is fixed, right? Or not?
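
Something along these lines, perhaps (a hypothetical Node.js sketch; the
view path, cache key scheme and names are all assumptions):

// Fetch a view result for a past date range, caching it forever in Redis.
const { createClient } = require('redis'); // node-redis

async function cachedCountries(start, end) {
  const redis = createClient();
  await redis.connect();
  const key = 'countries:' + JSON.stringify([start, end]);
  let body = await redis.get(key);
  if (body === null) {
    const url = 'http://localhost:5984/txns/_design/reports/_view/by_date'
      + '?group_level=4'
      + '&startkey=' + encodeURIComponent(JSON.stringify(start))
      + '&endkey=' + encodeURIComponent(JSON.stringify(end.concat({})));
    body = await (await fetch(url)).text();
    await redis.set(key, body); // a past range never changes, so no expiry
  }
  await redis.quit();
  return JSON.parse(body);
}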


On 16 April 2013 11:50, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:

> Muji, what happens if you have several hundred transactions per day in a
> variety of different countries over several years? Then your view
> processing is going to be very slow. We are looking for a near real-time
> solution.
>
>
> On Tue, Apr 16, 2013 at 5:42 PM, muji <freeformsystems@gmail.com> wrote:
>
> > I believe you need to query with startkey and endkey as complex keys
> > (assuming YYYY-MM-DD):
> >
> > startkey=[startyear,startmonth,startday]
> > endkey=[endyear,endmonth,endday,{}]
> >
> > Then you can extract the countries from the key returned with each row (it
> > will be the last element in the array). You will also need to set the group
> > view parameter (group_level=4?) for distinct values.
> >
> > Then you should not need to write a custom reduce function.
> >
> > The startkey and endkey must be proper JSON (and URL) encoded values.
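> >
> > For instance, the full request might look something like this (the
> > database, design doc and view names are made up):
> >
> > GET /txns/_design/reports/_view/by_date?group_level=4&startkey=%5B2010%2C7%2C10%5D&endkey=%5B2013%2C1%2C21%2C%7B%7D%5D
> >
> > which is startkey=[2010,7,10] and endkey=[2013,1,21,{}] once decoded.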
> >
> > My understanding is that this is the correct approach.
> >
> > Cheers!
> >
> >
> > On 16 April 2013 05:46, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
> >
> > > Nope, I need distinct values over a period of time. Not per day.
> > >
> > >
> > > On Tue, Apr 16, 2013 at 11:30 AM, Keith Gable <ziggy@ignition-project.com> wrote:
> > >
> > > > It gives you distinct countries per day. Is that not what you want? With
> > > > reduce, it should be really fast once the view is built.
> > > > On Apr 15, 2013 9:05 PM, "Andrey Kuprianov" <andrey.kouprianov@gmail.com>
> > > > wrote:
> > > >
> > > > > @Keith your method will not give me distinct countries, and even with
> > > > > reduce and after being fed to a list function it's still slow.
> > > > >
> > > > >
> > > > > On Tue, Apr 16, 2013 at 2:27 AM, Wendall Cada <wendallc@apache.org> wrote:
> > > > >
> > > > > > I agree with this approach. I do something similar using _sum:
> > > > > >
> > > > > > emit([doc.country_name, toDay(doc.timestamp)], 1);
> > > > > >
> > > > > > The toDay() method is basically a floor of the day value. Since I
> > > > > > don't store ts in UTC (because of an idiotic error some years back),
> > > > > > I also do a tz offset to correct the day value in my toDay() method.
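> > > > > >
> > > > > > Roughly this shape (simplified; the offset value here is just
> > > > > > illustrative):
> > > > > >
> > > > > > function toDay(ts) {
> > > > > >   var tzOffset = -8 * 3600; // hypothetical fixed offset, in seconds
> > > > > >   // floor the corrected timestamp to midnight of its day
> > > > > >   return Math.floor((ts + tzOffset) / 86400) * 86400;
> > > > > > }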
> > > > > >
> > > > > > Using reduce is by far the fastest method for this. I don't see any
> > > > > > issue with getting this to scale.
> > > > > >
> > > > > > Overall, I think I rather prefer the method Keith shows, as it would
> > > > > > depend on the values returned in the date object versus other possibly
> > > > > > inaccurate means using math.
> > > > > >
> > > > > > Wendall
> > > > > >
> > > > > >
> > > > > > On 04/15/2013 07:18 AM, Keith Gable wrote:
> > > > > >
> > > > > >> Output keys like so:
> > > > > >>
> > > > > >> [2010, 7, 10, "Australia"]
> > > > > >>
> > > > > >> Reduce function would be _count.
> > > > > >>
> > > > > >> startkey=[year,month,day,null]
> > > > > >> endkey=[year,month,day,{}]
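> > > > > >>
> > > > > >> A map function for that might look roughly like this (assuming the
> > > > > >> timestamp is seconds since the epoch, read as UTC):
> > > > > >>
> > > > > >> function (doc) {
> > > > > >>   var d = new Date(doc.timestamp * 1000);
> > > > > >>   // key: [year, month, day, country], e.g. [2010, 7, 10, "Australia"]
> > > > > >>   emit([d.getUTCFullYear(), d.getUTCMonth() + 1,
> > > > > >>         d.getUTCDate(), doc.country_name], 1);
> > > > > >> }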
> > > > > >>
> > > > > >> ---
> > > > > >> Keith Gable
> > > > > >> A+, Network+, and Storage+ Certified Professional
> > > > > >> Apple Certified Technical Coordinator
> > > > > >> Mobile Application Developer / Web Developer
> > > > > >>
> > > > > >>
> > > > > >> On Sun, Apr 14, 2013 at 8:37 PM, Andrey Kuprianov <
> > > > > >> andrey.kouprianov@gmail.com> wrote:
> > > > > >>
> > > > > >>> Hi guys,
> > > > > >>>
> > > > > >>> Just for the sake of a debate, here's the question. There are
> > > > > >>> transactions. Among all other attributes there's a timestamp (when
> > > > > >>> the transaction was made; in seconds) and a country name (from where
> > > > > >>> the transaction was made). So, for instance,
> > > > > >>>
> > > > > >>> {
> > > > > >>>      . . . .
> > > > > >>>      "timestamp": 1332806400
> > > > > >>>      "country_name": "Australia",
> > > > > >>>      . . . .
> > > > > >>> }
> > > > > >>>
> > > > > >>> Question is: how does one get unique / distinct country names in
> > > > > >>> between dates? For example, give me all country names in between
> > > > > >>> 10-Jul-2010 and 21-Jan-2013.
> > > > > >>>
> > > > > >>> My solution was to write a custom reduce function and set
> > > > > >>> reduce_limit=false, so that I can enumerate all countries without
> > > > > >>> hitting the overflow exception. It works great! However, such
> > > > > >>> solutions are frowned upon by everyone around. Has anyone a better
> > > > > >>> idea on how to tackle this efficiently?
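> > > > > >>>
> > > > > >>> (For reference, my reduce is along these lines; a sketch, assuming
> > > > > >>> the map emits the country name as the value:)
> > > > > >>>
> > > > > >>> function (keys, values, rereduce) {
> > > > > >>>   var seen = {};
> > > > > >>>   values.forEach(function (v) {
> > > > > >>>     if (rereduce) {
> > > > > >>>       // v is an array of countries from an earlier reduce pass
> > > > > >>>       v.forEach(function (c) { seen[c] = true; });
> > > > > >>>     } else {
> > > > > >>>       seen[v] = true; // v is a single emitted country name
> > > > > >>>     }
> > > > > >>>   });
> > > > > >>>   return Object.keys(seen);
> > > > > >>> }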
> > > > > >>>
> > > > > >>>      Andrey
> > > > > >>>
> > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > mischa (aka muji).
> >
>



-- 
mischa (aka muji).
