incubator-couchdb-user mailing list archives

From Andrey Kuprianov <andrey.koupria...@gmail.com>
Subject Re: Distinct values with range
Date Mon, 15 Apr 2013 11:01:49 GMT
Lists won't get faster once that view hits the 1-million mark either.
However, the reduce output will not grow large, as the number of distinct
countries is finite and relatively small.
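
For example, a bounded reduce along these lines (a rough sketch,
untested; it assumes the map emits the country name as the row value):

function (keys, values, rereduce) {
  // Accumulate the set of distinct country names. The output stays
  // small because the number of countries is bounded.
  var seen = {}, out = [], i, j, c;
  for (i = 0; i < values.length; i++) {
    if (rereduce) {
      // values[i] is an array of names from a previous reduce pass
      for (j = 0; j < values[i].length; j++) {
        seen[values[i][j]] = true;
      }
    } else {
      seen[values[i]] = true;
    }
  }
  for (c in seen) {
    out.push(c);
  }
  return out;
}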


On Mon, Apr 15, 2013 at 5:10 PM, Robert Newson <rnewson@apache.org> wrote:

> Bounded accumulation in reduce functions is often feasible. The reason
> we discourage custom reduces is to avoid degenerate cases like "return
> values" or a function that combines all the items in the values array
> into a single object. The return values of those functions continue to
> grow as the database grows. If your database stays small then you might
> well avoid the problem entirely. The reduce_limit feature is designed
> to catch these mistakes early, before you have a multi-million document
> database that fails.
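>
> For instance, the degenerate case is literally a reduce like this,
> whose output is as large as its input and so keeps growing with the
> database:
>
> function (keys, values, rereduce) {
>   return values; // output grows linearly with the number of rows
> }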
>
> A list function will be slower than calling the view directly, as every
> row passes through the view server (converting from native to JSON,
> too). In your case, your view was already fast (at least, when you only
> have 65k documents), so I'm not too surprised that the list approach
> was slower. The question is whether that remains true at a million
> documents.
>
> B.
>
>
> On 15 April 2013 09:29, Andrey Kuprianov <andrey.kouprianov@gmail.com>
> wrote:
> > I feel a little bit deceived here. I was led to believe that
> > accumulation of data in reduces would drastically slow things down,
> > but now I am having second thoughts.
> >
> > I've tried Jim's approach with lists and ran it against my old
> > approach, where I was using reduce without the limit (over 65k
> > documents were used in the test). The reduce seems to run 20 times
> > faster! I feel like lists are actually slowing things down, not
> > custom reduces.
> >
> > Can anyone give me a good explanation for this?
> >
> > Just FYI, I am using CouchDB 1.2.0.
> >
> >
> > On Mon, Apr 15, 2013 at 2:52 PM, Andrey Kuprianov <
> > andrey.kouprianov@gmail.com> wrote:
> >
> >> Btw, is the reduce function that you mentioned supposed to basically
> >> output de-duplicated keys?
> >>
> >>
> >> On Mon, Apr 15, 2013 at 1:10 PM, Andrey Kuprianov <
> >> andrey.kouprianov@gmail.com> wrote:
> >>
> >>> Thanks. I'll try the lists. Completely forgot about them, actually.
> >>>
> >>>
> >>>
> >>> On Mon, Apr 15, 2013 at 12:59 PM, Jim Klo <jim.klo@sri.com> wrote:
> >>>
> >>>> Not sure if it's ideal, but if you need dates in epoch millis, you
> >>>> could round the timestamp to the floor of the current day (say,
> >>>> midnight) in a map function and use a built-in reduce... Then use a
> >>>> list function to filter unique countries.
> >>>>
> >>>> If you don't need a real timestamp value, use an integer like
> >>>> YYYYMMDD (e.g. 20130710 for 2013-Jul-10).
> >>>>
> >>>> Reduce = true will combine by day, making at most (196 countries x
> >>>> number of days in range) rows to filter in the list function.
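> >>>>
> >>>> Roughly like this (just a sketch, untested; it assumes the
> >>>> timestamp is in seconds and uses the built-in _count reduce):
> >>>>
> >>>> // map: floor the timestamp to midnight UTC, key by day + country
> >>>> function (doc) {
> >>>>   if (doc.timestamp && doc.country_name) {
> >>>>     var day = doc.timestamp - (doc.timestamp % 86400);
> >>>>     emit([day, doc.country_name], 1);
> >>>>   }
> >>>> }
> >>>>
> >>>> // list: walk the grouped rows, keeping each country once
> >>>> function (head, req) {
> >>>>   var seen = {}, out = [], row, country;
> >>>>   start({ headers: { "Content-Type": "application/json" } });
> >>>>   while ((row = getRow())) {
> >>>>     country = row.key[1];
> >>>>     if (!seen[country]) {
> >>>>       seen[country] = true;
> >>>>       out.push(country);
> >>>>     }
> >>>>   }
> >>>>   send(JSON.stringify(out));
> >>>> }
> >>>>
> >>>> Query it with group=true and startkey/endkey on the day range.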
> >>>>
> >>>> - JK
> >>>>
> >>>>
> >>>>
> >>>> Sent from my iPad
> >>>>
> >>>> On Apr 14, 2013, at 6:38 PM, "Andrey Kuprianov" <
> >>>> andrey.kouprianov@gmail.com> wrote:
> >>>>
> >>>> > Hi guys,
> >>>> >
> >>>> > Just for the sake of a debate, here's the question. There are
> >>>> > transactions. Among their other attributes, each transaction has
> >>>> > a timestamp (when the transaction was made; in seconds) and a
> >>>> > country name (from where the transaction was made). So, for
> >>>> > instance,
> >>>> >
> >>>> > {
> >>>> >    . . . .
> >>>> >    "timestamp": 1332806400
> >>>> >    "country_name": "Australia",
> >>>> >    . . . .
> >>>> > }
> >>>> >
> >>>> > The question is: how does one get the unique / distinct country
> >>>> > names between two dates? For example, give me all country names
> >>>> > between 10-Jul-2010 and 21-Jan-2013.
> >>>> >
> >>>> > My solution was to write a custom reduce function and set
> >>>> > reduce_limit=false, so that I can enumerate all countries without
> >>>> > hitting the overflow exception. It works great! However, such
> >>>> > solutions are frowned upon by everyone around. Does anyone have a
> >>>> > better idea on how to tackle this efficiently?
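> >>>> >
> >>>> > (For reference, this is the local.ini setting I mean:
> >>>> >
> >>>> > [query_server_config]
> >>>> > reduce_limit = false
> >>>> > )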
> >>>> >
> >>>> >    Andrey
> >>>>
> >>>
> >>>
> >>
>
