incubator-couchdb-user mailing list archives

From Robert Newson <rnew...@apache.org>
Subject Re: Distinct values with range
Date Mon, 15 Apr 2013 09:10:44 GMT
Bounded accumulation in reduce functions is often feasible. The reason
we discourage custom reduces is to avoid degenerate cases like "return
values" or a function that combines all the items in the values array
into a single object. The return values of those functions continue to
grow as the database grows. If your database stays small then you might
well avoid the problem entirely. The reduce_limit feature is designed to
catch these mistakes early, before you have a multi-million document
database that fails.
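
(Editorial sketch, not part of Robert's message: the "return values"
degenerate case he mentions would look roughly like the following. Its
output grows linearly with the number of rows, which is exactly the
pattern the reduce_limit check is meant to catch.)

    // Discouraged: the reduce output is as large as its input, so the
    // intermediate results keep growing as the database grows.
    function (keys, values, rereduce) {
      return values;
    }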

A list function will be slower than calling the view directly, as every
row has to pass through the view server (with a conversion from native
terms to JSON on the way). In your case the view was already fast (at
least with only 65k documents), so I'm not too surprised that the list
approach was slower. The question is whether that remains true at a
million documents.
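
(Editorial sketch, not the list function from the thread: a de-duplicating
list might look roughly like this. Note that every row in the requested
range is still serialized and streamed through the view server, which is
the overhead described above.)

    // _list function: stream the view rows and emit each country once.
    // Assumes the underlying view emits the country name as the row value
    // and is queried with reduce=false.
    function (head, req) {
      var row, seen = {}, out = [];
      while ((row = getRow())) {
        if (!seen[row.value]) {
          seen[row.value] = true;
          out.push(row.value);
        }
      }
      send(JSON.stringify(out));
    }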

B.


On 15 April 2013 09:29, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
> I feel a little bit deceived here. I was led to believe that accumulating
> data in reduces would drastically slow things down, but now I am having
> second thoughts.
>
> I tried Jim's approach with lists and ran it against my old approach,
> where I was using a custom reduce with reduce_limit disabled (over 65k
> documents were used in the test). The reduce seems to run 20 times
> faster! It feels like the lists are actually what slows things down, not
> the custom reduces.
>
> Can anyone give me some good explanation regarding this?
>
> Just FYI, I am using CouchDB 1.2.0.
>
>
> On Mon, Apr 15, 2013 at 2:52 PM, Andrey Kuprianov <
> andrey.kouprianov@gmail.com> wrote:
>
>> Btw, is the reduce function that you mentioned supposed to basically
>> output de-duplicated keys?
>>
>>
>> On Mon, Apr 15, 2013 at 1:10 PM, Andrey Kuprianov <
>> andrey.kouprianov@gmail.com> wrote:
>>
>>> Thanks. I'll try the lists. Completely forgot about them actually
>>>
>>>
>>>
>>> On Mon, Apr 15, 2013 at 12:59 PM, Jim Klo <jim.klo@sri.com> wrote:
>>>
>>>> Not sure if it's ideal, but if you need dates in epoch millis, you could
>>>> round the timestamp down to the floor of the current day (say midnight)
>>>> in a map function and use a built-in reduce... Then use a list function
>>>> to filter unique countries.
>>>>
>>>> If you don't need a real timestamp value, use an integer like YYYYMMDD
>>>> (e.g. 20130710 for 2013-Jul-10).
>>>>
>>>> Reduce=true will combine rows by day, leaving at most (196 countries x
>>>> number of days in range) rows to filter in the list function.
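
(Editorial sketch of the approach Jim describes, not code from the thread;
the field names follow the sample document quoted below and the timestamp
is assumed to be in seconds. The map rounds each timestamp down to midnight
and keys on [day, country]; the built-in _count reduce then collapses the
rows to at most one per country per day.)

    // map: key on [day, country] so a date range can be selected with
    // startkey/endkey and grouped by the built-in reduce.
    function (doc) {
      if (doc.timestamp && doc.country_name) {
        var day = doc.timestamp - (doc.timestamp % 86400); // floor to 00:00 UTC
        emit([day, doc.country_name], null);
      }
    }

    // reduce: the built-in _count, no custom JavaScript needed.
    _count

Queried with group=true, startkey=[first_day] and endkey=[last_day, {}],
this returns at most (countries x days in range) rows, which a list
function like the sketch further up can then whittle down to distinct
country names.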
>>>>
>>>> - JK
>>>>
>>>>
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Apr 14, 2013, at 6:38 PM, "Andrey Kuprianov" <
>>>> andrey.kouprianov@gmail.com> wrote:
>>>>
>>>> > Hi guys,
>>>> >
>>>> > Just for the sake of a debate, here's the question. There are
>>>> > transactions. Among all the other attributes there's a timestamp (when
>>>> > the transaction was made, in seconds) and a country name (where the
>>>> > transaction was made from). So, for instance,
>>>> >
>>>> > {
>>>> >    . . . .
>>>> >    "timestamp": 1332806400,
>>>> >    "country_name": "Australia",
>>>> >    . . . .
>>>> > }
>>>> >
>>>> > The question is: how does one get the unique / distinct country names
>>>> > between two dates? For example, give me all country names between
>>>> > 10-Jul-2010 and 21-Jan-2013.
>>>> >
>>>> > My solution was to write a custom reduce function and set
>>>> > reduce_limit=false, so that I can enumerate all countries without
>>>> > hitting the overflow exception. It works great! However, such solutions
>>>> > are frowned upon by everyone around. Does anyone have a better idea of
>>>> > how to tackle this efficiently?
>>>> >
>>>> >    Andrey
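
(Editorial sketch of the kind of custom reduce Andrey describes; the actual
view is not shown in the thread, so the map and the field names are
assumptions. The result is bounded by the number of distinct countries
rather than by database size, but CouchDB's reduce_limit heuristic still
has to be disabled for it to run.)

    // map: key by timestamp so startkey/endkey can bound the date range.
    function (doc) {
      if (doc.timestamp && doc.country_name) {
        emit(doc.timestamp, doc.country_name);
      }
    }

    // reduce: accumulate the set of distinct country names seen so far.
    function (keys, values, rereduce) {
      var seen = {}, out = [], i, j;
      for (i = 0; i < values.length; i++) {
        if (rereduce) {
          // values[i] is an array of country names from a lower level
          for (j = 0; j < values[i].length; j++) {
            seen[values[i][j]] = true;
          }
        } else {
          seen[values[i]] = true;
        }
      }
      for (var name in seen) {
        out.push(name);
      }
      return out;
    }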
>>>>
>>>
>>>
>>
