couchdb-user mailing list archives

From Robert Newson <rnew...@apache.org>
Subject Re: Distinct values with range
Date Mon, 15 Apr 2013 11:15:56 GMT
The list function will be a constant factor slower than the equivalent
view call. Reading the entire view through a list function while
performing some kind of aggregation in it, however, would be a
different mistake.

B.

On 15 April 2013 12:02, Andrey Kuprianov <andrey.kouprianov@gmail.com> wrote:
> Btw, thank you for clarification.
>
>
> On Mon, Apr 15, 2013 at 7:01 PM, Andrey Kuprianov <
> andrey.kouprianov@gmail.com> wrote:
>
>> Lists won't get faster once that view hits the 1 million mark either;
>> however, the reduce value will not grow large, as the number of distinct
>> countries is finite and relatively small.
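>>
>> For reference, the kind of reduce I have in mind (just a sketch, assuming
>> the timestamp and country_name fields from the original mail). The output is
>> an array of distinct country names, so it can never grow past the number of
>> countries:
>>
>> // Map: key on the timestamp so a date range can be applied with
>> // startkey/endkey.
>> function (doc) {
>>     if (doc.timestamp && doc.country_name) {
>>         emit(doc.timestamp, doc.country_name);
>>     }
>> }
>>
>> // Reduce: collect the set of distinct countries seen so far.
>> function (keys, values, rereduce) {
>>     var seen = {}, out = [], i, j;
>>     for (i = 0; i < values.length; i++) {
>>         if (rereduce) {
>>             // values are arrays of countries from earlier reduce passes
>>             for (j = 0; j < values[i].length; j++) { seen[values[i][j]] = true; }
>>         } else {
>>             seen[values[i]] = true;
>>         }
>>     }
>>     for (var c in seen) { out.push(c); }
>>     return out;
>> }
>>
>> Querying with ?reduce=true&startkey=...&endkey=... then returns the distinct
>> countries in that range; with a couple hundred possible countries the output
>> can still trip the reduce_limit check on small batches, which is why I had to
>> set reduce_limit=false.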
>>
>>
>> On Mon, Apr 15, 2013 at 5:10 PM, Robert Newson <rnewson@apache.org> wrote:
>>
>>> Bounded accumulation in reduce functions is often feasible. The reason
>>> we discourage custom reduces is to avoid degenerate cases like "return
>>> values" or a function that combines all the items in the values array
>>> into a single object. The return values of those functions continue
>>> to grow as the database grows. If your database stays small then you
>>> might well avoid the problem entirely. The reduce_limit feature is
>>> designed to catch these mistakes early, before you have a
>>> multi-million document database that fails.
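>>>
>>> To make that concrete, this is the shape of reduce that reduce_limit exists
>>> to catch (a sketch only): its output is as large as everything fed into it,
>>> so it keeps growing as the database grows.
>>>
>>> // Degenerate: this "reduce" doesn't reduce at all, it just hands every
>>> // value straight back, so the stored reduction grows without bound.
>>> function (keys, values, rereduce) {
>>>     return values;
>>> }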
>>>
>>> A list function will be slower than calling the view directly, as every
>>> row passes through the view server (converting from native to JSON,
>>> too). In your case, your view was already fast (at least when you
>>> only have 65k documents), so I'm not too surprised that the list
>>> approach was slower. The question is whether that remains true at a
>>> million documents.
>>>
>>> B.
>>>
>>>
>>> On 15 April 2013 09:29, Andrey Kuprianov <andrey.kouprianov@gmail.com>
>>> wrote:
>>> > I feel a little bit deceived here. I was led to believe that accumulation
>>> > of data in reduces would drastically slow things down, but now I am having
>>> > second thoughts.
>>> >
>>> > I've tried Jim's approach with lists and ran it against my old approach,
>>> > where I was using reduce without limit (over 65k documents were used in the
>>> > test). The reduce seems to run 20 times faster! I feel like lists are
>>> > actually slowing things down, not custom reduces.
>>> >
>>> > Can anyone give me some good explanation regarding this?
>>> >
>>> > Just FYI, I am using CouchDB 1.2.0.
>>> >
>>> >
>>> > On Mon, Apr 15, 2013 at 2:52 PM, Andrey Kuprianov <
>>> > andrey.kouprianov@gmail.com> wrote:
>>> >
>>> >> Btw, is the reduce function that you mentioned supposed to basically
>>> >> output de-duplicated keys?
>>> >>
>>> >>
>>> >> On Mon, Apr 15, 2013 at 1:10 PM, Andrey Kuprianov <
>>> >> andrey.kouprianov@gmail.com> wrote:
>>> >>
>>> >>> Thanks. I'll try the lists. Completely forgot about them, actually.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Mon, Apr 15, 2013 at 12:59 PM, Jim Klo <jim.klo@sri.com> wrote:
>>> >>>
>>> >>>> Not sure if it's ideal, but if you need dates in epoch millis, you could
>>> >>>> round the timestamp to the floor of the current day (say midnight) in a
>>> >>>> map function, use a built-in reduce... Then use a list function to filter
>>> >>>> unique countries.
>>> >>>>
>>> >>>> If you don't need a real timestamp value, use an integer like YYYYMMDD
>>> >>>> (i.e. 20130710 for 2013-Jul-10).
>>> >>>>
>>> >>>> Reduce = true will combine by day, making at most (196 countries x
>>> >>>> number of days in range) rows to filter in the list function.
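>>> >>>>
>>> >>>> Roughly like this (just a sketch, untested, assuming the timestamp and
>>> >>>> country_name fields from your original mail, with timestamp in seconds):
>>> >>>>
>>> >>>> // Map: key is [YYYYMMDD, country]; with the built-in _count reduce
>>> >>>> // and group=true, each country collapses to one row per day.
>>> >>>> function (doc) {
>>> >>>>     if (doc.timestamp && doc.country_name) {
>>> >>>>         var d = new Date(doc.timestamp * 1000);
>>> >>>>         var day = d.getUTCFullYear() * 10000 +
>>> >>>>                   (d.getUTCMonth() + 1) * 100 +
>>> >>>>                   d.getUTCDate();
>>> >>>>         emit([day, doc.country_name], 1);
>>> >>>>     }
>>> >>>> }
>>> >>>>
>>> >>>> // List: walk the grouped rows and keep each country only once.
>>> >>>> function (head, req) {
>>> >>>>     var row, seen = {}, out = [];
>>> >>>>     while ((row = getRow())) {
>>> >>>>         var country = row.key[1];
>>> >>>>         if (!seen[country]) {
>>> >>>>             seen[country] = true;
>>> >>>>             out.push(country);
>>> >>>>         }
>>> >>>>     }
>>> >>>>     send(JSON.stringify(out));
>>> >>>> }
>>> >>>>
>>> >>>> Query the list with group=true and startkey=[20100710]&endkey=[20130121,{}]
>>> >>>> so the date range is applied before the rows reach the list function.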
>>> >>>>
>>> >>>> - JK
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Sent from my iPad
>>> >>>>
>>> >>>> On Apr 14, 2013, at 6:38 PM, "Andrey Kuprianov" <
>>> >>>> andrey.kouprianov@gmail.com> wrote:
>>> >>>>
>>> >>>> > Hi guys,
>>> >>>> >
>>> >>>> > Just for the sake of a debate, here's the question. There are
>>> >>>> > transactions. Among all other attributes there's a timestamp (when the
>>> >>>> > transaction was made; in seconds) and a country name (from where the
>>> >>>> > transaction was made). So, for instance,
>>> >>>> >
>>> >>>> > {
>>> >>>> >    . . . .
>>> >>>> >    "timestamp": 1332806400
>>> >>>> >    "country_name": "Australia",
>>> >>>> >    . . . .
>>> >>>> > }
>>> >>>> >
>>> >>>> > The question is: how does one get unique / distinct country names
>>> >>>> > between two dates? For example, give me all country names between
>>> >>>> > 10-Jul-2010 and 21-Jan-2013.
>>> >>>> >
>>> >>>> > My solution was to write a custom reduce function and set
>>> >>>> > reduce_limit=false, so that I can enumerate all countries without
>>> >>>> > hitting the overflow exception. It works great! However, such
>>> >>>> > solutions are frowned upon by everyone around. Does anyone have a
>>> >>>> > better idea on how to tackle this efficiently?
>>> >>>> >
>>> >>>> >    Andrey
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>>
>>
>>
