incubator-couchdb-user mailing list archives

From Andrey Kuprianov <andrey.koupria...@gmail.com>
Subject Re: Distinct values with range
Date Mon, 15 Apr 2013 11:02:12 GMT
Btw, thank you for the clarification.


On Mon, Apr 15, 2013 at 7:01 PM, Andrey Kuprianov <
andrey.kouprianov@gmail.com> wrote:

> Lists won't get any faster once that view hits the 1-million mark either;
> the reduce, however, will not grow large, since the number of distinct
> countries is finite and relatively small.
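>
> For the record, the kind of bounded reduce I have in mind looks roughly
> like this (just a sketch against the transaction docs from the original
> mail; untested):
>
> // map: key on timestamp so the view can be sliced by date range
> function (doc) {
>   if (doc.timestamp && doc.country_name) {
>     emit(doc.timestamp, doc.country_name);
>   }
> }
>
> // reduce: accumulate the set of distinct country names. The output is
> // bounded because there are only ~200 countries in the world.
> function (keys, values, rereduce) {
>   var seen = {};
>   values.forEach(function (v) {
>     if (rereduce) {
>       // v is an array of names returned by an earlier reduce pass
>       for (var i = 0; i < v.length; i++) seen[v[i]] = true;
>     } else {
>       seen[v] = true;
>     }
>   });
>   return Object.keys(seen);
> }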
>
>
> On Mon, Apr 15, 2013 at 5:10 PM, Robert Newson <rnewson@apache.org> wrote:
>
>> Bounded accumulation in reduce functions is often feasible. The reason
>> we discourage custom reduces is to avoid degenerate cases like "return
>> values" or a function that combines all the items in the values array
>> into a single object. The return values of those functions continue
>> to grow as the database grows. If your database stays small then you
>> might well avoid the problem entirely. The reduce_limit feature is
>> designed to catch these mistakes early, before you have a
>> multi-million-document database that fails.
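>>
>> For concreteness, the degenerate case I mean is a reduce along the
>> lines of:
>>
>> // Degenerate: the output is the input, so it grows with the database
>> // and reduce_limit will (rightly) reject it.
>> function (keys, values, rereduce) {
>>   return values;
>> }
>>
>> A bounded reduce, by contrast, returns something whose size does not
>> depend on the number of rows it has seen.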
>>
>> A list function will be slower than calling the view directly as every
>> row passes through the view server (converting from native to json
>> too). In your case, your view was already fast (at least, when you
>> only have 65k documents), so I'm not too surprised that the list
>> approach was slower. The question is whether that remains true at a
>> million documents.
>>
>> B.
>>
>>
>> On 15 April 2013 09:29, Andrey Kuprianov <andrey.kouprianov@gmail.com>
>> wrote:
>> > I feel a little bit deceived here. I was led to believe that
>> > accumulation of data in reduces would drastically slow things down,
>> > but now I am having second thoughts.
>> >
>> > I've tried Jim's approach with lists and ran it against my old approach,
>> > where I was using reduce without the limit (over 65k documents were used
>> > in the test). The reduce seems to run 20 times faster! I feel like lists
>> > are actually slowing things down, not custom reduces.
>> >
>> > Can anyone give me some good explanation regarding this?
>> >
>> > Just FYI, I am using CouchDB 1.2.0.
>> >
>> >
>> > On Mon, Apr 15, 2013 at 2:52 PM, Andrey Kuprianov <
>> > andrey.kouprianov@gmail.com> wrote:
>> >
>> >> Btw, is the reduce function that you mentioned supposed to basically
>> >> output de-duplicated keys?
>> >>
>> >>
>> >> On Mon, Apr 15, 2013 at 1:10 PM, Andrey Kuprianov <
>> >> andrey.kouprianov@gmail.com> wrote:
>> >>
>> >>> Thanks. I'll try the lists. Completely forgot about them actually
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Apr 15, 2013 at 12:59 PM, Jim Klo <jim.klo@sri.com> wrote:
>> >>>
>> >>>> Not sure if it's ideal, but if you need dates in epoch millis, you
>> >>>> could round the timestamp down to the floor of the current day (say,
>> >>>> midnight) in a map function and use a built-in reduce... Then use a
>> >>>> list function to filter unique countries.
>> >>>>
>> >>>> If you don't need a real timestamp value, use an integer like
>> >>>> YYYYMMDD (i.e. 20130710 for 2013-Jul-10).
>> >>>>
>> >>>> Reduce = true will combine by day, leaving at most (196 countries x
>> >>>> number of days in range) rows to filter in the list function.
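>> >>>>
>> >>>> Roughly like this (a sketch, untested; it uses the epoch-seconds
>> >>>> timestamp from the sample doc, and the 86400 constant assumes UTC
>> >>>> days):
>> >>>>
>> >>>> // map: round the timestamp down to midnight and pair it with the
>> >>>> // country so the built-in reduce can group per day per country
>> >>>> function (doc) {
>> >>>>   if (doc.timestamp && doc.country_name) {
>> >>>>     var day = doc.timestamp - (doc.timestamp % 86400);
>> >>>>     emit([day, doc.country_name], null);
>> >>>>   }
>> >>>> }
>> >>>>
>> >>>> // reduce: the built-in _count
>> >>>>
>> >>>> // list: query the view with group=true and keep each country once
>> >>>> function (head, req) {
>> >>>>   start({ headers: { "Content-Type": "application/json" } });
>> >>>>   var seen = {}, out = [], row;
>> >>>>   while ((row = getRow())) {
>> >>>>     var country = row.key[1];
>> >>>>     if (!seen[country]) {
>> >>>>       seen[country] = true;
>> >>>>       out.push(country);
>> >>>>     }
>> >>>>   }
>> >>>>   send(JSON.stringify(out));
>> >>>> }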
>> >>>>
>> >>>> - JK
>> >>>>
>> >>>>
>> >>>>
>> >>>> Sent from my iPad
>> >>>>
>> >>>> On Apr 14, 2013, at 6:38 PM, "Andrey Kuprianov" <
>> >>>> andrey.kouprianov@gmail.com> wrote:
>> >>>>
>> >>>> > Hi guys,
>> >>>> >
>> >>>> > Just for the sake of a debate, here's the question. There are
>> >>>> > transactions. Among all other attributes there's a timestamp (when
>> >>>> > the transaction was made; in seconds) and a country name (from
>> >>>> > where the transaction was made). So, for instance,
>> >>>> >
>> >>>> > {
>> >>>> >    . . . .
>> >>>> >    "timestamp": 1332806400
>> >>>> >    "country_name": "Australia",
>> >>>> >    . . . .
>> >>>> > }
>> >>>> >
>> >>>> > The question is: how does one get unique / distinct country names
>> >>>> > between two dates? For example, give me all country names between
>> >>>> > 10-Jul-2010 and 21-Jan-2013.
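>> >>>> >
>> >>>> > (In other words, given a view keyed on the timestamp, a query like
>> >>>> > the following, with made-up db/design doc names and those two
>> >>>> > dates as epoch seconds at UTC midnight:
>> >>>> >
>> >>>> > GET /transactions/_design/stats/_view/countries?startkey=1278720000&endkey=1358726400
>> >>>> >
>> >>>> > should return just the distinct names in that range.)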
>> >>>> >
>> >>>> > My solution was to write a custom reduce function and set
>> >>>> > reduce_limit=false, so that I can enumerate all countries without
>> >>>> > hitting the overflow exception. It works great! However, such
>> >>>> > solutions are frowned upon by everyone around. Does anyone have a
>> >>>> > better idea on how to tackle this efficiently?
>> >>>> >
>> >>>> >    Andrey
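>> >>>> >
>> >>>> > (For anyone unfamiliar, reduce_limit is the setting in local.ini
>> >>>> > that makes CouchDB reject reduces whose output doesn't shrink:
>> >>>> >
>> >>>> > [query_server_config]
>> >>>> > reduce_limit = false
>> >>>> >
>> >>>> > Setting it to false disables that check.)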
>> >>>>
>> >>>
>> >>>
>> >>
>>
>
>
