hive-user mailing list archives

From Radek Maciaszek <ra...@maciaszek.co.uk>
Subject Re: Unique users analysis
Date Fri, 14 Jan 2011 12:50:50 GMT
Hi Itai,

I had not thought about sampling users instead of sampling records, but it
makes much more sense indeed.
As it happens, my ID is also hexadecimal, so I did exactly what you
suggested. In the results, my error is less than 1% compared to the observed
values!

Many thanks!!
Radek
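
For readers of the archive, here is a minimal Python sketch of the user-id
sampling idea quoted below. It is not the Hive query used in this thread: the
function name estimate_unique_users and the synthetic data are assumptions
made purely for illustration. It keeps only ids ending in "00" and scales the
distinct count by 256.

```python
import random


def estimate_unique_users(user_ids, suffix="00"):
    """Estimate the distinct-user count from ids ending in `suffix`.

    Hypothetical helper: assumes ids are uniformly random hex strings.
    """
    # Each hex digit of the suffix keeps 1 in 16 ids, so a two-digit
    # suffix keeps roughly 1 in 16**2 = 256 users.
    scale = 16 ** len(suffix)
    wanted = suffix.lower()
    sampled = {uid for uid in user_ids if uid.lower().endswith(wanted)}
    return len(sampled) * scale


# Synthetic data (an assumption, not from the thread): about 200,000 users
# with random 32-bit hex ids, each generating between 1 and 20 records.
rng = random.Random(42)
users = {f"{rng.getrandbits(32):08X}" for _ in range(200_000)}
records = [uid for uid in users for _ in range(rng.randint(1, 20))]

true_count = len(set(records))
estimate = estimate_unique_users(records)
relative_error = abs(estimate - true_count) / true_count
print(true_count, estimate, f"{relative_error:.1%}")
```

Because a user is either fully in or fully out of the sample, the number of
records per user does not bias the estimate. The error depends on how many
users fall into the sampled bucket, so it shrinks as the user base grows.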

On 14 January 2011 11:32, Itai Hochman <itai@outbrain.com> wrote:

> We had a similar challenge and dealt with it by sampling based on the
> user id.
>
> We have a unique id in a random hexadecimal format, for instance
> A12890900.
>
> The query runs only on users whose id ends with 00.
>
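> At the end we multiply by 256 (two hex digits have 16 x 16 = 256 possible
> values, so ids ending in 00 cover about 1 in 256 users) and get a number
> pretty close to the real number.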
>
>
>
> Itai
>
>
>
> On 01/14/2011 01:14 PM, Radek Maciaszek wrote:
>
>> Hi,
>>
>> I am working on some large scale unique users analysis (think hundreds of
>> millions of records per day). Since the number of all records per month goes
>> into many billions, I am hoping that there may be some alternative to running
>> "SELECT DISTINCT user_unique_id...", such as sampling data or perhaps
>> deriving monthly numbers from daily numbers of unique users using
>> linear regression or a similar statistical technique.
>>
>> I tried to use sampling but that does not seem to give me the results I
>> would expect. For example, if I sample 1 in every 100 records I would
>> expect to see about 1% of the total number of unique users from a given
>> period of time, but instead I am seeing bigger numbers, for example something
>> like 5%. I am not sure if the law of large numbers applies to unique users
>> analysis.
>>
>> Moreover, it seems that "select distinct" does not scale linearly (which is
>> understandable), so the monthly unique users analysis takes far longer
>> than 30 times the daily unique users analysis and requires a lot of
>> memory.
>>
>> I was wondering if anyone had a similar challenge?
>>
>> Many thanks,
>> Radek
>>
>
>
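
The 5% figure in the original question is what record-level sampling would be
expected to produce. Assuming each record is kept independently with
probability p, a user with k records appears in the sample with probability
1 - (1 - p)^k, which is close to k * p for small p. The Python sketch below
illustrates this on synthetic data; the function name sampled_user_fraction
and the activity distribution (1 to 9 records per user) are assumptions, not
figures from the thread.

```python
import random


def sampled_user_fraction(records, rate, rng):
    """Fraction of all distinct users that appear in a random record sample.

    Hypothetical helper: keeps each record independently with probability `rate`.
    """
    sample = [uid for uid in records if rng.random() < rate]
    return len(set(sample)) / len(set(records))


rng = random.Random(7)

# Synthetic data (an assumption): 100,000 users, each with 1 to 9 records,
# so about 5 records per user on average.
records = [uid for uid in range(100_000) for _ in range(rng.randint(1, 9))]

fraction = sampled_user_fraction(records, rate=0.01, rng=rng)

# A 1% sample of records captures far more than 1% of the users, because
# every extra record gives a user another chance to be picked.
print(f"{fraction:.2%}")
```

With about five records per user, a 1% record sample sees close to 5% of the
users, so multiplying the sampled distinct count by 100 overestimates the
total. Sampling on the user id avoids this because each user is either fully
included or fully excluded.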
