hive-user mailing list archives

From Radek Maciaszek <ra...@maciaszek.co.uk>
Subject Re: Unique users analysis
Date Wed, 07 Sep 2011 08:21:37 GMT
Hello,

In case anyone looks for a similar solution in the future, I have written a
short blog post on this subject:
http://www.dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/

Best,
Radek

On 14 January 2011 12:50, Radek Maciaszek <radek@maciaszek.co.uk> wrote:

> Hi Itai,
>
> I had not thought about sampling users instead of sampling records, but it
> makes much more sense indeed.
> As it happens, my ID is also hexadecimal, so I did exactly what you
> suggested. In the results, my error is less than 1% compared to the observed
> values!
>
> Many thanks!!
> Radek
>
>
> On 14 January 2011 11:32, Itai Hochman <itai@outbrain.com> wrote:
>
>> We had a similar challenge and dealt with it by sampling based on the
>> user id.
>>
>> We have a unique id in a random hexadecimal format, for instance
>> A12890900.
>>
>> The query runs only on users whose id ends with 00.
>>
>> At the end we multiply by 256 and get a number pretty close to the real
>> number.
>>
>>
>>
>> Itai
>>
>>
>>
>> On 01/14/2011 01:14 PM, Radek Maciaszek wrote:
>>
>>> Hi,
>>>
>>> I am working on some large-scale unique users analysis (think hundreds of
>>> millions of records per day). Since the number of records per month runs
>>> into many billions, I am hoping there may be some alternative to running
>>> "SELECT DISTINCT user_unique_id...", such as sampling the data or perhaps
>>> deriving monthly numbers from daily numbers of unique users using
>>> linear regression or a similar statistical technique.
>>>
>>> I tried sampling, but it does not seem to give me the results I
>>> would expect. For example, if I sample 1 in every 100 records, I would
>>> expect to see about 1% of the total number of unique users for a given
>>> period, but instead I am seeing bigger numbers, something like 5%.
>>> I am not sure whether the law of large numbers applies to unique users
>>> analysis.
>>>
>>> Moreover, it seems that "select distinct" does not scale linearly (which
>>> is understandable), so a monthly unique users analysis takes far
>>> longer than 30 x the time of a daily unique users analysis and requires a
>>> lot of memory.
>>>
>>> I was wondering if anyone has had a similar challenge?
>>>
>>> Many thanks,
>>> Radek
>>>
>>
>>
>
>
>
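The two sampling approaches discussed in this thread can be compared with a small simulation. The sketch below is only an illustration on synthetic data, not the queries either poster ran: the function names, the 8-digit hexadecimal ids, the figure of 1 to 20 records per user, and the fixed seed are all assumptions made for the example. It counts distinct ids whose last two hex digits are 00 and scales by 16^2 = 256, and separately counts the distinct ids found in a 1-in-100 sample of records.

```python
import random


def make_records(num_users, seed=0):
    """Build a shuffled list of synthetic records, one user id per record."""
    rng = random.Random(seed)
    # Random 8-digit hex ids; the set drops the rare duplicate draws.
    users = sorted({f"{rng.getrandbits(32):08X}" for _ in range(num_users)})
    records = []
    for uid in users:
        # Assumption: every user produces between 1 and 20 records.
        records.extend([uid] * rng.randint(1, 20))
    rng.shuffle(records)
    return records


def exact_uniques(records):
    """Exact distinct count, the equivalent of SELECT COUNT(DISTINCT ...)."""
    return len(set(records))


def estimate_by_id_suffix(records, suffix="00"):
    """Sample users, not records: keep ids ending in `suffix`, then scale up.

    With uniformly random hex ids, a suffix of n digits selects about
    1 / 16**n of all users, so the sampled count is multiplied by 16**n.
    """
    sampled = {uid for uid in records if uid.endswith(suffix)}
    return len(sampled) * 16 ** len(suffix)


def uniques_in_record_sample(records, every=100):
    """Distinct ids seen when keeping one record out of every `every`."""
    return len(set(records[::every]))


records = make_records(200_000)
true_count = exact_uniques(records)
estimate = estimate_by_id_suffix(records)
in_sample = uniques_in_record_sample(records)

print(f"exact uniques:             {true_count}")
print(f"id-suffix estimate (x256): {estimate}")
print(f"uniques in 1% of records:  {in_sample} "
      f"({100 * in_sample / true_count:.1f}% of all uniques)")
```

On data like this, the 1% record sample contains far more than 1% of the unique users, because a user with many records has many chances to be picked. Sampling on the id instead keeps either all of a user's records or none of them, which is why the scaled count stays close to the exact one.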
