Hi Ted,
Thank you so much for quick reply and it is very informative. Kindly find my reply below.
Regarding deviation, Yes. Like, we can compute the deviation by computing the difference of
new score with past score (which is mean of past few days of data).
Regarding variations on week boundaries having less impact, I am sorry. I couldn't get how
it will be less impact?. I am assuming that week boundaries means the mean of past 7 days
of data?
In order to make sure that I understood correctly, I am reiterating what you said. So, we
compute the occurrences for all items and then rank top 10k or 100k (Some N items). And, the
rest would be given rank as N+1. Now, we compute the score for each item as log (r_new/r_old)
(or is it log(r_old/r_new) as you specified below?). r_old being the rank for previous time
frame (week, day or hour etc). Now, we sort based on scores and we can get the items which
are varying heavily. Kindly correct me if I am wrong.
Also, when you mean k, is it number of occurrences?
Thanks
Pallavi
Original Message
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Thursday, September 24, 2009 10:55 AM
To: mahoutuser@lucene.apache.org
Subject: Re: Trending and prediction Analysis
When you say deviation, do you mean deviation in rate of something that is
nominally Poisson distributed (after accounting for diurnal and weekly
variations)?
If you can do the computation on week boundaries, then these variations have
less effect.
Alternatively, if you base the computation on rank, it is also much easier.
For the rankbased computation, if you assume Zipfian distribution, then the
cumulative distribution is close to log r (r = rank). This is typically
accurate for common web applications except for the top 1% or so of items.
This implies that the quantity
score = log (r_old / r_new)
is a good measure of the significance of the change for an item. This is
close to what we used at Veoh with very good results. We tried a number of
more complicated approaches and this was one of the best. Certainly, it was
the best simple measure. You could equivalently use log(k_new / k_old),
but using ranks self corrects for most frequency variations which is really
nice.
If all this is good enough for you, then you don't need anything fancy at
all. Just count occurrences using any technique you like, rank the top
umpteen of these (10,000 to 100,000 is plenty for most purposes). Then
compare the scores from period to period.
One gotcha is how to handle items that are brand new. The easiest answer is
to give them a rank just beyond anything you retain counts for. Thus, if
you keep 10,000 counts, any item that doesn't appear is given a rank of
10,001.
Was this even what you wanted?
On Wed, Sep 23, 2009 at 9:19 PM, Palleti, Pallavi <
pallavi.palleti@corp.aol.com> wrote:
> Hi Ted,
>
> I am especially looking for efficient algorithms for computing trends
> (Similar to Google's hottrends). The simple approach to start with it is to
> compute the deviation of item(URL/query) from its current score. I'm
> exploring if I can do the same more efficiently and effectively.
>
> Thanks
> Pallavi
>
> Original Message
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, September 23, 2009 10:29 PM
> To: mahoutuser@lucene.apache.org
> Subject: Re: Trending and prediction Analysis
>
> Can you say just a bit more about what you want?
>
> On Wed, Sep 23, 2009 at 9:45 AM, Palleti, Pallavi <
> pallavi.palleti@corp.aol.com> wrote:
>
> > Hi all,
> >
> > I am currently exploring trending and prediction analysis algorithms.
> > Kindly refer any good work/paper that you have come across.
> >
> > Thanks
> > Pallavi
> >
> >
>
>
> 
> Ted Dunning, CTO
> DeepDyve
>

Ted Dunning, CTO
DeepDyve
