lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: commit often and lot of data cost too much?
Date Wed, 01 Apr 2009 18:11:36 GMT
Funny you should ask.

We had a similar problem at Veoh.  I think that this kind of problem is
relatively common.

Taking video viewing as the poster child, the metadata about videos comes
in several flavors:

- The title and description and publisher info
- A pointer to the actual video bits
- The view counts
- Rating data
- The history of who viewed the video (for recommendation systems and such)
- Stats about how people play the video (just 4 seconds?  all the way
  through?  From the primary interface?  From embedded references?)

We can categorize this data on a couple of different axes.  One is update
rate:

- Some of this data is very rarely updated (the video pointer and publisher
  info).
- Some is updated more commonly, but still pretty rarely (title and
  description).
- Some is updated fairly often (ratings).
- And some is updated ALL the time (view counts especially, but view history
  and view stats as well).

Another categorization is based on how you plan to search the data:

- Title and description and length and publisher and date published (users
  searching using the search box and advanced search)
- Play history and ratings (recommendation systems doing off-line analysis)
- Not usually searched (encoding, number of audio tracks, size in bytes and so
  on)

Usually, high-volume sites have to store data differently depending on size,
change rate and purpose.  You then hide those different search and storage
decisions behind an access layer.  For what you are doing, you should put into
Lucene only those things which have a low change rate and which must be
searched.  You should put high-mutation-rate data into something like
memcached with some persistent back-store.  Very large data items such as the
video itself should be in an entirely different kind of store (at Veoh we
used a very heavily hacked version of Danga's MogileFS).
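
As a rough sketch of that split (not Veoh's actual code), an access layer can
route each field to the store that suits its change rate.  Everything named
here is hypothetical: SearchIndex stands in for a Lucene index, CounterStore
for memcached plus a back-store, and the field sets are only examples.

```python
from typing import Any, Dict, List

# Hypothetical routing table, following the rule of thumb above.
SEARCH_FIELDS = {"title", "description", "publisher"}  # low change rate, searched
VOLATILE_FIELDS = {"view_count", "rating"}             # high mutation rate


class SearchIndex:
    """In-memory stand-in for a search index; not a real Lucene client."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, str]] = {}
        self.commits = 0  # counts the (expensive) index updates

    def update(self, video_id: str, fields: Dict[str, str]) -> None:
        self.docs.setdefault(video_id, {}).update(fields)
        self.commits += 1

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return sorted(
            vid for vid, fields in self.docs.items()
            if any(term in value.lower() for value in fields.values())
        )


class CounterStore:
    """In-memory stand-in for a cache with a persistent back-store."""

    def __init__(self) -> None:
        self.cache: Dict[str, Dict[str, Any]] = {}

    def put(self, video_id: str, fields: Dict[str, Any]) -> None:
        self.cache.setdefault(video_id, {}).update(fields)

    def get(self, video_id: str) -> Dict[str, Any]:
        return dict(self.cache.get(video_id, {}))


class VideoAccessLayer:
    """Hides which store holds which field from the rest of the site."""

    def __init__(self) -> None:
        self.index = SearchIndex()
        self.counters = CounterStore()

    def update(self, video_id: str, **fields: Any) -> None:
        unknown = set(fields) - SEARCH_FIELDS - VOLATILE_FIELDS
        if unknown:
            raise KeyError(f"no store configured for: {sorted(unknown)}")
        searchable = {k: v for k, v in fields.items() if k in SEARCH_FIELDS}
        volatile = {k: v for k, v in fields.items() if k in VOLATILE_FIELDS}
        if searchable:  # only rare edits touch the index
            self.index.update(video_id, searchable)
        if volatile:    # frequent counters never trigger an index commit
            self.counters.put(video_id, volatile)

    def search(self, term: str) -> List[Dict[str, Any]]:
        # Search the index, then decorate each hit with its live counters.
        return [
            {"id": vid, **self.index.docs[vid], **self.counters.get(vid)}
            for vid in self.index.search(term)
        ]
```

With this shape, bumping view_count a thousand times leaves the index commit
count unchanged, and search results still carry the current count because it
is merged in at read time.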

Your two-phase update trick will work reasonably well in the short term, but
if your traffic is growing quickly it won't last very long, because the full
update has to rewrite every video and so gets nastier as the collection grows.
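
To make the growth problem concrete, here is a toy count of index writes under
the two-phase scheme.  The numbers are made up for illustration, and the model
assumes the nightly pass rewrites every video once.

```python
from typing import List, Tuple


def two_phase_writes(total_docs: int, new_docs: int) -> Tuple[int, int]:
    """Return (daytime light writes, nightly full writes) for one day."""
    light = new_docs              # daytime: only newly added videos
    full = total_docs + new_docs  # night: every video, old and new
    return light, full


def simulate(total_docs: int, new_docs_per_day: int, days: int) -> List[Tuple[int, int, int]]:
    """Return one (day, light, full) row per day as the collection grows."""
    rows = []
    for day in range(1, days + 1):
        light, full = two_phase_writes(total_docs, new_docs_per_day)
        rows.append((day, light, full))
        total_docs += new_docs_per_day
    return rows


# Hypothetical figures: 1000 videos to start, 100 new ones per day.
for day, light, full in simulate(1000, 100, 3):
    print(day, light, full)
```

The light update stays flat while the full update grows with the whole
collection, which is why the nightly pass is the part that stops scaling.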

On Wed, Apr 1, 2009 at 1:06 AM, sunnyfr <johanna.34@gmail.com> wrote:

>
> Yep, but we won't change the system now :(
> Or maybe I can have two kinds of schema?
> One for the new videos during the day, so just new data, and the other
> one by night, which updates all characteristics of the videos?  Full update
> nightly
> and light new update during the day?
> What do you think?
> Because the other characteristics are not that important, but they are used
> for filters, most viewed, comments ...
>
>
