lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: system design for big numbers
Date Wed, 27 Aug 2008 05:26:17 GMT

You could try the approach you described - one index per user.  When I built Simpy (see
) a few years ago I chose the same approach and I never regretted it.  The hardware behind
Simpy is very modest, usage is high, and I never had problems with too many indices open (but
you do have to keep track of them and close them when they are no longer needed).

Another approach is to put users in buckets, so you have a single index that holds data for
multiple users.  This is a very common approach not just with Lucene, but also database scaling.

With numbers you described you have to think distributed indexing, distributed search, lots
of servers, horizontal scaling, index data/user partitioning, etc.

Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: Giovanni Mascia <>
> To:
> Sent: Tuesday, August 26, 2008 7:56:25 PM
> Subject: system design for big numbers
> I've been wandering for a while through this list and other Lucene
> resources on the web trying to figure out the possible outlines of a
> search solution which could fit my case. But as a Lucene newbie I
> decided to ask for your help.
> Now this is the scenario. I am building a webmail application which
> allows users to aggregate their email addresses. Every user can setup
> up to 6 mailboxes on his user account on the website. From day 1 the
> site enables the new user to start downloading emails from his
> accounts.
> Email messages and attachments are permanently stored (I am not
> describing that here, just presume that the storing side of the
> problem is solved).
> Now the task is to let users do full-text searches into their stored
> emails (forget the attachments).
> Say that an average user does receive and downloads 30 emails per day
> per account. This makes 180 mails/day, 5400 mails/month, 32400 mails
> on the first 6 months.
> Say that on day 1 we do face with 100,000 registered users and that
> this user base does not change on the 6 months timeframe (ok it's not
> realistic, it's just to put down a sort of a scale).
> We will face 100,000*32,400 = 3,240,000,000 mails, say 20KB per
> message, it's 64,800,000,000KB which makes some 60TBs (!!).
> The Lucene index would be some 20TBs (let's say that it would take the
> 30% of the space needed for the actual content, as stated elsewhere in
> this list).
> OK let's finally presume that my forecast is unrealistic or just fool,
> let's say that we will have to handle a Lucene index of only  3 or 4
> TBs.
> I was thinking that the best and simplest approach would be creating
> one index per user: every user will search only its own mail querying
> only its own index. Multiple not-too-big indexes could be easily moved
> and scaled to additional disks or machines.
> But I've read that even 100 indexes could be too much (also for OS
> limitations in concurrently keeping files open, of course).
> Maybe my scenario is not a common one, but from where would you start
> when trying to outlining a solution which in 6 months could take to a
> 3-4 TBs index?
> Browsing the web it seems that scaling or distributing Lucene indexes
> is quite a task, I have in mind too many different options for a
> possible architecture but I don't know where to start.
> As you may understand I'm not a search-specialist, I presume that in
> case of a good start of my project I will get some help or, better,
> some money for hire someone. The task is to "resist" until that time
> without creating in the first 3-6 months an un-manageable, un-scalable
> situation.
> Thanks for sharing with me your points of view.
> And I may add that the search feature is absolutely needed from the
> start, building it in a second phase is out of question.
> Giovanni
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message