lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Polites" <>
Subject Re: 30 milllion+ docs on a single server
Date Sat, 12 Aug 2006 16:22:57 GMT
Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
engineering and business rarely see eye-to-eye.  Just focus on the fact that
what you have learnt from the process will help you, and they paid for it ;)

On the issue at hand...Lucene should scale to this level, but you need a
good architecture behind it.  Google has good indexing tech, but it's their
architecture that allows them to spread the index across thousands of
servers which really gives it grunt (to the point that they invented their
own RAID-style file system).

Just think very carefully about the architecture underpinning the index.
Lucene is core-tech.  It's up to you to provide the framework to make it

On 8/12/06, Mark Miller <> wrote:
> Tomi NA wrote:
> > On 8/12/06, Mark Miller <> wrote:
> >> I've made a nice little archive application with lucene. I made it to
> >> handle our largest need: 2.5 million docs or so on a single server. Now
> >> the powers that be say: lets use it for a 30+ million document archive
> >> on a single server! (each doc size maybe 10k small as a 1 or
> >> 2k) Please tell me why we are in trouble...please tell me why we are
> >> not. I have tested up to 2 million docs without much trouble but 30
> >> million...the average search will include a sort on a field as
> >> well...can I search 30+ million docs with a sort? Man am I worried
> about
> >> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
> >> Mabye. Even still, Tomcat seems to be able to launch with a max of 1.5
> >> or 1.6 gig of Ram in Windows. What do you think? 30 million+ sounds
> like
> >> too much of a load to me for a single server. Not that they care what I
> >> think...I only wrote the thing (man I hate my job, offer me a new one
> :)
> >> )...please...comments?
> >>
> >> Cheers,
> >>
> >> Miserable Mark
> >
> > I don't really understand what you're so worried about. Either it'll
> > work well with the setup you have, or it won't. It's really the size
> > of it. ;)
> > Seriously, you have a number of relatively cheap possibilities at hand
> > to improve search performance: storing the index on a RAID 5 disk
> > array will let you read the indices very fast, using multicore CPUs,
> > adding memory and even if all that isn't good enough, you can always
> > use a small cluster (say, 4 nodes) of very, very inexpensive PCs
> > filled with a GB of RAM. You don't have to keep them inside the
> > regular UPS/backup/voult-protected area as the indices can always be
> > rebuilt (unlike e.g. data in transactional systems) and between 4 of
> > them they might cost like an entry-level server.
> > I'll let the experts speak now. :)
> >
> > t.n.a.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > For additional commands, e-mail:
> >
> >
> Thanks for the tip...I am not too worried...I am miserable because I
> live in Dilbert land, not this particular incident. Spreading to
> multiple servers is a possibility but one I want to avoid...I wrote this
> app on the side since our current product is still needs a lot
> of work and thinking about distributing lucene at this point is a little
> much...I never even have time to work on this project as it is becuase I
> am currently tasked with porting the crap old project to Windows. I need
> to do a bunch to shore up what I have. No one cares though...they think
> that I have done nothing (or can't understand what I have done) while at
> the same time they want to use what I havn't done to do what I made it
> for as well as this new super archive of 30 million + the end
> I'll be looking for a new job...still curious about lucene scaling to 30
> million docs with a sort on every search though (yes I know the sort is
> cached...worries me too though...the sort will be on multiple and
> different fields depending no what the searcher wants...uggg...the size
> of the caches....)
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message