lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: 30 million+ docs on a single server
Date Sat, 12 Aug 2006 21:08:59 GMT
Frustrated is the word :) I have looked at Solr...what I am worried
about there is this: Solr says it requires an OS that supports hard
links. Currently Windows does not, to my knowledge. Someone seemed to
comment that Windows could be supported...from what I know, I don't
think so. Not a deal breaker per se, but then there is this: I have
done a lot with the Lucene API. I have created a parser that turns a
custom query language into Lucene queries. I have changed the standard
parser. I have made heavy use of MultiSearchers. I am really tied into
the Lucene API, and I am worried about how easy it will be to integrate
all that into Solr. Perhaps I can just grab the distributed part of
Solr, but I do not know.

I have so much to do that worrying about distributed search seems like
too big a scope for now. It seemed to me that breaking up the index with
an RMI searcher was the easiest approach anyway. In the end...I would
really like to stay on one server. This server will probably have
multiple processors...should I make sure I incorporate a parallel
searcher option?
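
To make the parallel-searcher idea concrete, here is a minimal sketch of
the fan-out-and-merge pattern it implies: split the index into pieces,
search each piece on its own thread, then merge the per-piece results by
the sort field. Shard, Hit, and ParallelSearchSketch are hypothetical
names invented for this illustration, not Lucene classes; in a real
setup each Shard would wrap a searcher over one sub-index. Lucene of
this era also ships a ParallelMultiSearcher class that may be worth
trying before hand-rolling anything like this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one sub-index (not a Lucene type).
interface Shard {
    // Returns this shard's best n hits, ordered by the sort field.
    List<Hit> topHits(String query, int n);
}

// Hypothetical hit: a document id plus the value of the field we sort on.
final class Hit {
    final String docId;
    final long sortValue;

    Hit(String docId, long sortValue) {
        this.docId = docId;
        this.sortValue = sortValue;
    }
}

final class ParallelSearchSketch {
    // Queries every shard concurrently and returns the overall top n hits,
    // ascending by sortValue.
    static List<Hit> search(List<Shard> shards, String query, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (Shard shard : shards) {
                Callable<List<Hit>> task = () -> shard.topHits(query, n);
                pending.add(pool.submit(task));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> future : pending) {
                merged.addAll(future.get()); // blocks until that shard answers
            }
            // Each shard sent at most n hits, so sorting the union is cheap.
            merged.sort(Comparator.comparingLong((Hit h) -> h.sortValue));
            return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on where the time goes: extra threads can use
extra processors, but if the sub-indexes share one disk the searches may
simply queue up on I/O.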

In the end I am really just hoping for some more insight into this exact 
question:

Can I index 30 million+ docs that range in size from 2-10 KB on a single
server in a Windows environment (access to a max of about 1.5 GB of
RAM)? The average search will need to be sorted by field, not relevancy.
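
For a feel of what the sort alone costs in heap, here is a rough
back-of-the-envelope sketch. It assumes Lucene's field cache holds one
4-byte slot per document for an int or float sort field, and one 4-byte
ordinal per document plus the distinct term values for a String sort
field. The class name, the count of distinct values, and the bytes per
String below are illustrative assumptions, not measurements.

```java
// Rough heap estimate for a per-field sort cache (a sketch, not a measurement).
final class SortCacheEstimate {
    // int/float sort field: one 4-byte slot per document in the index.
    static long numericSortBytes(long maxDoc) {
        return maxDoc * 4L;
    }

    // String sort field: one 4-byte ordinal per document plus the unique
    // term values. bytesPerTerm is an assumed average size of one String.
    static long stringSortBytes(long maxDoc, long uniqueTerms, long bytesPerTerm) {
        return maxDoc * 4L + uniqueTerms * bytesPerTerm;
    }

    public static void main(String[] args) {
        long maxDoc = 30_000_000L;
        long mb = 1024L * 1024L;
        System.out.println("int field:    " + numericSortBytes(maxDoc) / mb + " MB");
        // Assumption: 1 million distinct values at about 60 bytes each.
        System.out.println("string field: "
                + stringSortBytes(maxDoc, 1_000_000L, 60L) / mb + " MB");
    }
}
```

Under these assumptions each numeric sort field costs a bit under
120 MB at 30 million docs, so a handful of different sort fields could
eat a large share of a 1.5 GB heap before the first query runs.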

Do you think it's possible or a pipe dream? I realize I need to test to
find out...but I am looking for someone with experience to chime in
before I get to that point.

Thanks for the response so far...I love the lucene mailing list.

Thanks,
Mark


Ray Tsang wrote:
> I've indexed 80m records and am now up to 200m. It can be done, and
> could've been done better. Like the others said, architecture is
> important. Have you considered looking into Solr? I haven't kept up
> with it (and many of the mailing lists...), but it looks very
> interesting.
>
> ray,
>
> On 8/12/06, Jason Polites <jason.polites@gmail.com> wrote:
>>
>> Sounds like you're a bit frustrated.  Cheer up, the simple fact is that
>> engineering and business rarely see eye-to-eye.  Just focus on the fact
>> that what you have learnt from the process will help you, and they paid
>> for it ;)
>>
>> On the issue at hand...Lucene should scale to this level, but you need a
>> good architecture behind it.  Google has good indexing tech, but it's
>> their architecture that allows them to spread the index across thousands
>> of servers which really gives it grunt (to the point that they invented
>> their own RAID-style file system).
>>
>> Just think very carefully about the architecture underpinning the index.
>> Lucene is core-tech.  It's up to you to provide the framework to make it
>> hum.
>>
>> On 8/12/06, Mark Miller <markrmiller@gmail.com> wrote:
>> >
>> > Tomi NA wrote:
>> > > On 8/12/06, Mark Miller <markrmiller@gmail.com> wrote:
>> > >> I've made a nice little archive application with Lucene. I made it
>> > >> to handle our largest need: 2.5 million docs or so on a single
>> > >> server. Now the powers that be say: let's use it for a 30+ million
>> > >> document archive on a single server! (Each doc is maybe 10k
>> > >> max...as small as 1 or 2k.) Please tell me why we are in
>> > >> trouble...please tell me why we are not. I have tested up to 2
>> > >> million docs without much trouble, but 30 million...the average
>> > >> search will include a sort on a field as well...can I search 30+
>> > >> million docs with a sort? Man, am I worried about that. Maybe the
>> > >> server will have 8 procs and 12 billion gigs of RAM. Maybe. Even
>> > >> still, Tomcat seems to be able to launch with a max of 1.5 or 1.6
>> > >> gigs of RAM on Windows. What do you think? 30 million+ sounds like
>> > >> too much of a load to me for a single server. Not that they care
>> > >> what I think...I only wrote the thing (man I hate my job, offer me
>> > >> a new one :) )...please...comments?
>> > >>
>> > >> Cheers,
>> > >>
>> > >> Miserable Mark
>> > >
>> > > I don't really understand what you're so worried about. Either it'll
>> > > work well with the setup you have, or it won't. It's really the size
>> > > of it. ;)
>> > > Seriously, you have a number of relatively cheap possibilities at
>> > > hand to improve search performance: storing the index on a RAID 5
>> > > disk array (which will let you read the indices very fast), using
>> > > multicore CPUs, adding memory, and, if all that isn't good enough,
>> > > you can always use a small cluster (say, 4 nodes) of very, very
>> > > inexpensive PCs, each filled with a GB of RAM. You don't have to
>> > > keep them inside the regular UPS/backup/vault-protected area, as
>> > > the indices can always be rebuilt (unlike e.g. data in
>> > > transactional systems), and the 4 of them together might cost
>> > > about as much as an entry-level server.
>> > > I'll let the experts speak now. :)
>> > >
>> > > t.n.a.
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: java-user-help@lucene.apache.org
>> > >
>> > >
>> > Thanks for the tip...I am not too worried...I am miserable because I
>> > live in Dilbert land, not because of this particular incident.
>> > Spreading to multiple servers is a possibility but one I want to
>> > avoid...I wrote this app on the side since our current product is
>> > crap...it still needs a lot of work and thinking about distributing
>> > Lucene at this point is a little much...I never even have time to
>> > work on this project as it is because I am currently tasked with
>> > porting the crap old project to Windows. I need to do a bunch to
>> > shore up what I have. No one cares though...they think that I have
>> > done nothing (or can't understand what I have done) while at the
>> > same time they want to use what I haven't done to do what I made it
>> > for as well as this new super archive of 30 million+ docs...in the
>> > end I'll be looking for a new job...still curious about Lucene
>> > scaling to 30 million docs with a sort on every search though (yes I
>> > know the sort is cached...worries me too though...the sort will be
>> > on multiple and different fields depending on what the searcher
>> > wants...uggg...the size of the caches....)
>> >
>> >
>> >
>>
>>
>



