lucene-java-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Scaling out/up or a mix
Date Mon, 29 Jun 2009 08:06:50 GMT

It depends on your architecture: will you partition your index? What is the maximum expected size of
your index (you said 128G and growing)? What do you mean by growing? In both options you have
enough memory to load it into RAM...

I would definitely try to have fewer machines with a lot of memory, so that your index fits into
RAM comfortably...
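
To make "fits into RAM comfortably" concrete, here is a rough sketch using the figures from Marcus's message quoted below (128G index, 40 machines with 8G each versus 15 machines with 36G each). It assumes the index is split evenly with no replication, and that 2G per machine is held back for the OS and the JVM heap; both are my assumptions, not measurements, and the class and method names are made up for illustration.

```java
public class RamFitSketch {
    /** Index data each machine holds if the index is partitioned evenly. */
    static double shardGb(double indexGb, int machines) {
        return indexGb / machines;
    }

    /** RAM left for caching index files after an assumed reserve for OS and JVM heap. */
    static double cacheGb(double ramGb, double reservedGb) {
        return Math.max(0.0, ramGb - reservedGb);
    }

    /** How many times the index can grow before a shard no longer fits in that cache. */
    static double growthHeadroom(double indexGb, int machines, double ramGb, double reservedGb) {
        return cacheGb(ramGb, reservedGb) / shardGb(indexGb, machines);
    }

    public static void main(String[] args) {
        double indexGb = 128.0;   // current total index size, from the thread
        double reservedGb = 2.0;  // ASSUMPTION: RAM kept for OS and JVM heap per machine

        // Machine counts and RAM sizes are the two options quoted in the thread.
        int[] machines = {40, 15};
        double[] ramGb = {8.0, 36.0};

        for (int i = 0; i < machines.length; i++) {
            System.out.printf("%d machines x %.0fG: shard %.1fG, headroom %.1fx%n",
                    machines[i], ramGb[i],
                    shardGb(indexGb, machines[i]),
                    growthHeadroom(indexGb, machines[i], ramGb[i], reservedGb));
        }
    }
}
```

Under these assumptions the 36G option leaves roughly twice as much room for growth as the 8G option, but the picture changes as soon as you replicate shards for throughput, so treat it as a starting point only.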

IMO, 8 GB per machine is rather smallish, but it depends heavily on your access patterns: how
many documents do you need to load from disk per query? If this does not create a huge IO load, you
could try to load everything but the stored fields into RAM.
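
A back-of-the-envelope way to check the "documents loaded from disk per query" question is sketched below. It takes the top-10 page size and the 4-disk machines from the quoted messages; the query rates and the figure of about 100 random reads per second per spinning disk are assumptions of mine, as is the simplification that every displayed hit costs one uncached random read. The names are hypothetical.

```java
public class StoredFieldIoSketch {
    /** Random reads per second needed to load the documents shown for each query. */
    static double readsPerSecond(double queriesPerSecond, int docsLoadedPerQuery) {
        return queriesPerSecond * docsLoadedPerQuery;
    }

    /** Fraction of a machine's random-read capacity those loads would consume. */
    static double diskUtilization(double readsPerSecond, int disks, double iopsPerDisk) {
        return readsPerSecond / (disks * iopsPerDisk);
    }

    public static void main(String[] args) {
        int docsPerQuery = 10;       // top 10 hits per page, from the original message
        int disks = 4;               // 4 disks per machine, from the hardware options
        double iopsPerDisk = 100.0;  // ASSUMPTION: rough random-read rate of one spinning disk
        double[] qpsGuesses = {5.0, 20.0, 50.0}; // ASSUMPTION: per-machine query rates to try

        for (double qps : qpsGuesses) {
            double reads = readsPerSecond(qps, docsPerQuery);
            double util = diskUtilization(reads, disks, iopsPerDisk);
            System.out.printf("%.0f qps: %.0f reads/s, %.0f%% of disk capacity%n",
                    qps, reads, util * 100.0);
        }
    }
}
```

If the result stays well below 100% for your real query rate, leaving the stored fields on disk and keeping the rest in RAM looks reasonable; measure before relying on it.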

What are your requirements on the indexing side (once a day, once a week, every 15 minutes), and how
do you distribute the index to all these machines?

Your question, IO or CPU bound: it depends. If you load the index into RAM it becomes memory-bus/CPU
bound; if it is mainly on disk it will be IO bound.






----- Original Message ----
> From: Marcus Herou <marcus.herou@tailsweep.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 29 June, 2009 9:47:13
> Subject: Re: Scaling out/up or a mix
> 
> Thanks for the answer.
> 
> Don't you think that part 1 of the email would give you a hint of the nature of
> the index ?
> 
> Index size(and growing): 16Gx8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: A few hundred, but the most critical are the admin staff, who are
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> publishedDate... = Very simple, no fuzzy searches etc. However since the
> dataset is large it will consume memory on sorting I guess.
> 
> Could not one draw any conclusions about best-practice in terms of hardware
> given the above "specs" ?
> 
> Basically I would like to know if I really need 8 cores since machines with
> dual-cpu support are the most expensive and I would like to not throw away
> money so getting it right is a matter of economy.
> 
> I mean it is very simple: Let's say someone gives me a budget of 50 000 USD
> and I then want to get the most bang for the buck for my workload.
> Should I go for
> X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM costing me 1200USD
> a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
> or
> X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM costing me
> 3400 USD a piece (giving me 15 machines:  60 disks, 120 cores,  540G RAM)
> 
> Basically I would like to know what factors make the workload IO bound vs
> CPU bound ?
> 
> //Marcus
> 
> 
> 
> 
> 
> 
> On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman wrote:
> 
> > There is no single answer -- this is always application specific.
> >
> > Without knowing anything about what you are doing:
> >
> > 1. disk i/o is probably the most critical.  Go SSD or even RAM disk if
> > you can, if performance is absolutely critical
> > 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> > unless you are doing especially cpu-bound searches.
> >
> > Unless you are doing something with hard performance requirements, or
> > really quite unusual, buying "good" kit is probably good enough, and you
> > won't really know for sure until you measure.  Lucene is a general
> > enough tool that there isn't a terribly universal answer to this.  We
> > were a bit surprised to end up cpu-bound instead of disk i/o-bound, for
> > instance, but we ended up taking an unusual path.  YMMV.
> >
> > Marcus Herou wrote:
> > > Hi. I think I need to be more specific.
> > >
> > > What I am trying to find out is if I should aim for:
> > >
> > > CPU (2x4 cores, 2.0-3.0GHz)? or perhaps just 4 cores is enough.
> > > Fast disk IO: 8 disks, RAID1+0 ? or perhaps 2 disks is enough...
> > > RAM - if the index does not fit into RAM how much RAM should I then buy ?
> > >
> > > Please any hints would be appreciated since I am going to invest soon.
> > >
> > > //Marcus
> > >
> > > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou wrote:
> > >
> > >
> > >> Hi.
> > >>
> > >> I currently have an index which is 16GB per machine (8 machines = 128GB)
> > >> (data is stored externally, not in index) and is growing like crazy (we are
> > >> indexing blogs which is crazy by nature) and have only allocated 2GB per
> > >> machine to the Lucene app since we are running some other stuff there in
> > >> parallel.
> > >>
> > >> Each doc should be roughly the size of a blog post, no more than 20k.
> > >>
> > >> We currently have about 90M documents and it is increasing rapidly so
> > >> getting into the G+ document range is not going to be too far away.
> > >>
> > >> Now due to search performance I think I need to move these instances to
> > >> dedicated index/search machines (or index on some machines and search on
> > >> others). Anyway I would like to get some feedback about two things:
> > >>
> > >> 1. What is the most important hardware aspect when it comes to adding documents
> > >> to the index and optimizing it ?
> > >> 1.1 Is it disk I/O write throughput ? (sequential or random-io ?)
> > >> 1.2 Is it RAM ?
> > >> 1.3 Is it CPU ?
> > >>
> > >> My guess would be disk-io, right, wrong ?
> > >>
> > >> 2. What is the most important hardware aspect when it comes to searching
> > >> documents in my setup ? (result-set is limited to return only the top 10
> > >> matches with page handling)
> > >> 2.1 Is it disk read throughput ? (sequential or random-io ?)
> > >> 2.2 Is it RAM ?
> > >> 2.3 Is it CPU ?
> > >>
> > >> I have no clue since the data might not fit into memory. What is then the
> > >> most important factor ? read-performance while scanning the index ? CPU
> > >> while comparing fields and collecting results ?
> > >>
> > >> What I'm trying to find out is what I can do to get most bang for the buck
> > >> with a limited (aren't we all limited?) budget.
> > >>
> > >> Kindly
> > >>
> > >> //Marcus
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Marcus Herou CTO and co-founder Tailsweep AB
> > >> +46702561312
> > >> marcus.herou@tailsweep.com
> > >> http://www.tailsweep.com/
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> > --
> > Eric Bowman
> > Boboco Ltd
> > ebowman@boboco.ie
> > http://www.boboco.ie/ebowman/pubkey.pgp
> > +35318394189/+353872801532
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> -- 
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/



      
