lucene-java-user mailing list archives

From Marcus Herou <marcus.he...@tailsweep.com>
Subject Re: Scaling out/up or a mix
Date Mon, 29 Jun 2009 14:40:04 GMT
Hi, thanks for your answer. Comments inline.

On Mon, Jun 29, 2009 at 10:06 AM, eks dev <eksdev@yahoo.co.uk> wrote:

>
> Depends on your architecture: will you partition your index? What is the max
> expected size of your index (you said 128G and growing)? What do you mean
> by growing? In both options you have enough memory to load it into RAM...

Yes, we partition the index with a simple round-robin algorithm.
The options were only meant to show the reader what kind of hardware you get
depending on which path you choose. I do not really have that amount of money
to spend right now; more like 1/6th of it.
We crawl blogs. The number of blogs we find is still increasing and we are
nowhere near indexing all languages, so the index will grow at least linearly.
Let's say at least 10-20G a month or so?
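
A minimal sketch of what round-robin partitioning amounts to, in Python for brevity. The function names and the idea of a running arrival counter are assumptions made for illustration; the thread does not show the actual indexing code.

```python
from collections import Counter


def round_robin_shard(doc_seq: int, num_shards: int) -> int:
    """Return the shard for the doc_seq-th document (0-based arrival order)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return doc_seq % num_shards


def distribute(num_docs: int, num_shards: int) -> Counter:
    """Count how many of num_docs documents land on each shard."""
    return Counter(round_robin_shard(i, num_shards) for i in range(num_docs))


if __name__ == "__main__":
    # 8 shards, as in the current 8-machine setup described in the thread.
    per_shard = distribute(num_docs=1_000, num_shards=8)
    for shard, docs in sorted(per_shard.items()):
        print(f"shard {shard}: {docs} docs")
```

Because the shard depends only on arrival order and not on content, shard sizes never differ by more than one document, but a query cannot be routed to a single shard: it has to be sent to all of them and the results merged.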

>
>
> I would definitely try to have fewer machines and a lot of memory, so that
> your index fits into RAM comfortably...

OK, so you mean that one should aim for fitting the shard into RAM...
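
A rough way to check whether each shard fits into RAM, and for how long given the growth estimate, is sketched below. The machine counts, RAM sizes, index size and growth rate come from this thread; the 2G per machine reserved for the OS and JVM is an assumption, as are all function names.

```python
def shard_size_gb(total_index_gb: float, machines: int) -> float:
    """Index size per machine when the index is split evenly."""
    return total_index_gb / machines


def months_until_ram_exceeded(total_index_gb: float,
                              growth_gb_per_month: float,
                              machines: int,
                              ram_per_machine_gb: float,
                              reserved_gb_per_machine: float = 2.0) -> float:
    """Months until the index no longer fits in the cluster's usable RAM.

    reserved_gb_per_machine is an assumed allowance for OS and JVM overhead.
    Returns 0.0 if the index already does not fit.
    """
    usable_gb = machines * (ram_per_machine_gb - reserved_gb_per_machine)
    headroom_gb = usable_gb - total_index_gb
    if headroom_gb <= 0:
        return 0.0
    return headroom_gb / growth_gb_per_month


if __name__ == "__main__":
    options = [("40 machines x 8G", 40, 8.0), ("15 machines x 36G", 15, 36.0)]
    for name, machines, ram_gb in options:
        for growth in (10.0, 20.0):
            months = months_until_ram_exceeded(128.0, growth, machines, ram_gb)
            print(f"{name}, {growth:.0f}G/month: fits for {months:.1f} months")
```

This only models capacity, not query load; it says nothing about whether the cores in either option are enough.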

>
>
> IMO, 8Gig per machine is rather smallish, but it depends heavily on your
> access patterns... how many documents do you need to load from disk per
> query? If this does not create a huge load on IO, you could try to load
> everything but stored fields into RAM

We store no fields in the index besides the actual DB id. We load no more
than 50 docs at a time.

>
>
> What are your requirements on the indexing side (once a day, once a week,
> every 15 minutes), and how do you distribute the index to all these machines...

We index during all non-office hours.

>
>
> Your question: IO or CPU bound? It depends. If you load it into RAM it
> becomes memory-bus/CPU bound; if it is mainly on disk it will be IO bound

OK, as I suspected. That answers my previous question(s).
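
One rough way to see why the bottleneck flips is to model the average cost of an index access as a mix of RAM hits and disk seeks. The sketch below does that; the 8 ms per random disk access and 100 ns per RAM access are order-of-magnitude assumptions for spinning disks, not measurements from this setup.

```python
def expected_access_ms(cache_hit_ratio: float,
                       ram_access_ms: float = 0.0001,
                       disk_seek_ms: float = 8.0) -> float:
    """Average cost of one index access for a given RAM hit ratio.

    ram_access_ms and disk_seek_ms are assumed, illustrative figures.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - cache_hit_ratio
    return cache_hit_ratio * ram_access_ms + miss_ratio * disk_seek_ms


if __name__ == "__main__":
    for ratio in (1.0, 0.99, 0.9, 0.5):
        print(f"hit ratio {ratio:.2f}: {expected_access_ms(ratio):.4f} ms/access")
```

Under these assumptions, even a small share of accesses going to disk dominates the average, which is why an index that mostly fits in RAM can still behave as IO bound.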


Final question:

Based on your findings, what is the most challenging part to tune: sorting,
querying, or something else?
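
On the sorting side, a back-of-envelope memory estimate may help frame the question. As far as I know, Lucene's field sorting caches one value per document in the index for the sorted field, so the cost scales with shard size rather than with the number of hits. The sketch assumes publishedDate is cached as one 8-byte value per document; the thread does not say how the date is indexed, and a string-typed sort field would cost more.

```python
def docs_per_shard(total_docs: int, num_shards: int) -> int:
    """Documents per shard, assuming an even split (rounded up)."""
    return -(-total_docs // num_shards)


def sort_cache_bytes(docs_in_shard: int, bytes_per_value: int = 8) -> int:
    """Memory for one cached sort value per document.

    bytes_per_value=8 assumes a 64-bit timestamp; this is an assumption.
    """
    return docs_in_shard * bytes_per_value


if __name__ == "__main__":
    # 90M docs today; 1G docs is the "G+ range" the original post anticipates.
    for total in (90_000_000, 1_000_000_000):
        docs = docs_per_shard(total, num_shards=8)
        mib = sort_cache_bytes(docs) / (1024 * 1024)
        print(f"{total:,} docs over 8 shards: {mib:.1f} MiB per sorted field")
```

The estimate is per sorted field and per shard, and it has to fit inside whatever heap is given to the search process (2GB per machine in the current setup).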

//Marcus



>
> ----- Original Message ----
> > From: Marcus Herou <marcus.herou@tailsweep.com>
> > To: java-user@lucene.apache.org
> > Sent: Monday, 29 June, 2009 9:47:13
> > Subject: Re: Scaling out/up or a mix
> >
> > Thanks for the answer.
> >
> > Don't you think that part 1 of the email would give you a hint of the
> > nature of the index?
> >
> > Index size(and growing): 16Gx8 = 128G
> > Doc size (data): 20k
> > Num docs: 90M
> > Num users: A few hundred, but most critical is the admin staff, who use
> > the index all day long.
> > Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> > publishedDate... = Very simple, no fuzzy searches etc. However since the
> > dataset is large it will consume memory on sorting I guess.
> >
> > Could one not draw any conclusions about best practice in terms of
> > hardware given the above "specs"?
> >
> > Basically I would like to know if I really need 8 cores, since machines
> > with dual-CPU support are the most expensive and I would like to not
> > throw away money, so getting it right is a matter of economy.
> >
> > I mean it is very simple: Let's say someone gives me a budget of 50 000
> > USD and I then want to get the most bang for the buck for my workload.
> > Should I go for
> > X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM costing me
> > 1200 USD a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
> > or
> > X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM costing
> > me 3400 USD a piece (giving me 15 machines: 60 disks, 120 cores, 540G RAM)
> >
> > Basically I would like to know what factors make the workload IO bound vs
> > CPU bound ?
> >
> > //Marcus
> >
> >
> >
> >
> >
> >
> > On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman wrote:
> >
> > > There is no single answer -- this is always application specific.
> > >
> > > Without knowing anything about what you are doing:
> > >
> > > 1. disk i/o is probably the most critical.  Go SSD or even RAM disk if
> > > you can, if performance is absolutely critical
> > > 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> > > unless you are doing especially cpu-bound searches.
> > >
> > > Unless you are doing something with hard performance requirements, or
> > > really quite unusual, buying "good" kit is probably good enough, and
> > > you won't really know for sure until you measure.  Lucene is a general
> > > enough tool that there isn't a terribly universal answer to this.  We
> > > were a bit surprised to end up cpu-bound instead of disk i/o-bound, for
> > > instance, but we ended up taking an unusual path.  YMMV.
> > >
> > > Marcus Herou wrote:
> > > > Hi. I think I need to be more specific.
> > > >
> > > > What I am trying to find out is if I should aim for:
> > > >
> > > > CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
> > > > Fast disk IO: 8 disks, RAID1+0? Or perhaps 2 disks is enough...
> > > > RAM: if the index does not fit into RAM, how much RAM should I then
> > > > buy?
> > > >
> > > > Please, any hints would be appreciated since I am going to invest
> > > > soon.
> > > >
> > > > //Marcus
> > > >
> > > > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou wrote:
> > > >
> > > >
> > > >> Hi.
> > > >>
> > > > >> I currently have an index which is 16GB per machine (8 machines =
> > > > >> 128GB) (data is stored externally, not in the index) and is growing
> > > > >> like crazy (we are indexing blogs, which is crazy by nature), and I
> > > > >> have only allocated 2GB per machine to the Lucene app since we are
> > > > >> running some other stuff there in parallel.
> > > > >>
> > > > >> Each doc should be roughly the size of a blog post, no more than 20k.
> > > > >>
> > > > >> We currently have about 90M documents and it is increasing rapidly,
> > > > >> so getting into the G+ document range is not going to be too far away.
> > > > >>
> > > > >> Now due to search performance I think I need to move these instances
> > > > >> to dedicated index/search machines (or index on some machines and
> > > > >> search on others). Anyway I would like to get some feedback about two
> > > > >> things:
> > > >>
> > > > >> 1. What is the most important hardware aspect when it comes to adding
> > > > >> documents to the index and optimizing it?
> > > > >> 1.1 Is it disk I/O write throughput? (sequential or random I/O?)
> > > > >> 1.2 Is it RAM?
> > > > >> 1.3 Is it CPU?
> > > > >>
> > > > >> My guess would be disk I/O. Right or wrong?
> > > > >>
> > > > >> 2. What is the most important hardware aspect when it comes to
> > > > >> searching documents in my setup? (The result set is limited to the
> > > > >> top 10 matches, with paging.)
> > > > >> 2.1 Is it disk read throughput? (sequential or random I/O?)
> > > > >> 2.2 Is it RAM?
> > > > >> 2.3 Is it CPU?
> > > >>
> > > > >> I have no clue since the data might not fit into memory. What is then
> > > > >> the most important factor? Read performance while scanning the index?
> > > > >> CPU while comparing fields and collecting results?
> > > > >>
> > > > >> What I'm trying to find out is what I can do to get the most bang for
> > > > >> the buck with a limited (aren't we all limited?) budget.
> > > >>
> > > >> Kindly
> > > >>
> > > >> //Marcus
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Marcus Herou CTO and co-founder Tailsweep AB
> > > >> +46702561312
> > > >> marcus.herou@tailsweep.com
> > > >> http://www.tailsweep.com/
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Eric Bowman
> > > Boboco Ltd
> > > ebowman@boboco.ie
> > > http://www.boboco.ie/ebowman/pubkey.pgp
> > >
> > > +35318394189/+353872801532
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
