lucene-java-user mailing list archives

From Andy Goodell <good...@gmail.com>
Subject Re: Scaling out/up or a mix
Date Tue, 30 Jun 2009 16:07:15 GMT
I have improved date-sorted searching performance pretty dramatically by
replacing the two-step "search then sort" operation with a one-step "use the
date as the score" algorithm.  The main gotcha was making sure not to affect
which results get counted as hits in boolean searches, but overall I only
spent about a week on the project and got a 60x speed improvement on the
target set (from minutes to seconds).  YMMV, however, since the app requires
the collection of the complete set of results for analysis.
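
To illustrate the general idea only (this is not the code from the project
described above, and it does not use the Lucene API), here is a minimal
plain-Java sketch. The class DateAsScoreSketch, the Hit type and both method
names are hypothetical. It contrasts collecting every match and sorting
afterwards with ranking each match by its date as it is collected, assuming
only the top n results are wanted. The date is used purely for ranking and
never decides whether a document counts as a hit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; none of these names come from Lucene.
class DateAsScoreSketch {

    // One matching document: its id and its published date (epoch millis).
    static final class Hit {
        final int docId;
        final long publishedDate;

        Hit(int docId, long publishedDate) {
            this.docId = docId;
            this.publishedDate = publishedDate;
        }
    }

    // Newest first; ties are broken by the lower document id.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int byDate = Long.compare(b.publishedDate, a.publishedDate);
        return byDate != 0 ? byDate : Integer.compare(a.docId, b.docId);
    };

    // Two steps: keep every match, sort the whole list, then cut to n.
    static List<Integer> searchThenSort(List<Hit> matches, int n) {
        List<Hit> all = new ArrayList<>(matches);
        all.sort(BEST_FIRST);
        List<Integer> top = new ArrayList<>();
        for (Hit h : all.subList(0, Math.max(0, Math.min(n, all.size())))) {
            top.add(h.docId);
        }
        return top;
    }

    // One step: treat the date as the score and keep only the n best
    // matches in a bounded heap while the matches stream past.
    static List<Integer> dateAsScore(Iterable<Hit> matches, int n) {
        List<Integer> top = new ArrayList<>();
        if (n <= 0) {
            return top;
        }
        // The head of the heap is the worst of the hits kept so far.
        PriorityQueue<Hit> heap = new PriorityQueue<>(n, BEST_FIRST.reversed());
        for (Hit h : matches) {
            if (heap.size() < n) {
                heap.add(h);
            } else if (BEST_FIRST.compare(h, heap.peek()) < 0) {
                heap.poll(); // evict the current worst hit
                heap.add(h);
            }
        }
        List<Hit> kept = new ArrayList<>(heap);
        kept.sort(BEST_FIRST); // at most n elements to sort
        for (Hit h : kept) {
            top.add(h.docId);
        }
        return top;
    }
}
```

Both methods return the same document ids in the same order; the one-step
variant never holds more than n hits in memory at once. When the complete
result set has to be collected, as in the app described above, the bounded
heap does not apply as written.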

- andy g

On Mon, Jun 29, 2009 at 12:47 AM, Marcus Herou
<marcus.herou@tailsweep.com> wrote:

> Thanks for the answer.
>
> Don't you think that part 1 of the email would give you a hint of the
> nature of the index ?
>
> Index size(and growing): 16Gx8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: A few hundred, but most critical is the admin staff, who are
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> publishedDate... = Very simple, no fuzzy searches etc. However since the
> dataset is large it will consume memory on sorting I guess.
>
> Could not one draw any conclusions about best-practice in terms of hardware
> given the above "specs" ?
>
> Basically I would like to know if I really need 8 cores since machines with
> dual-cpu support are the most expensive and I would like to not throw away
> money so getting it right is a matter of economy.
>
> I mean it is very simple: Let's say someone gives me a budget of 50 000 USD
> and I then want to get the most bang for the buck for my workload.
> Should I go for
> X machines with quad-core 3.0GHz, 4 disks RAID1+0, 8G RAM costing me
> 1200 USD a piece (giving me 40 machines: 160 disks, 160 cores, 320G RAM)
> or
> X machines with dual quad-core 2.0GHz, 4 disks RAID1+0, 36G RAM costing me
> 3400 USD a piece (giving me 15 machines:  60 disks, 120 cores,  540G RAM)
>
> Basically I would like to know what factors make the workload IO bound vs
> CPU bound ?
>
> //Marcus
>
>
>
>
>
>
> On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman <ebowman@boboco.ie> wrote:
>
> > There is no single answer -- this is always application specific.
> >
> > Without knowing anything about what you are doing:
> >
> > 1. disk i/o is probably the most critical.  Go SSD or even RAM disk if
> > you can, if performance is absolutely critical
> > 2. Sometimes CPU can become an issue, but 8 cores is probably enough
> > unless you are doing especially cpu-bound searches.
> >
> > Unless you are doing something with hard performance requirements, or
> > really quite unusual, buying "good" kit is probably good enough, and you
> > won't really know for sure until you measure.  Lucene is a general
> > enough tool that there isn't a terribly universal answer to this.  We
> > were a bit surprised to end up cpu-bound instead of disk i/o-bound, for
> > instance, but we ended up taking an unusual path.  YMMV.
> >
> > Marcus Herou wrote:
> > > Hi. I think I need to be more specific.
> > >
> > > What I am trying to find out is if I should aim for:
> > >
> > > CPU (2x4 cores, 2.0-3.0GHz)? or perhaps just 4 cores is enough.
> > > Fast disk IO: 8 disks, RAID1+0 ? or perhaps 2 disks is enough...
> > > RAM - if the index does not fit into RAM how much RAM should I then buy ?
> > >
> > > Please any hints would be appreciated since I am going to invest soon.
> > >
> > > //Marcus
> > >
> > > On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou
> > > <marcus.herou@tailsweep.com> wrote:
> > >
> > >
> > >> Hi.
> > >>
> > >> I currently have an index which is 16GB per machine (8 machines = 128GB)
> > >> (data is stored externally, not in index) and is growing like crazy (we are
> > >> indexing blogs which is crazy by nature) and have only allocated 2GB per
> > >> machine to the Lucene app since we are running some other stuff there in
> > >> parallel.
> > >>
> > >> Each doc should be roughly the size of a blog post, no more than 20k.
> > >>
> > >> We currently have about 90M documents and it is increasing rapidly so
> > >> getting into the G+ document range is not going to be too far away.
> > >>
> > >> Now due to search performance I think I need to move these instances to
> > >> dedicated index/search machines (or index on some machines and search on
> > >> others). Anyway I would like to get some feedback about two things:
> > >>
> > >> 1. What is the most important hardware aspect when it comes to adding
> > >> documents to the index and optimizing it.
> > >> 1.1 Is it disk I/O write throughput ? (sequential or random-io ?)
> > >> 1.2 Is it RAM ?
> > >> 1.3 Is it CPU ?
> > >>
> > >> My guess would be disk-io, right, wrong ?
> > >>
> > >> 2. What is the most important hardware aspect when it comes to searching
> > >> documents in my setup ? (result-set is limited to return only the top 10
> > >> matches with page handling)
> > >> 2.1 Is it disk read throughput ? (sequential or random-io ?)
> > >> 2.2 Is it RAM ?
> > >> 2.3 Is it CPU ?
> > >>
> > >> I have no clue since the data might not fit into memory. What is then the
> > >> most important factor ? read-performance while scanning the index ? CPU
> > >> while comparing fields and collecting results ?
> > >>
> > >> What I'm trying to find out is what I can do to get most bang for the buck
> > >> with a limited (aren't we all limited?) budget.
> > >>
> > >> Kindly
> > >>
> > >> //Marcus
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Marcus Herou CTO and co-founder Tailsweep AB
> > >> +46702561312
> > >> marcus.herou@tailsweep.com
> > >> http://www.tailsweep.com/
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> > --
> > Eric Bowman
> > Boboco Ltd
> > ebowman@boboco.ie
> > http://www.boboco.ie/ebowman/pubkey.pgp
> > +35318394189/+353872801532
> >
> >
> >
> >
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
>
