Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <1246351752.3464.18.camel@pc286>
References: <7e536b1f0906261500m297efb0cv107e2b2c5cd94ac3@mail.gmail.com>
	 <7e536b1f0906281413m276606ccyca58036de05708b6@mail.gmail.com>
	 <4A4864E7.3070609@boboco.ie>
	 <7e536b1f0906290047g14322a5bm55f6740090fd32d2@mail.gmail.com>
	 <1246351752.3464.18.camel@pc286>
Date: Tue, 30 Jun 2009 22:59:40 +0200
Message-ID: <7e536b1f0906301359g3b7d1259v18987e82466ff48f@mail.gmail.com>
Subject: Re: Scaling out/up or a mix
From: Marcus Herou <marcus.herou@tailsweep.com>
To: java-user@lucene.apache.org, te@statsbiblioteket.dk
Content-Type: multipart/alternative; boundary=0016e6d37661d0c9c7046d9715d0

--0016e6d37661d0c9c7046d9715d0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi.

The number of concurrent users today is insignificant, but once we push for
the service we will get into trouble. I know that because even one simple
faceting query (which we will use to display trend graphs) can take forever
(talking about SOLR, by the way). Timing for "normal" Lucene queries
(title:blah OR description:blah) is reasonable for the current hardware, but
not good (currently 8 machines with 2GB RAM each, serving a 130G index).
They always take less than 10 secs, which of course is still a very bad user
experience.
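
To put rough numbers on that hardware, here is a minimal back-of-envelope
sketch in Python. It assumes the index is split evenly across the machines
and that all RAM could be used for caching, which is optimistic since the
JVM heap and the OS need their share. The function names are made up for
illustration and are not part of Lucene or SOLR.

```python
# Back-of-envelope sizing from the figures in this mail:
# 8 machines, 2 GB RAM each, serving a 130 GB index in total.
# Function names are illustrative only, not a Lucene or SOLR API.

def index_share_gb(total_index_gb: float, machines: int) -> float:
    """Index size per machine, assuming an even split across machines."""
    return total_index_gb / machines


def ram_coverage(ram_gb: float, shard_gb: float) -> float:
    """Upper bound on the fraction of a shard that fits in RAM.

    Optimistic: ignores memory needed by the JVM heap and the OS.
    """
    return min(1.0, ram_gb / shard_gb)


shard = index_share_gb(130, 8)
coverage = ram_coverage(2, shard)
print(f"{shard:.2f} GB per machine, at most {coverage:.0%} cacheable in RAM")
```

So each machine can hold only a small slice of its part of the index in
memory at best, and most searches have to go to disk.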

If anyone needs to understand more about the nature of this app, I think we
are quite similar to technorati (if we showed all the bling-bling) or
twingly.com. Basically a blogsearch app.

Example of a public query (no sorting on publisheddate but rather on
relevance = faster):
http://blogsearch.tailsweep.com/search.do?wa=test&la=all

And while you are at it, look at our cool BlogSpace:
http://blogsearch.tailsweep.com/showFeed.do?feedId=114799

Sorry, I don't mean to advertise, but I could not help it :)


//Marcus


On Tue, Jun 30, 2009 at 10:49 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:

> On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote:
> > Index size(and growing): 16Gx8 = 128G
> > Doc size (data): 20k
> > Num docs: 90M
> > Num users: A few hundred, but most critical is the admin staff, which is
> > using the index all day long.
> > Query types: Example: title:"Iphone" OR description:"Iphone" sorted by
> > publishedDate... = Very simple, no fuzzy searches etc. However since the
> > dataset is large it will consume memory on sorting I guess.
> >
> > Could one not draw any conclusions about best practice in terms of
> > hardware given the above "specs"?
>
> Can you give us an estimate of the number of concurrent searches in
> prime time and in what range a satisfactory response time would be?
>
> Going for a fully RAM-based search on a corpus of this size would mean
> that each machine holds about 30GB of index (taken from your hardware
> suggestion). I would expect that such a machine would be able to serve
> something like 500-1000 searches/second (highly dependent on the index
> and the searches, but what you're describing sounds simple enough) if we
> just measure the raw search time and lookup of one or two fields for the
> first 20 hits. Is that what you're aiming for?
>
> Wrapping in web services and such lowers the number of searches that can
> be performed, which makes the RAM-option even more expensive relative to
> a harddisk or SSD solution.
>
> > I mean it is very simple: Let's say someone gives me a budget of 50 000
> > USD and I then want to get the most bang for the buck for my workload.
>
> I am a bit unclear on your overall goal. Do you expect the number of
> users to grow significantly?


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

--0016e6d37661d0c9c7046d9715d0--