lucy-user mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-user] IO ponderings
Date Sun, 18 Sep 2011 18:44:19 GMT
On Sat, Sep 17, 2011 at 12:47 PM, Marvin Humphrey
<marvin@rectangular.com> wrote:
> On Sat, Sep 17, 2011 at 08:52:41AM +0200, goran kent wrote:
>> I've been wondering (and I'll eventually get around to performing a
>> comparative test sometime this weekend) about IO and search
>> performance (ie, ignore OS caching).

As Marvin pointed out, while it's fine to ask what happens when you
ignore OS caching, realize that OS caching is crucial to Lucy's
performance.  We don't do our own caching; rather, we rely on the OS
to do it for us.  A clear understanding of your operating system's
virtual memory system will be very helpful in figuring out
bottlenecks.

If you're not already intimate with these details, this article is a
good start: http://duartes.org/gustavo/blog/category/internals
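If you want to see the mechanism in miniature, here's a toy C sketch
of an mmap-backed read.  The file path is made up, and this
illustrates the general technique, not Lucy's actual code -- the point
is that the kernel, not the application, decides which pages stay
resident:

    /* Minimal sketch of reading an index file through the OS page
     * cache via mmap.  Path is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "index/seg_1/postings.dat";  /* made up */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only.  Pages are faulted in on
         * first access and cached by the kernel, so repeated reads
         * are served from RAM. */
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED,
                          fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a byte is where a page fault (and possibly a disk
         * read) happens -- not at mmap() time. */
        printf("first byte: %d\n", base[0]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }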

>> What's the biggest cause of search degradation when Lucy is chugging
>> through its on-disk index?
>>
>> Physically *finding* data (ie, seeking and thrashing around the disk),
>> waiting for data to *transfer* from the disk to CPU?

This is going to depend on your exact use case.  I think you can
assume that all accesses that can be sequential off the disk will be,
or can easily be made sequential by consolidating the index.  Thus if
you are searching for a small number of common words from text
documents, search time will depend primarily on bulk transfer speed
from your disk.  If, on the other hand, each query is for a list of
hundreds of rare part numbers, seek time will dominate and an SSD
might help a lot.
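To get a feel for the difference on your own hardware, a rough sketch
like the following (file name and counts are made up; run it against a
cold cache, e.g. after dropping the page cache) contrasts sequential
and random reads of the same file:

    /* Rough sketch: time sequential vs. random 4KB reads over one
     * file.  Everything here (name, counts) is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define BLOCK 4096
    #define READS 10000

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int fd = open("bigfile.dat", O_RDONLY);  /* made-up file */
        if (fd < 0) { perror("open"); return 1; }
        off_t size = lseek(fd, 0, SEEK_END);
        char buf[BLOCK];

        double t0 = now_sec();
        for (int i = 0; i < READS; i++) {        /* sequential pass */
            if (pread(fd, buf, BLOCK, (off_t)i * BLOCK) < 0)
                perror("pread");
        }
        double t1 = now_sec();
        srand(42);
        for (int i = 0; i < READS; i++) {        /* random pass */
            off_t off = (rand() % (size / BLOCK)) * (off_t)BLOCK;
            if (pread(fd, buf, BLOCK, off) < 0)
                perror("pread");
        }
        double t2 = now_sec();

        printf("sequential: %.3fs  random: %.3fs\n",
               t1 - t0, t2 - t1);
        close(fd);
        return 0;
    }

On a spinning disk with a cold cache, the random pass can easily be an
order of magnitude slower; on an SSD the gap mostly disappears.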

And on the earlier question of rerunning queries until adequate
coverage is achieved: this probably isn't as inefficient as you'd
guess.  Presuming you'll be reading a bunch of data from disk anyway,
once that data is in the OS cache, running another 5 queries probably
doesn't even double your search time.  The exception is an index so
large that you can't fit even a single query's data into RAM, in which
case you've got other problems.
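A quick way to convince yourself of this is to time two passes over
the same file; in this small sketch (hypothetical path, plain C) the
second, cached pass should run at close to RAM speed:

    /* Sketch of the "second query is nearly free" effect: the first
     * pass may hit disk, the second is served from the page cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double elapsed_read(const char *path) {
        char buf[1 << 16];
        struct timespec a, b;
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1.0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        while (read(fd, buf, sizeof buf) > 0) { /* pull pages in */ }
        clock_gettime(CLOCK_MONOTONIC, &b);
        close(fd);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        const char *path = "index/seg_1/postings.dat";  /* made up */
        printf("cold pass: %.3fs\n", elapsed_read(path));
        printf("warm pass: %.3fs\n", elapsed_read(path));
        return 0;
    }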


> Well, the projects I've been involved with have taken the approach that there
> should always be enough RAM on the box to fit the necessary index files.  "RAM
> is the new disk" as they say.
>
> I can tell you that once an index is in RAM, we're CPU bound.

While it's probably technically true that we're CPU bound, I think the
way to improve performance is not by shaving cycles but by finding
better ways to take advantage of memory locality.  Currently we do a
pretty good job of avoiding disk access.  Eventually we'll get better
at avoiding RAM access too, so we can do more operations in L3 cache.

Sticking with Gustavo:
http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait

Ulrich's paper is a great intro as well:
http://people.redhat.com/drepper/cpumemory.pdf
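To see what's at stake, here's a toy microbenchmark (the sizes and the
stride are arbitrary choices of mine, not anything from Lucy): it does
the same number of additions sequentially and then with a
cache-hostile stride, and the strided pass loses badly once the
working set blows out L3:

    /* Toy demonstration of memory locality: same work, very
     * different wall time.  Sizes are made up. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)  /* 256MB of ints: bigger than L3 */
    #define STRIDE 4099           /* prime stride defeats prefetch */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (long i = 0; i < N; i++) a[i] = 1;

        double t0 = now_sec();
        long sum = 0;
        for (long i = 0; i < N; i++)            /* sequential */
            sum += a[i];
        double t1 = now_sec();
        for (long i = 0; i < N; i++)            /* strided */
            sum += a[(i * STRIDE) % N];
        double t2 = now_sec();

        printf("sum=%ld  sequential: %.3fs  strided: %.3fs\n",
               sum, t1 - t0, t2 - t1);
        free(a);
        return 0;
    }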

>> I'm quite interested to know whether using an SSD where seek time and
>> other latency issues are almost zero would dramatically improve search
>> times.  I've seen vast improvements when using them in RDBMS', but
>> this may not translate as well here.
>
> I would speculate that with SSDs you'd get a more graceful performance
> degradation as Lucy's RAM requirements start to exceed what the box can
> provide.  But I have no numbers to back that up.

It will depend on a lot of factors.  My instinct is that SSDs will
help but won't be cost effective.  I think you'd be better off
spending your budget on a motherboard that supports a lot of RAM
(which probably means a dual Xeon:
http://www.supermicro.com/products/motherboard/Xeon1333), as much ECC
RAM as you can afford (144GB = $4K; 288GB = $10K), and then a cheap
RAID of big, fast spinning disks.

I don't know these guys, but they might be useful for quick ballpark
prices: http://www.abmx.com/dual-xeon-server

> My index is way too large to fit into RAM - yes, it's split across a
> cluster, but there are physical space and cost constraints, so the
> cluster cannot get much larger.  That's my reality, unfortunately.
>
> Hence my emphasis on IO and ways to address that with alternate tech
> such as SSD.

Goran: you'll probably get better advice if you offer more details on
these constraints, your anticipated usage, a real estimate of corpus
size, and your best guess at usage patterns.  "Way too large" can mean
many things to many people.  There may be a sweet spot between 1TB and
10TB where an SSD RAID makes sense, but below that I think you're
better off with RAM, and above that things probably get unwieldy.
Numbers would help.

Equally, "physical space and cost constraints" have a lot of wiggle
room.  Are you dumpster diving for 286's, or trying to avoid custom
made motherboards?   Do you have only a single rack to work with, are
you trying to make something that can be worn as a wrist watch?  :)

Nathan Kurz
nate@verse.com

ps.  One other comment that I haven't seen made: Lucy is optimized for
a 64-bit OS.  Most of the development and testing has been done on
Linux.  Thus if you are performance obsessed, trying to run at large
scale, and want something that works out of the box, you probably want
to be running 64-bit Linux at this point.
