lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Distributed Indexes, Searches and HDFS
Date Fri, 22 Sep 2006 13:23:32 GMT

I think this is a great question ("what's the best way to really scale
up Lucene?").  I don't have alot of experience in that area so I'll
defer to others (and I'm eager to learn myself!).

I think understanding Solr's overall approach (whose design I believe
came out of the thread you've referenced) is also a good step here.
Even if you can't reuse the hard-links trick, you might be able to
reuse its snapshotting & index distribution protocol.
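
For reference, the hard-link part of that trick looks something like
the sketch below (using today's java.nio for brevity; the class and
paths are hypothetical, and I believe Solr's actual scripts do this
with "cp -lr"):

  import java.io.IOException;
  import java.nio.file.*;

  // Publish a point-in-time view of an index directory by hard-linking
  // each file into a snapshot directory.  Links are cheap and share the
  // underlying data, so the snapshot can be shipped to searchers while
  // indexing continues.
  public class Snapshotter {
    public static void snapshot(Path indexDir, Path snapshotDir) throws IOException {
      Files.createDirectories(snapshotDir);
      try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
        for (Path file : files) {
          // createLink(newLink, existing); hard links do work on
          // Windows, but (per the quote below) open files can't then
          // be renamed/deleted, which is the sticking point.
          Files.createLink(snapshotDir.resolve(file.getFileName()), file);
        }
      }
    }
  }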

However, I have been working on some "bottom-up" improvements to
Lucene (getting native OS locking working and [separate but related]
"lock-less commits") that I think could be related to some of the
issues you're seeing with HDFS -- see below:

 > > The cronjob/link solution, which is quite clean, doesn't work well
 > > in a Windows environment.  While it's my favorite, no dice... Rats.
 > There may be hope yet for that on Windows.  Hard links work on
 > Windows, but the only problem is that you can't rename/delete any
 > links when the file is open. Michael McCandless is working on a
 > patch that would eliminate all renames (and deletes can be handled
 > by deferring them).

Right, with the "lock-less commits" patch we never rename a file and
never re-use a file name (ie, making Lucene's use of the filesystem
"write once").

 > 1) Indexing and Searching Directly from HDFS
 > Indexing to HDFS is possible with a patch if we don't use CFS. While
 > not ideal performance-wise, it's reliable and takes care of data
 > redundancy, component failure and means that I can have cheap small
 > drives instead of a large expensive NAS. It's also quite simple to
 > implement (see Nutch's indexer.FsDirectory for the Directory
 > implementation)

This is very interesting!  I don't know enough about HDFS (yet!).  On
a very quick read, I like that it's a "write once" filesystem because
it's a good match to lock-less commits.
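
For anyone following along, the Nutch route looks roughly like the
sketch below (the FsDirectory constructor arguments are from memory
and the path is made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.nutch.indexer.FsDirectory;

  public class HdfsIndexer {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Nutch's Directory implementation over HDFS (assumed
      // signature: fs, path, create, conf).
      FsDirectory dir = new FsDirectory(fs, new Path("/indexes/shard-0"), true, conf);
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      writer.setUseCompoundFile(false);  // no CFS on HDFS, per the patch above
      // ... addDocument() calls go here ...
      writer.close();
    }
  }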

 > So I would have several indexes (ie 16) and the same number of
 > indexers, and a searcher for each index (possibly in the same
 > process) that searches each one directly from HDFS. One problem I'm
 > having is an occasional FileNotFoundException. (Probably locking
 > related)
 > It comes out of the Searcher when I try to do a search while things
 > are being indexed. I'd be interested to know what exactly is
 > happening when this exception is thrown, maybe I can design around
 > it. (Do synchronization at the appropriate times or similar)

That exception looks disturbingly similar to the ones Lucene hits on
NFS.  See here for gory details:

The summary of that [long] issue is that these exceptions seem to be
caused by cache staleness of Lucene's "segments" file (an artifact of
how the NFS client does caching, even with an NFS V4 client/server),
and not in fact by locking (as had been previously assumed).  The
good news is that the lock-less commits patch resolves this, at least
in my testing so far (ie, it makes it possible to share a single
index over NFS).
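
Until that lands, one way to "design around it", as you say, is to
simply retry when the searcher trips over a just-deleted file.  A
sketch (the helper class and its retry policy are made up):

  import java.io.FileNotFoundException;
  import java.io.IOException;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.Directory;

  public class RetryOpen {
    // A commit can delete files between the moment a reader loads
    // "segments" and the moment it opens the segment files it names,
    // so retry the open a few times before giving up.
    public static IndexSearcher openWithRetry(Directory dir, int maxAttempts)
        throws IOException {
      for (int attempt = 1; ; attempt++) {
        try {
          return new IndexSearcher(dir);
        } catch (FileNotFoundException e) {
          if (attempt >= maxAttempts) throw e;
          try {
            Thread.sleep(100);  // brief pause; a newer commit may appear
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw e;
          }
        }
      }
    }
  }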

I wonder if a similar cause is at work in HDFS?  HDFS is "write once"
but the current Lucene isn't (not until we can get lock-less commits
in).  For example, it re-uses the "segments" file.

Even if lock-less commits ("write once") enable sharing a single copy
of the index over remote filesystems like HDFS, NFS, or SMB/CIFS,
whether that's performant enough (vs. replicating copies to local
filesystems, which are presumably quite a bit faster at IO, at the
expense of the local storage consumed) would still be a big open
question.
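
(For completeness, the replicate-to-local alternative is just a dumb
file copy between Directory implementations, along these lines -- a
sketch, with an arbitrary buffer size and a made-up class name:)

  import java.io.IOException;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IndexInput;
  import org.apache.lucene.store.IndexOutput;

  public class Replicator {
    // Pull a snapshot from a remote (HDFS/NFS) Directory into a local
    // one, trading local disk for faster search-time IO.
    public static void replicate(Directory remote, Directory local) throws IOException {
      byte[] buf = new byte[64 * 1024];
      for (String name : remote.list()) {
        IndexInput in = remote.openInput(name);
        IndexOutput out = local.createOutput(name);
        try {
          long left = in.length();
          while (left > 0) {
            int chunk = (int) Math.min(buf.length, left);
            in.readBytes(buf, 0, chunk);
            out.writeBytes(buf, chunk);
            left -= chunk;
          }
        } finally {
          out.close();
          in.close();
        }
      }
    }
  }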

