From: Doug Cutting
Date: Thu, 05 Feb 2004 13:51:26 -0800
To: Lucene Users List
Subject: Re: SQLDirectory

Philippe Laflamme wrote:
> I've worked on an implementation for Postgres. I used the Large Object API
> provided by the Postgres JDBC driver. It works fine, but I doubt it is very
> scalable because the number of open connections during indexing can become
> very high.
>
> Lucene opens many different files when writing to an index. This results in
> opening one PG connection per open file. While working on a small index
> (30 000 files), I saw the number of open connections become quite high
> (approx. 150). If you don't have a lot of RAM, this is problematic.

A connection per file sounds very heavyweight.

The way I would try to implement Directory with SQL is to have a single
table of buffers per index, e.g., with columns ID, BLOCK_NUMBER and DATA.
The contents of a file are the concatenated DATA columns with the same ID,
ordered by the BLOCK_NUMBER field. This table would be indexed by ID and
BLOCK_NUMBER, which together form a unique key.

The BLOCK_NUMBER field indicates which part of the file the row holds.
Thus the DATA of BLOCK_NUMBER=0 might hold the first 1024 bytes, the DATA
of BLOCK_NUMBER=1 might hold the next 1024 bytes, and so on. This would
permit efficient random access.

You'll need another table with NAME, ID, and MODIFIED_DATE, with a single
entry per file. The length of a file can be computed with a query that
finds the length of the DATA in the row with the highest BLOCK_NUMBER for
that ID.

I would initially cache a single connection to the database and serialize
requests over it. A pool of connections might be more efficient when
multiple threads are searching, but I would benchmark that before
investing much in such an implementation.

Has anyone yet implemented an SQL Directory this way?

Doug
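
For concreteness, here is a rough, untested JDBC sketch of the layout
described above. The table names FILES and BLOCKS, the 1 KB block size,
the Postgres-flavored column types, and the class name are illustrative
assumptions only; none of this comes from Lucene or an existing
SQLDirectory implementation.

// Hypothetical sketch of the block-table layout described above.
// Table/column names and the 1 KB block size are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqlDirectorySketch {

    static final int BLOCK_SIZE = 1024;   // bytes of DATA per row

    // One cached connection; all requests are serialized through it
    // by synchronizing the methods below.
    private final Connection conn;

    public SqlDirectorySketch(String jdbcUrl, String user, String password)
            throws SQLException {
        this.conn = DriverManager.getConnection(jdbcUrl, user, password);
    }

    /** Create the two tables: one row per file, one row per 1 KB block. */
    public synchronized void createSchema() throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE FILES (" +
                " ID INTEGER PRIMARY KEY," +
                " NAME VARCHAR(255) UNIQUE NOT NULL," +
                " MODIFIED_DATE TIMESTAMP NOT NULL)");
            st.executeUpdate(
                "CREATE TABLE BLOCKS (" +
                " ID INTEGER NOT NULL REFERENCES FILES(ID)," +
                " BLOCK_NUMBER INTEGER NOT NULL," +
                " DATA BYTEA NOT NULL," +              // Postgres; use BLOB elsewhere
                " PRIMARY KEY (ID, BLOCK_NUMBER))");   // unique key doubles as the index
        }
    }

    /** Random access: fetch the single block containing byte offset 'pos'. */
    public synchronized byte[] readBlock(int fileId, long pos) throws SQLException {
        long blockNumber = pos / BLOCK_SIZE;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT DATA FROM BLOCKS WHERE ID = ? AND BLOCK_NUMBER = ?")) {
            ps.setInt(1, fileId);
            ps.setLong(2, blockNumber);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes(1) : null;
            }
        }
    }

    /** File length = full blocks before the last one, plus the last DATA's length. */
    public synchronized long fileLength(int fileId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT BLOCK_NUMBER, LENGTH(DATA) FROM BLOCKS " +
                "WHERE ID = ? ORDER BY BLOCK_NUMBER DESC LIMIT 1")) {
            ps.setInt(1, fileId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return 0L;
                return rs.getLong(1) * BLOCK_SIZE + rs.getLong(2);
            }
        }
    }
}

The synchronized methods stand in for the suggestion of a single cached
connection with serialized requests; swapping in a connection pool for
multi-threaded searching would only change the constructor and the
locking, which is the part worth benchmarking.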