Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 13124 invoked from network); 10 Nov 2006 15:25:09 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 10 Nov 2006 15:25:09 -0000 Received: (qmail 71699 invoked by uid 500); 10 Nov 2006 15:25:15 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 71657 invoked by uid 500); 10 Nov 2006 15:25:15 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 71646 invoked by uid 99); 10 Nov 2006 15:25:15 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Nov 2006 07:25:15 -0800 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Nov 2006 07:25:04 -0800 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 0A9F571431B for ; Fri, 10 Nov 2006 07:24:44 -0800 (PST) Message-ID: <1198282.1163172284041.JavaMail.jira@brutus> Date: Fri, 10 Nov 2006 07:24:44 -0800 (PST) From: "Michael McCandless (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system In-Reply-To: <2037036386.1143501871254.JavaMail.jira@ajax> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448774 ] Michael McCandless commented on LUCENE-532: ------------------------------------------- Sorry, I meant "dup of LUCENE-704 " above. > [PATCH] Indexing on Hadoop distributed file system > -------------------------------------------------- > > Key: LUCENE-532 > URL: http://issues.apache.org/jira/browse/LUCENE-532 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 1.9 > Reporter: Igor Bolotin > Priority: Minor > Attachments: indexOnDFS.patch, SegmentTermEnum.patch, TermInfosWriter.patch > > > In my current project we needed a way to create very large Lucene indexes on Hadoop distributed file system. When we tried to do it directly on DFS using Nutch FsDirectory class - we immediately found that indexing fails because DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason for this behavior is clear - DFS does not support random updates and so seek() method can't be supported (at least not easily). > > Well, if we can't support random updates - the question is: do we really need them? Search in the Lucene code revealed 2 places which call IndexOutput.seek() method: one is in TermInfosWriter and another one in CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the only place that concerned us was in TermInfosWriter. > > TermInfosWriter uses IndexOutput.seek() in its close() method to write total number of terms in the file back into the beginning of the file. It was very simple to change file format a little bit and write number of terms into last 8 bytes of the file instead of writing them into beginning of file. The only other place that should be fixed in order for this to work is in SegmentTermEnum constructor - to read this piece of information at position = file length - 8. > > With this format hack - we were able to use FsDirectory to write index directly to DFS without any problems. Well - we still don't index directly to DFS for performance reasons, but at least we can build small local indexes and merge them into the main index on DFS without copying big main index back and forth. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org