From: Hrishikesh Agashe
To: "common-user@hadoop.apache.org"
Date: Tue, 10 Nov 2009 15:26:33 +0530
Subject: Lucene + Hadoop

Hi,

I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on the contents of the files (i.e. if the author is "hrishikesh", the file should be added to an index for "hrishikesh"; there has to be a separate index for every author). For this, I keep one IndexWriter open per author and maintain them in a hashmap in the map() function. I parse each incoming file, and if its author is one for which I have already opened an IndexWriter, I just add the file to that index; otherwise I create a new IndexWriter for the new author. As the authors might run into the thousands, I close the open IndexWriters and clear the hashmap once it reaches a certain threshold, then start all over again. There is no reduce function.

Does this logic sound correct? Is there any other way of implementing this requirement?

--Hrishi
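
The logic described above can be sketched roughly as follows. This is a minimal plain-Java illustration, not the actual job code: `AuthorIndex` is a hypothetical stand-in for Lucene's IndexWriter that only records documents in memory, and `AuthorIndexPool`, `maxOpen`, `openCount()` and `flushCount()` are names invented for the example. In a real job, an author seen again after a flush would need the existing on-disk index reopened for appending rather than recreated, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lucene's IndexWriter: it only keeps documents in memory.
class AuthorIndex implements AutoCloseable {
    final String author;
    final List<String> docs = new ArrayList<>();
    boolean closed = false;

    AuthorIndex(String author) {
        this.author = author;
    }

    void addDocument(String doc) {
        if (closed) {
            throw new IllegalStateException("index already closed: " + author);
        }
        docs.add(doc);
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Keeps one open index per author and closes all of them once the threshold is reached.
class AuthorIndexPool {
    private final int maxOpen;                        // assumed threshold on open indexes
    private final Map<String, AuthorIndex> open = new HashMap<>();
    private int flushes = 0;                          // how often closeAll() has run

    AuthorIndexPool(int maxOpen) {
        this.maxOpen = maxOpen;
    }

    // Called once per parsed file, as map() would be.
    void add(String author, String doc) {
        AuthorIndex index = open.get(author);
        if (index == null) {
            // New author: make room first if the threshold has been reached.
            if (open.size() >= maxOpen) {
                closeAll();
            }
            index = new AuthorIndex(author);
            open.put(author, index);
        }
        index.addDocument(doc);
    }

    // Closes every open index and clears the map.
    void closeAll() {
        for (AuthorIndex index : open.values()) {
            index.close();
        }
        open.clear();
        flushes++;
    }

    int openCount() {
        return open.size();
    }

    int flushCount() {
        return flushes;
    }
}
```

Whatever is still open when the map task finishes also has to be closed, so `closeAll()` would need to be called once more at the end of the task.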