From: Shaun Senecal
To: java-user
Subject: Re: manually merging Directories
Date: Mon, 29 Dec 2014 17:34:20 +0000
I'm not worried about the I/O right now, I'm "hoping I can do better", that's all. It sounds like the only actual complication here is building the segments_N file, which would list all of the newly renamed segments, so perhaps this isn't impossible.
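As background on the renaming question: as far as I can tell, Lucene names each segment from a per-index counter rendered in base 36 (`_0`, `_1`, ... `_9`, `_a`, ... `_z`, `_10`), which is why segments moved in from different source indexes would need fresh counter values. A minimal sketch of that naming scheme (plain Java, no Lucene dependency; `segmentName` is an illustrative helper, not a Lucene API):

```java
public class SegmentNames {

    // Sketch of Lucene's segment naming convention: the segment
    // counter is rendered in base 36 and prefixed with "_",
    // e.g. counter 10 -> "_a", counter 36 -> "_10".
    static String segmentName(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX); // MAX_RADIX == 36
    }

    public static void main(String[] args) {
        for (long n : new long[] {0, 1, 10, 35, 36}) {
            System.out.println(n + " -> " + segmentName(n));
        }
        // prints: 0 -> _0, 1 -> _1, 10 -> _a, 35 -> _z, 36 -> _10
    }
}
```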
That said, you're absolutely right about the possibility of complications, so it's debatable if doing something like this would be worth it in the end. Thanks for the info.

Shaun

________________________________________
From: Erick Erickson
Sent: December 23, 2014 5:55 PM
To: java-user
Subject: Re: manually merging Directories

I doubt this is going to work. I have to ask why you're worried about the I/O; this smacks of premature optimization. Not only do the files have to be moved, but the right control structures need to be in place to inform Solr (well, Lucene) exactly what files are current. There's a lot of room for programming errors here....

segments_N is the file that tells Lucene which segments are active. There can only be one that's active, so you'd have to somehow combine them all.

I think this is a dubious proposition at best, all to avoid some I/O. How much I/O are we talking here? If it's a huge amount, I'm not at all sure you'll be able to _use_ your merged index. How many docs are we talking about? 100M? 10B? I mean, you used M/R on it in the first place for a reason....

But this is what the --go-live option of the MapReduceIndexerTool already does for you. Admittedly, it copies things around the network to the final destination; personally, I'd just use that.

As you can tell, I don't know all the details to say it's impossible, but IMO this feels like wasted effort with lots of possibilities to get wrong for little demonstrated benefit.
You'd spend a lot more time trying to figure out the correct thing to do and then fixing bugs than you'll spend waiting for the copy, HDFS or no.

Best,
Erick

On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal wrote:
> Hi
>
> I have a number of Directories which are stored in various paths on HDFS, and I would like to merge them into a single index. The obvious way to do this is to use IndexWriter.addIndexes(...); however, I'm hoping I can do better. Since I have created each of the separate indexes using Map/Reduce, I know that there are no deleted or duplicate documents and the codecs are the same. Using addIndexes(...) will incur a lot of I/O as it copies from the source Directory into the dest Directory, and this is the bit I would like to avoid. Would it instead be possible to simply move each of the segments from each path into a single path on HDFS using a mv/rename operation instead? Obviously I would need to take care of the naming to ensure the files from one index don't overwrite another's, but it looks like this is done with a counter of some sort so that the latest segment can be found. A potential complication is the segments_1 file, as I'm not sure what that is for or if I can easily (re)construct it externally.
>
> The end goal here is to index using Map/Reduce and then spit out a single index in the end that has been merged down to a single segment, and to minimize I/O while doing it. Once I have the completed index in a single Directory, I can (optionally) perform the forced merge (which will incur a huge I/O hit). If the forced merge isn't performed on HDFS, it could be done on the search nodes before the active searcher is switched.
> This may be better if, for example, you know all of your search nodes have SSDs and I/O to spare.
>
> Just in case my explanation above wasn't clear enough, here is a picture.
>
> What I have:
>
> /user/username/MR_output/0
>     _0.fdt
>     _0.fdx
>     _0.fnm
>     _0.si
>     ...
>     segments_1
>
> /user/username/MR_output/1
>     _0.fdt
>     _0.fdx
>     _0.fnm
>     _0.si
>     ...
>     segments_1
>
>
> What I want (using simple mv/rename):
>
> /user/username/merged
>     _0.fdt
>     _0.fdx
>     _0.fnm
>     _0.si
>     ...
>     _1.fdt
>     _1.fdx
>     _1.fnm
>     _1.si
>     ...
>     segments_1
>
>
> Thanks,
>
> Shaun

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
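The mv/rename half of the picture above is mechanical; the unsolved part of the thread remains producing a valid segments_N. A sketch of just the file-shuffling step, using plain java.nio as a local-filesystem stand-in for the HDFS rename (class and method names are illustrative, this is not a Lucene API, and it deliberately skips segments_* files rather than attempting to rebuild them):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class MergeByRename {

    // Move segment files from each source index dir into destDir,
    // re-prefixing each source segment with a fresh base-36 name
    // ("_0", "_1", ..., "_a", ...) so files from different indexes
    // never collide. segments_* files are skipped: a merged
    // segments_N would still have to be written somehow.
    static void renameInto(List<Path> sourceDirs, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        long counter = 0;
        for (Path src : sourceDirs) {
            // First pass: map each old segment prefix ("_0", ...) in
            // this source to a fresh prefix from the global counter.
            Map<String, String> rename = new TreeMap<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src, "_*")) {
                for (Path f : files) {
                    String name = f.getFileName().toString();
                    String oldPrefix = name.substring(0, name.indexOf('.'));
                    if (!rename.containsKey(oldPrefix)) {
                        rename.put(oldPrefix, "_" + Long.toString(counter++, 36));
                    }
                }
            }
            // Second pass: move the files under their new names,
            // keeping each file's extension (.fdt, .si, ...).
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src, "_*")) {
                for (Path f : files) {
                    String name = f.getFileName().toString();
                    int dot = name.indexOf('.');
                    String newName = rename.get(name.substring(0, dot)) + name.substring(dot);
                    Files.move(f, destDir.resolve(newName));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny demo with empty placeholder files standing in for segments.
        Path tmp = Files.createTempDirectory("merge-demo");
        Path a = Files.createDirectories(tmp.resolve("0"));
        Path b = Files.createDirectories(tmp.resolve("1"));
        for (Path d : List.of(a, b)) {
            Files.createFile(d.resolve("_0.fdt"));
            Files.createFile(d.resolve("_0.si"));
        }
        renameInto(List.of(a, b), tmp.resolve("merged"));
        // merged/ now holds _0.fdt and _0.si from the first dir,
        // plus _1.fdt and _1.si renamed from the second dir.
    }
}
```

On real HDFS the same shape applies, with `FileSystem.rename` in place of `Files.move`; either way the renames are cheap metadata operations, which is the whole appeal over addIndexes' byte copy.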