Subject: Re: problems with bulk indexing with concurrent DIH
From: Bernd Fehling <bernd.fehling@uni-bielefeld.de>
To: solr-user@lucene.apache.org
Date: Thu, 04 Aug 2016 09:31:24 +0200

After updating to version 5.5.3 it looks good now.
I think LUCENE-6161 has fixed my problem.
Nevertheless, after updating my development system and recompiling my plugins
I will have a look at DIH regarding the "update" and also at your advice
about the uniqueKey.

Best regards
Bernd

On 02.08.2016 at 16:16, Mikhail Khludnev wrote:
> These deletes seem really puzzling to me. Can you experiment with
> commenting out uniqueKey in schema.xml? My expectation is that the deletes
> should go away after that.
>
> On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <bernd.fehling@uni-bielefeld.de> wrote:
>
>> Hi Mikhail,
>>
>> there are no deletes at all from my point of view.
>> All records have unique IDs.
>> No sharding at all, it is a single index, and it is ensured
>> that all DIHs get different data to load and that no record is
>> sent twice to any DIH participating in the concurrent loading.
>>
>> My only assumption so far: DIH is sending the records as "update"
>> (and not as a pure "add") to the indexer, which will generate delete
>> files during merge. If the number of segments is high, it will
>> take quite a long time to merge and check all records of all segments.
>>
>> I'm currently setting up Solr 5.5.3, but that takes a while.
>> I also located an "overwrite" parameter somewhere in DIH which
>> should force an "add" instead of an "update" to the index, but
>> I couldn't figure out how to set the parameter with the command.
>>
>> Bernd
>>
>> On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
>>> Bernd,
>>> but why do you have so many deletes? Is that expected?
>>> When you run DIHs concurrently, do you shard the input data by uniqueKey?
>>>
>>> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <bernd.fehling@uni-bielefeld.de> wrote:
>>>
>>>> If there is a problem with a single index, then it might also exist in SolrCloud.
>>>> As far as I could figure out from the INFOSTREAM, documents are added to
>>>> segments and terms are "collected". Duplicate terms are "deleted" (or whatever).
>>>> These deletes (or whatever) are not concurrent.
>>>> I have lines like:
>>>> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
>>>> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
>>>> ...
>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>>
>>>> 3411845 msec are about 56 minutes during which the system is doing what???
>>>> At least not indexing, because there is only one Java process and no I/O at all!
>>>>
>>>> How can SolrJ help me with this problem?
>>>>
>>>> Best
>>>> Bernd
>>>>
>>>> On 27.07.2016 at 16:41, Erick Erickson wrote:
>>>>> Well, at least it'll be easier to debug, in my experience. Simple example:
>>>>> at some point you'll call CloudSolrClient.add(doc list). Comment just that
>>>>> out and you'll be able to isolate whether the issue is querying the back
>>>>> end or sending to Solr.
>>>>>
>>>>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
>>>>> routing...
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>
>>>>>> So writing some SolrJ code doing the same job as the DIH script
>>>>>> and running that concurrently will solve my problem?
>>>>>> I'm not using Tika.
>>>>>>
>>>>>> I don't think that DIH is my problem, even if it is not the best
>>>>>> solution right now.
>>>>>> Nevertheless, you are right that SolrJ has higher performance, but what
>>>>>> if I have the same problems with SolrJ as with DIH?
>>>>>>
>>>>>> If it runs with DIH, it should run with SolrJ, with an additional
>>>>>> performance boost on top.
>>>>>>
>>>>>> Bernd
>>>>>>
>>>>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>>>>>> I'd actually recommend you move to a SolrJ solution
>>>>>>> or similar. Currently, you're putting a load on the Solr
>>>>>>> servers (especially if you're also using Tika) in addition
>>>>>>> to all the indexing etc.
>>>>>>>
>>>>>>> Here's a sample:
>>>>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>>>>>
>>>>>>> Dodging the question, I know, but DIH sometimes isn't
>>>>>>> the best solution.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>>>>>
>>>>>>>> The server has 16 CPUs and more than 100G RAM.
>>>>>>>> Java (1.8.0_92) has 24G.
>>>>>>>> Solr is 4.10.4.
>>>>>>>> The plain XML data to load is 218G with about 96M records.
>>>>>>>> This will result in a single index of 299G.
>>>>>>>>
>>>>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>>>>>> 16 and 12 were too much for 16 CPUs, so my tests continued with 8
>>>>>>>> concurrent DIHs.
>>>>>>>> Then I was trying different indexConfig and mergePolicy settings,
>>>>>>>> but now I'm stuck.
>>>>>>>> I can't figure out what the best setting for bulk indexing is.
>>>>>>>> What I see is that the indexing is "falling asleep" after some time
>>>>>>>> of indexing.
>>>>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>>>>>
>>>>>>>> <maxIndexingThreads>8</maxIndexingThreads>
>>>>>>>> <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>>>>> <maxBufferedDocs>-1</maxBufferedDocs>
>>>>>>>>
>>>>>>>> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>>>>>   <int name="maxMergeAtOnce">8</int>
>>>>>>>>   <int name="segmentsPerTier">100</int>
>>>>>>>>   <int name="maxMergedSegmentMB">512</int>
>>>>>>>> </mergePolicy>
>>>>>>>>
>>>>>>>> <mergeFactor>8</mergeFactor>
>>>>>>>> <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>>>>> <lockType>${solr.lock.type:native}</lockType>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> ### no autocommit at all
>>>>>>>> <autoSoftCommit>
>>>>>>>>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>>>>> </autoSoftCommit>
>>>>>>>>
>>>>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>>>>>
>>>>>>>> After indexing finishes there is a final optimize.
>>>>>>>>
>>>>>>>> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
>>>>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>>>>>> It should do no commit and no optimize during loading.
>>>>>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to
>>>>>>>> make use of its speed.
>>>>>>>> segmentsPerTier is high to reduce merging.
>>>>>>>>
>>>>>>>> But there is a misconfiguration somewhere, because indexing stalls.
>>>>>>>>
>>>>>>>> Any idea what's going wrong?
>>>>>>>>
>>>>>>>> Bernd
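
For reference, the full-import command quoted above can also be fired from
SolrJ instead of a plain HTTP call. A minimal sketch, assuming SolrJ 5.x; the
core URL and the /dataimport handler path are placeholders for this
installation:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class TriggerFullImport {
        public static void main(String[] args) throws Exception {
            // Core URL is a placeholder for this installation.
            SolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");

            // Same parameters as in the command quoted above.
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            params.set("optimize", "false");
            params.set("clean", "false");
            params.set("commit", "false");
            params.set("waitSearcher", "false");

            QueryRequest request = new QueryRequest(params);
            request.setPath("/dataimport");   // assumed DIH handler path
            request.process(client);          // starts the import; DIH runs it asynchronously

            client.close();
        }
    }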
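
A minimal sketch of the kind of SolrJ indexer Erick suggests above as a DIH
replacement, again assuming SolrJ 5.x; the URL, field names, batch size, and
thread count are placeholders, and reading the source XML records is left out:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Queue size and thread count are illustrative, not tuned values.
            ConcurrentUpdateSolrClient client =
                    new ConcurrentUpdateSolrClient("http://localhost:8983/solr/core1", 10000, 8);

            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {       // stand-in for the loop over the XML records
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "record-" + i);      // uniqueKey field (placeholder name)
                doc.addField("title", "example " + i);  // further fields from the source record
                batch.add(doc);
                if (batch.size() == 500) {              // send in batches instead of one by one
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();                            // single commit at the end, as with DIH
            client.close();
        }
    }

ConcurrentUpdateSolrClient buffers and streams the documents with its own
threads, so several of these indexers can run in parallel, much like the
concurrent DIHs.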
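
On the "update" versus pure "add" point: Solr's update requests carry an
overwrite flag (default true), and that flag is what triggers the
delete-by-uniqueKey step. A hedged sketch of turning it off from SolrJ,
assuming the update handler in use honors the request-level overwrite
parameter; only safe when, as described above, no id ever arrives twice:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class AddWithoutOverwrite {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1"); // placeholder URL

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "record-1");   // uniqueKey field (placeholder name)
            doc.addField("title", "example");

            UpdateRequest request = new UpdateRequest();
            // overwrite=false asks Solr to skip the delete-by-uniqueKey step and do a pure add.
            // Only safe when each id is guaranteed to occur at most once in the input.
            request.setParam("overwrite", "false");
            request.add(doc);
            request.process(client);

            client.close();
        }
    }

If the uniqueKey really is unique across all loaders, skipping overwrite
should avoid exactly the applyDeletes work discussed above.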