Date: Fri, 22 May 2015 12:15:26 +0300
Subject: Re: Indexing gets significantly slower after every batch commit
From: Angel Todorov
To: solr-user@lucene.apache.org

Thanks for the feedback, guys. What I am going to try now is deploying
my SOLR server on a physical machine with more RAM and checking this
scenario there. I suspect it could well be a hypervisor issue, but
let's see. Just for the record: I've noticed these issues on a Win
2008R2 VM with 8 GB of RAM and 2 cores.

I don't see anything strange in the logs. One thing that I need to
change, though, is the verbosity of the console logging: it looks like
by default SOLR writes a log line for every single document that's
indexed, as well as for every query that's executed.

Angel

On Fri, May 22, 2015 at 1:03 AM, Erick Erickson wrote:

> bq: Which is logical as index growth and time needed to put something
> to it is log(n)
>
> Not really. Solr indexes to segments; each segment is a fully
> consistent "mini index". When a segment gets flushed to disk, a new
> one is started. Of course there'll be a _little bit_ of added
> overhead, but it shouldn't be all that noticeable.
>
> Furthermore, they're "append only". In the past, when I've indexed the
> Wiki example, my indexing speed has actually gotten faster.
>
> So on the surface this sounds very strange to me. Are you seeing
> anything at all in the Solr logs that's suspicious?
>
> Best,
> Erick
>
> On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets wrote:
> > Hi Angel
> >
> > We also noticed that kind of performance degradation in our
> > workloads.
> >
> > Which is logical as index growth and time needed to put something to
> > it is log(n)
> >
> > On Thursday, 21 May 2015, Angel Todorov wrote:
> >
> >> hi Shawn,
> >>
> >> Thanks a bunch for your feedback. I've played with the heap size,
> >> but I don't see any improvement. Even if I index, say, a million
> >> docs, and the throughput is about 300 docs per sec, and then I shut
> >> down solr completely - after I start indexing again, the throughput
> >> is dropping below 300.
> >>
> >> I should probably experiment with sharding those documents to
> >> multiple SOLR cores - that should help, I guess.
> >> I am talking about something like this:
> >>
> >> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
> >>
> >> Thanks,
> >> Angel
> >>
> >> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey wrote:
> >>
> >> > On 5/21/2015 2:07 AM, Angel Todorov wrote:
> >> > > I'm crawling a file system folder and indexing 10 million docs,
> >> > > and I am adding them in batches of 5000, committing every 50 000
> >> > > docs. The problem I am facing is that after each commit, the
> >> > > number of documents indexed per second keeps dropping.
> >> > >
> >> > > If I do not commit at all, I can index those docs very quickly,
> >> > > and then I commit once at the end, but once I start indexing
> >> > > docs _after_ that (for example new files get added to the
> >> > > folder), indexing is also slowing down a lot.
> >> > >
> >> > > Is it normal that the SOLR indexing speed depends on the number
> >> > > of documents that are _already_ indexed? I think it shouldn't
> >> > > matter if I start from scratch or I index a document in a core
> >> > > that already has a couple of million docs. Looks like SOLR is
> >> > > either doing something in a linear fashion, or there is some
> >> > > magic config parameter that I am not aware of.
> >> > >
> >> > > I've read all perf docs, and I've tried changing mergeFactor,
> >> > > autowarmCounts, and the buffer sizes - to no avail.
> >> > >
> >> > > I am using SOLR 5.1
> >> >
> >> > Have you changed the heap size? If you use the bin/solr script to
> >> > start it and don't change the heap size with the -m option or
> >> > another method, Solr 5.1 runs with a default size of 512MB, which
> >> > is *very* small.
> >> >
> >> > I bet you are running into problems with frequent and then
> >> > ultimately constant garbage collection, as Java attempts to free
> >> > up enough memory to allow the program to continue running.
> >> > If that is what is happening, then eventually you will see an
> >> > OutOfMemoryError exception. The solution is to increase the heap
> >> > size. I would probably start with at least 4G for 10 million docs.
> >> >
> >> > Thanks,
> >> > Shawn
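
For anyone trying to reproduce the slowdown described in this thread, here is a minimal sketch of the indexing loop from the original question (batches of 5000, a commit every 50 000 docs) that also records throughput for each commit interval, so the drop can be measured rather than eyeballed. It is written in Python purely for illustration. `send_batch` and `commit` are hypothetical callables standing in for whatever Solr client is actually in use; neither they nor `index_with_timing` come from the thread or from any Solr API.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` documents."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_with_timing(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: posts one batch
    commit: Callable[[], None],                # hypothetical: issues a commit
    batch_size: int = 5000,
    commit_every: int = 50000,
) -> List[Tuple[int, float]]:
    """Index docs in batches and commit periodically.

    Returns one (total_docs_so_far, docs_per_second) pair per commit
    interval, which makes a per-interval slowdown easy to see.
    """
    rates: List[Tuple[int, float]] = []
    total = 0
    since_commit = 0
    started = time.perf_counter()

    def record() -> None:
        nonlocal since_commit, started
        commit()
        elapsed = time.perf_counter() - started
        rate = since_commit / elapsed if elapsed > 0 else float("inf")
        rates.append((total, rate))
        since_commit = 0
        started = time.perf_counter()

    for batch in batched(docs, batch_size):
        send_batch(batch)
        total += len(batch)
        since_commit += len(batch)
        if since_commit >= commit_every:
            record()

    if since_commit:
        record()  # commit whatever is left after the last full interval
    return rates
```

If the docs-per-second figure falls steadily from one interval to the next, comparing those timestamps against the JVM's garbage collection activity would be one way to check the heap-size theory suggested above.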