From: Mahmoud Almokadem
Date: Tue, 14 Mar 2017 10:23:13 +0200
Subject: Re: Indexing CPU performance
To: solr-user@lucene.apache.org

I'm using VisualVM and Sematext to monitor my cluster. Below are
screenshots from each of them:

https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8
https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M

Thanks,
Mahmoud

On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem wrote:
> Thanks Erick,
>
> I think there is something missing: the rate I'm talking about is for a
> bulk upload -- a one-time indexing job, not ongoing indexing.
> My dataset is about 250 million documents and I need to index them into
> Solr.
>
> Thanks Shawn for your clarification.
>
> I think I got stuck on version 6.4.1. I'll upgrade my cluster and test
> again.
>
> Thanks for the help,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey wrote:
>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>> > When I start my bulk indexer program, the CPU utilization is 100% on
>> > each server, but the rate of the indexer is about 1,500 docs per
>> > second.
>> >
>> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>>
>> There are *MANY* factors that affect indexing rate. When you say that
>> CPU utilization is 100 percent, what operating system are you running,
>> and what tool are you using to see the CPU percentage? Within that
>> tool, where are you looking to see that usage level?
>>
>> On some operating systems with some reporting tools, a server with 8 CPU
>> cores can show up to 800 percent CPU usage, so 100 percent utilization
>> on the Solr process may not be full utilization of the server's
>> resources.
>> It also might be an indicator of the full system usage, if
>> you are looking in the right place.
>>
>> > The question: What is the best way to determine the bottleneck on
>> > Solr indexing rate?
>>
>> I have two likely candidates for you. The first one is a bug that
>> affects Solr 6.4.0 and 6.4.1 and is fixed in 6.4.2. If you don't have
>> one of those two versions, then this is not affecting you:
>>
>> https://issues.apache.org/jira/browse/SOLR-10130
>>
>> The other likely bottleneck, which could be a problem whether or not the
>> previous bug is present, is single-threaded indexing: every batch of
>> docs must wait for the previous batch to finish before it can begin, and
>> only one CPU gets utilized on the server side. Both Solr and SolrJ are
>> fully capable of handling several indexing threads at once, and that is
>> really the only way to achieve maximum indexing performance. If you
>> want multi-threaded (parallel) indexing, you must create the threads on
>> the client side, or run multiple indexing processes that each handle
>> part of the job. Multi-threaded code is not easy to write correctly.
>>
>> The fieldTypes and analysis that you have configured in your schema may
>> include classes that process very slowly, or may include so many filters
>> that the end result is slow performance. I am not familiar with the
>> performance of every class that Solr includes, so I would not be able to
>> look at a schema and tell you which entries are slow. As Erick
>> mentioned, processing 300+ fields could be one reason for slow indexing.
>>
>> If you are doing a commit operation for every batch, that will slow
>> things down even more. If you have autoSoftCommit configured with a
>> very low maxTime or maxDocs value, that can result in extremely
>> frequent commits that make indexing much slower.
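[The commit behavior Shawn describes is configured in solrconfig.xml's
updateHandler section. A minimal sketch for bulk indexing -- the interval
values here are illustrative placeholders, not recommendations:]

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commits flush and truncate the transaction log; doing them
       regularly is desirable, but never open a new searcher from one. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- every 60 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commits control document visibility; keep them as infrequent
       as the application can tolerate during a bulk load. -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- every 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```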
>> Although frequent autoCommit is very
>> much desirable for good operation (as long as openSearcher is set to
>> false), commits that open new searchers should be much less frequent.
>> The best option is to commit (with a new searcher) only *once* at the
>> end of the indexing run. If automatic soft commits are desired, make
>> them happen as infrequently as you can.
>>
>> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> Using CloudSolrClient will make single-threaded indexing fairly
>> efficient, by always sending documents to the correct shard leader.
>> FYI -- your 500-document batches are split into smaller batches (which
>> I think are only 10 documents) that are directed to the correct shard
>> leaders by CloudSolrClient. Indexing with multiple threads becomes even
>> more important with these smaller batches.
>>
>> Note that with SolrJ, you will need to tweak the HttpClient creation,
>> or you will likely find that each SolrJ client object can only utilize
>> two threads to each Solr server. The default per-route maximum
>> connection limit for HttpClient is 2, with a total connection limit
>> of 20.
>>
>> This code snippet shows how I create a Solr client that can handle many
>> threads (300 per route, 5000 total) and also has custom timeout
>> settings:
>>
>>   RequestConfig rc = RequestConfig.custom()
>>       .setConnectTimeout(15000)
>>       .setSocketTimeout(Const.SOCKET_TIMEOUT)
>>       .build();
>>   httpClient = HttpClients.custom()
>>       .setDefaultRequestConfig(rc)
>>       .setMaxConnPerRoute(300)
>>       .setMaxConnTotal(5000)
>>       .disableAutomaticRetries()
>>       .build();
>>   client = new HttpSolrClient(serverBaseUrl, httpClient);
>>
>> This is using HttpSolrClient, but CloudSolrClient can be built in a
>> similar manner. I am not yet using the new SolrJ Builder paradigm found
>> in 6.x; I should switch my code to that.
>>
>> Thanks,
>> Shawn
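[The client-side multi-threading Shawn recommends can be sketched with a
plain-JDK thread pool. Everything below is hypothetical scaffolding:
sendBatch() is a stand-in for a real SolrClient.add(batch) call, and the
THREADS/BATCH_SIZE/TOTAL_DOCS values are illustrative, not tuned numbers.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of client-side multi-threaded batch indexing: many batches are
// submitted to a fixed thread pool so several are in flight at once,
// instead of each batch waiting for the previous one to finish.
public class MultiThreadedIndexSketch {
    public static final AtomicLong indexed = new AtomicLong();
    static final int BATCH_SIZE = 500;       // docs per update request
    static final int THREADS = 8;            // tune to cores/shard count
    static final long TOTAL_DOCS = 250_000;  // small stand-in corpus

    // Stand-in for indexing one batch; a real implementation would call
    // solrClient.add(docs) here and handle retries on failure.
    static void sendBatch(int size) {
        indexed.addAndGet(size);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<?>> futures = new ArrayList<>();
        for (long sent = 0; sent < TOTAL_DOCS; sent += BATCH_SIZE) {
            final int size = (int) Math.min(BATCH_SIZE, TOTAL_DOCS - sent);
            futures.add(pool.submit(() -> sendBatch(size)));
        }
        // Waiting on each Future surfaces any exception a batch threw.
        for (Future<?> f : futures) {
            f.get();
        }
        pool.shutdown();
        System.out.println("Indexed " + indexed.get() + " docs");
    }
}
```

[Pairing a pool like this with the enlarged HttpClient connection limits
from Shawn's snippet is what actually lets the extra threads reach the
server; with the default per-route limit of 2, most of them would just
queue on the client.]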