From: Erick Erickson
Date: Wed, 22 Nov 2017 17:17:03 -0800
Subject: Re: Recovery Issue - Solr 6.6.1 and HDFS
To: solr-user

Hmm. This is quite possible. Any time things take "too long" it can be a
problem. For instance, if the leader sends docs to a replica and the request
times out, the leader throws the follower into "Leader Initiated Recovery".
The smoking gun here is that there are no errors on the follower, just the
notification that the leader put it into recovery. There are other variations
on the theme, but it all boils down to this: when communications fall apart,
replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger wrote:
> Hi Shawn - thank you for your reply. The index is 29.9 TBytes as reported by:
>
>   hadoop fs -du -s -h /solr6.6.0
>   29.9 T  89.9 T  /solr6.6.0
>
> The 89.9 TBytes is due to HDFS having 3x replication. There are about 1.1
> billion documents indexed, and we index about 2.5 million documents per day.
> Assuming an even distribution, each node is handling about 680 GBytes of
> index, so our cache covers about 1.4% of it. Perhaps 'relatively small block
> cache' was an understatement! This is why we split the largest collection in
> two: one collection holds data going back 30 days, and the other holds all
> the data. Most of our searches do not go back more than 30 days. The 30-day
> index is 2.6 TBytes total. I don't know how the HDFS block cache splits
> between collections, but the 30-day index performs acceptably for our
> specific application.
>
> If we wanted to cache 50% of the index, each of our 45 nodes would need a
> block cache of about 350 GBytes. I'm accepting offers of DIMMs!
>
> What I believe caused our 'recovery, fail, retry loop' was that one of our
> servers died. This caused HDFS to start replicating blocks across the
> cluster, which produced a lot of network activity. When this happened, I
> believe there was high network contention for specific nodes in the cluster;
> their network interfaces became pegged and requests for HDFS blocks timed
> out. When that happened, SolrCloud went into recovery, which caused even
> more network traffic. Fun stuff.
>
> -Joe
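A quick back-of-the-envelope check of the figures above, as a sketch only:
the index size, node count, and roughly 10 GB cache size come from the
messages in this thread, and the even-distribution assumption is Joe's.

# Back-of-the-envelope sizing from the numbers quoted in this thread.
# Assumes the index is spread evenly across all 45 Solr nodes, as Joe does.

index_tb = 29.9      # logical index size reported by hadoop fs -du
nodes = 45           # Solr nodes in the cluster
cache_gb = 10.0      # HDFS block cache per node ("about 10G")

per_node_gb = index_tb * 1024 / nodes    # ~680 GB of index per node
cache_fraction = cache_gb / per_node_gb  # ~0.015, Joe's "about 1.4%"
half_cache_gb = 0.5 * per_node_gb        # ~340 GB, close to Joe's ~350 GB figure

print(f"index per node      : {per_node_gb:.0f} GB")
print(f"cache / index ratio : {cache_fraction:.2%}")
print(f"RAM to cache 50%    : {half_cache_gb:.0f} GB per node")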
> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>> Right now, we have a relatively small block cache due to the requirement
>>> that the servers run other software. We tried to find the best balance
>>> between block cache size and RAM for programs, while still leaving enough
>>> for the local FS cache. This came out to 84 128M blocks - or about 10G
>>> for the cache per node (45 nodes total).
>>
>> How much data is being handled on a server with 10GB allocated for caching
>> HDFS data?
>>
>> The first message in this thread says the index size is 31TB, which is
>> *enormous*. You have also said that the index takes 93TB of disk space. If
>> the data is distributed somewhat evenly, then the answer to my question
>> would be that each of those 45 Solr servers is handling over 2TB of data.
>> A 10GB cache is *nothing* compared to 2TB.
>>
>> When index data that Solr needs for an operation is not in the cache and
>> Solr must actually wait for disk and/or network I/O, the resulting
>> performance usually isn't very good. In most cases you don't need enough
>> memory to fully cache the index data ... but less than half a percent is
>> not going to be enough.
>>
>> Thanks,
>> Shawn
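To put that ratio in concrete terms, here is a small illustrative sketch.
The 128 MB slab size is taken from the "84 128M blocks" configuration quoted
above and the 2 TB-per-node estimate is Shawn's; the helper function and the
target fractions are hypothetical, chosen only to show the scale of RAM
involved.

# Rough scale check: RAM and 128 MB block cache slabs needed to cache a given
# fraction of the index data one node serves. Purely illustrative arithmetic.
import math

SLAB_MB = 128  # one block cache slab, per the "84 128M blocks" setup above

def slabs_for(per_node_index_gb: float, fraction: float) -> int:
    """Slabs needed to cache `fraction` of a node's index data."""
    return math.ceil(per_node_index_gb * fraction * 1024 / SLAB_MB)

per_node_gb = 2 * 1024  # Shawn's estimate: over 2 TB of index data per node
for fraction in (0.005, 0.05, 0.50):
    n = slabs_for(per_node_gb, fraction)
    print(f"{fraction:6.1%} of 2 TB -> {n:5d} slabs (~{n * SLAB_MB / 1024:.0f} GB RAM)")

# The current 84 slabs (~10.5 GB) are roughly half a percent of 2 TB, which is
# the ratio Shawn is pointing at.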