lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Re: Page faults
Date Tue, 08 Jan 2019 00:12:51 GMT
Having some replicas at 90G and some at 18G is totally unexpected with
compositeId routing unless you're using "multi-level routing"; see:
https://lucidworks.com/2014/01/06/multi-level-composite-id-routing-solrcloud/

But let's be clear what we're talking about here. I'm talking about
specifically the size of the index on disk for any particular
_replica_, meaning the size in places similar to:
pdv201806_shard1_replica1/data/index. I've never seen as much
disparity as you're talking about so we should get to the bottom of
that.

Do you have massive numbers of deleted docs in any of those shards?
The admin screen for any particular replica will show this number.
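
As a rough way to read those numbers (a sketch only: the function name
and the figures are made up for illustration, and the only thing it
relies on is that deleted docs = maxDoc - numDocs as shown on the admin
screen):

```python
def deleted_fraction(num_docs: int, max_doc: int) -> float:
    """Fraction of documents in a replica that are deleted but not yet merged away."""
    if max_doc == 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


# Hypothetical numbers, NOT taken from the cluster in this thread.
print(f"{deleted_fraction(num_docs=6_000_000, max_doc=8_000_000):.0%}")  # 25%
```

A high fraction on the 90G replicas and a low one on the 18G replicas
would point at deletes rather than routing.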


On another note: your cache sizes are probably not part of the page
fault question, but on the surface they're badly misconfigured, at
least the filterCache and queryResultCache. Each entry in the
filterCache is a map entry: the key is roughly the query and the value
is bounded by maxDoc/8 bytes. So if you have, say, 8M documents, each
filterCache entry could theoretically take 1MB (give or take) and you
could have up to 20,000 of them, roughly 20GB in the worst case.
You're probably just getting lucky and either don't have very many
distinct fq clauses or are indexing often enough that the cache doesn't
grow for very long before being flushed.
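
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes every entry is a full bitset of maxDoc bits and ignores the key
and map overhead; the function name is only for illustration:

```python
def filter_cache_worst_case_bytes(max_doc: int, cache_size: int) -> int:
    """Upper bound on filterCache memory: one bit per document per entry."""
    bytes_per_entry = (max_doc + 7) // 8  # maxDoc/8, rounded up to whole bytes
    return bytes_per_entry * cache_size


# 8M docs is the example figure above; 20,000 is the configured size.
configured = filter_cache_worst_case_bytes(max_doc=8_000_000, cache_size=20_000)
default = filter_cache_worst_case_bytes(max_doc=8_000_000, cache_size=512)
print(f"size=20000: {configured / 1e9:.1f} GB")  # size=20000: 20.0 GB
print(f"size=512:   {default / 1e9:.1f} GB")     # size=512:   0.5 GB
```

Real usage will usually be well below this bound, since small result
sets are stored more compactly than a full bitset.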

Your queryResultCache takes up a lot less space, but still it's quite
large. It has two primary purposes:
1> paging. It generally stores a few integers (40 is common, maybe
several hundred, but who cares?) so hitting the next page won't have
to search again. This isn't terribly important in modern
installations.

2> being used in autowarming to pre-load parts of the index into memory.

I'd consider knocking each of these back to the defaults (512), except
I'd put the autowarm count at, say, 16 or so.

The documentCache is less clear-cut; the usual recommendation is to
size it at (number of simultaneous queries you expect) x (your average
rows parameter).
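
As a quick sanity check of that rule of thumb (a sketch only: the
concurrency and rows figures below are assumptions, not measurements
from this cluster):

```python
def document_cache_size(concurrent_queries: int, avg_rows: int) -> int:
    """Rule-of-thumb documentCache size: simultaneous queries x average rows."""
    return concurrent_queries * avg_rows


def implied_concurrency(cache_size: int, avg_rows: int) -> int:
    """How many simultaneous queries a given cache size corresponds to."""
    return cache_size // avg_rows


# Assumed: 50 simultaneous queries, rows=10 (both hypothetical).
print(document_cache_size(concurrent_queries=50, avg_rows=10))  # 500
# The configured size of 15000 with the same assumed rows=10:
print(implied_concurrency(cache_size=15_000, avg_rows=10))      # 1500
```

If the real concurrency is nowhere near the second number, the
configured size is larger than the rule of thumb calls for.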

Best,
Erick

On Mon, Jan 7, 2019 at 12:43 PM Branham, Jeremy (Experis)
<jbrcm@allstate.com> wrote:
>
> Thanks Erick/Chris for the information.
> The page faults are occurring on each node of the cluster.
> These are VMs running SOLR v7.2.1 on RHEL 7. CPUx8, 64GB mem.
>
> We’re collecting GC information and using a DynaTrace agent, so I’m not sure if /
> how much that contributes to the overhead.
>
> This cluster is used strictly for type-ahead/auto-complete functionality.
>
> I’ve also just noticed that the shards are imbalanced – 2 having about 90GB and 2
> having about 18GB of data.
> Having just joined this team, I’m not too familiar yet with the documents or queries/updates
> [and maybe not relevant to the page faults].
> Although, I did check the schema, and most of the fields are stored=true, docValues=true
>
> Solr v7.2.1
> OS: RHEL 7
>
> Collection Configuration -
> Shard count: 4
> configName: pdv201806
> replicationFactor: 2
> maxShardsPerNode: 1
> router: compositeId
> autoAddReplicas: false
>
> Cache configuration –
> filterCache class="solr.FastLRUCache"
>                  size="20000"
>                  initialSize="5000"
>                  autowarmCount="10"
> queryResultCache class="solr.LRUCache"
>                       size="5000"
>                       initialSize="1000"
>                       autowarmCount="0"
> documentCache class="solr.LRUCache"
>                    size="15000"
>                    initialSize="512"
>
> enableLazyFieldLoading=true
>
>
> JVM Information/Configuration –
> java.runtime.version: 1.8.0_162-b12
>
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+ScavengeBeforeFullGC
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseGCLogFileRotation
> -XX:+UseParNewGC
> -XX:-OmitStackTraceInFastThrow
> -XX:CMSInitiatingOccupancyFraction=70
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:GCLogFileSize=20M
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:ParallelGCThreads=8
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> -Xms16g
> -Xmx32g
> -Xss256k
> -verbose:gc
>
>
>
> Jeremy Branham
> jbrcm@allstate.com
>
> On 1/7/19, 1:16 PM, "Christopher Schultz" <chris@christopherschultz.net> wrote:
>
>     -----BEGIN PGP SIGNED MESSAGE-----
>     Hash: SHA256
>
>     Erick,
>
>     On 1/7/19 11:52, Erick Erickson wrote:
>     > Images do not come through, so we don't see what you're seeing.
>     >
>     > That said, I'd expect page faults to happen:
>     >
>     > 1> when indexing. Besides what you'd expect (new segments written
>     > to disk), there's segment merging going on in the background which
>     > has to read segments from disk in order to merge.
>     >
>     > 2> when querying, any fields returned as part of a doc that has
>     > stored=true docValues=false will require a disk access to get the
>     > stored data.
>
>     A page fault is not necessarily a disk access. It almost always *is*,
>     but it's not because the application is calling fopen(). It's because
>     the OS is performing a memory operation which often results in a dip
>     into virtual memory.
>
>     Jeremy, are these page-faults occurring on all the machines in your
>     cluster, or only some? What is the hardware configuration of each
>     machine (specifically, memory)? What are your JVM settings for your
>     Solr instances? Is anything else running on these nodes?
>
>     It would help to understand what's happening on your servers. "I'm
>     seeing page faults" doesn't really help us help you.
>
>     Thanks,
>     - -chris
>
>     > On Mon, Jan 7, 2019 at 8:35 AM Branham, Jeremy (Experis)
>     > <jbrcm@allstate.com> wrote:
>     >>
>     >> Does anyone know if it is typical behavior for a SOLR cluster to
>     >> have lots of page faults (50-100 per second) under heavy load?
>     >>
>     >> We are performing load testing on a cluster with 8 nodes, and my
>     >> performance engineer has brought this information to attention.
>     >>
>     >> I don’t know enough about memory management to say it is normal
>     >> or not.
>     >>
>     >>
>     >>
>     >> The performance doesn’t appear to be suffering, but I don’t want
>     >> to overlook a potential hazard.
>     >>
>     >>
>     >>
>     >> Thanks!
>     >>
>     >>
>     >>
>     >>
>     >>
>     >>
>     >>
>     >>
>     >>
>     >> Jeremy Branham
>     >>
>     >> jbrcm@allstate.com
>     >>
>     >> Allstate Insurance Company | UCV Technology Services |
>     >> Information Services Group
>     >>
>     >>
>     >
>     -----BEGIN PGP SIGNATURE-----
>     Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
>
>     iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlwzpYsACgkQHPApP6U8
>     pFgSHxAAgaXV5wkwV7Ru2QyhnvxUnIWY4Iom0IdZYrDuZBDxmFx9wzE7P33zmR3E
>     nrgZCqBtAMdxRSwG9BfyKircChZBssqtQpskw6mgJyzRyGvKVJjJ68r0vEio3Kjo
>     HjaJczBFWvdOKm42W1Li4SeymGyYXu/jmdkWLcIbEM4BgDQLf1HhSEphDeZzP4ST
>     GNDBrIA6XkUJwE1r58FUuj9l0XSKUAPLOPNAx1qGiAn4fKdbysVHvLcvJvJzC0pC
>     1kx000r+Mqdd61EzhM20ZDIvg2F3vgFgGCUtB31hIi18bfD8whoAafL2FSMkIccD
>     H7X09PpUK8qPM/oQgqCKTtfmVR3M2pi3CSxLFSQ1/QucnF2wxWknOOWUH1TMU/L2
>     KUQHS6GwuTk+R/8PxdBRsZI8ON3MVb690ECV4QplYlkrtygXrLRg2YOgifgAXsKL
>     5Kg2mrpKoxfNnDWaRksy4GUDTsSxbkd1rpnHJEZ8le26HXvz9wrug/FtNPzqP8S9
>     dan2gkgiSqOM9GKlKkA72ROyQDhZa5YiXfGNdRrmfkiQzlDBEcGpD8pg1GwskRJl
>     yidTBfvRSyCHsI5NBGf65nTG+2WfUnr8wClHVK5QQGVilHBn6KzeHeDTL9ZpHvcn
>     GhkDMvc+9f8DR7Hr/mTiGjYIAvJZYiIJeYUoe0Bl2BHmGDv0tEk=
>     =OpZo
>     -----END PGP SIGNATURE-----
>
>
