lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Realtime get not always returning existing data
Date Mon, 01 Oct 2018 01:15:35 GMT
57 million queries later, with constant indexing going on and 9 dummy
collections in the mix and the main collection I'm querying having 2
shards, 2 replicas each, I have no errors.

So unless the code doesn't look like it exercises any similar path,
I'm not sure what more I can test. "It works on my machine" ;)

Here's my querying code. Does it look like what you're seeing?

      while (Main.allStop.get() == false) {
        try (SolrClient client = new HttpSolrClient.Builder()
//("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
            .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {

          //SolrQuery query = new SolrQuery();
          String lower = Integer.toString(rand.nextInt(1_000_000));
          SolrDocument rsp = client.getById(lower);
          if (rsp == null) {
            System.out.println("Got a null response!");
            Main.allStop.set(true);
          }

          rsp = client.getById(lower);

          if (rsp.get("id").equals(lower) == false) {
            System.out.println("Got an invalid response, looking for "
                + lower + " got: " + rsp.get("id"));
            Main.allStop.set(true);
          }
          long queries = Main.eoeCounter.incrementAndGet();
          if ((queries % 100_000) == 0) {
            long seconds = (System.currentTimeMillis() - Main.start) / 1000;
            System.out.println("Query count: " +
                numFormatter.format(queries) + ", rate is " +
                numFormatter.format(queries / seconds) + " QPS");
          }
        } catch (Exception cle) {
          cle.printStackTrace();
          Main.allStop.set(true);
        }
      }
  }

On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
<erickerickson@gmail.com> wrote:
>
> Steve:
>
> bq.  Basically, one core had data in it that should belong to another
> core. Here's my question about this: Is it possible that two requests to the
> /get API coming in at the same time would get confused and either both get
> the same result or the results get inverted?
>
> Well, that shouldn't be happening, these are all supposed to be thread-safe
> calls.... All things are possible of course ;)
>
> If two replicas of the same shard have different documents, that could account
> for what you're seeing, meanwhile begging the question of why that is the case
> since it should never be true for a quiescent index. Technically there _are_
> conditions where this is true on a very temporary basis: commits on the leader
> and follower can trigger at different wall-clock times. Say your soft commit
> (or hard-commit-with-opensearcher-true) is 10 seconds. It should never be the
> case that s1r1 and s1r2 are out of sync 10 seconds after the last update was
> sent. This doesn't seem likely from what you've described though...
>
> Hmmmm. I guess one other thing I can set up is to have a bunch of dummy
> collections lying around. Currently I have only the active one, and if
> there's some code path whereby the RTG request goes to a replica of a
> different collection, my test setup wouldn't reproduce it.
>
> Currently, I'm running a 2-shard, 1-replica setup, so if there's some way
> that the replicas get out of sync, that wouldn't show either.
>
> So I'm starting another run with these changes:
> - opening a new connection each query
> - switched so the collection I'm querying is 2x2
> - added some dummy collections that are empty
>
> One nit: while "core" is exactly correct, when we talk about a core
> that's part of a collection, we try to use "replica" to be clear we're
> talking about a core with some added characteristics, i.e. we're in
> SolrCloud-land. No big deal of course....
>
> Best,
> Erick
> On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apache@elyograg.org> wrote:
> >
> > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > @Shawn
> > > We're running two instances on one machine for two reasons:
> > > 1. The box has plenty of resources (48 cores / 256GB RAM) and since I was
> > > reading that it's not recommended to use more than 31GB of heap in SOLR we
> > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > instance was a good idea.
> >
> > Do you know that these Solr instances actually DO need 31 GB of heap, or
> > are you following advice from somewhere, saying "use one quarter of your
> > memory as the heap size"?  That advice is not in the Solr documentation,
> > and never will be.  Figuring out the right heap size requires
> > experimentation.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> >
> > How big (on disk) are each of these nine cores, and how many documents
> > are in each one?  Which of them is in each Solr instance?  With that
> > information, we can make a *guess* about how big your heap should be.
> > Figuring out whether the guess is correct generally requires careful
> > analysis of a GC log.
> >
> > > 2. We're in testing phase so we wanted a SOLR cloud configuration; we will
> > > most likely have a much bigger deployment once going to production. In prod
> > > right now, we currently run a six-machine Riak cluster. Riak is a
> > > key/value document store and has SOLR built-in for search, but we are trying
> > > to push the key/value aspect of Riak inside SOLR. That way we would have
> > > one less piece to worry about in our system.
> >
> > Solr is not a database.  It is not intended to be a data repository.
> > All of its optimizations (most of which are actually in Lucene) are
> > geared towards search.  While technically it can be a key-value store,
> > that is not what it was MADE for.  Software actually designed for that
> > role is going to be much better than Solr as a key-value store.
> >
> > > When I say null document, I mean the /get API returns: {doc: null}
> > >
> > > The problem is definitely not always there. We also have long periods of
> > > time (a few hours) where we have no problems. I'm just extremely hesitant
> > > about retrying when I get a null document because in some cases, getting a
> > > null document is a valid outcome. Our caching layer relies heavily on this,
> > > for example. If I were to retry every null I'd pay a big penalty in
> > > performance.
> >
> > I've just done a little test with the 7.5.0 techproducts example.  It
> > looks like returning doc:null actually is how the RTG handler says it
> > didn't find the document.  This seems very wrong to me, but I didn't
> > design it, and that response needs SOME kind of format.
> >
> > Have you done any testing to see whether the standard searching handler
> > (typically /select, but many other URL paths are possible) returns
> > results when RTG doesn't?  Do you know for these failures whether the
> > document has been committed or not?
> >
> > > As for your last comment, part of our testing phase is also testing the
> > > limits. Our framework has auto-scaling built-in so if we have a burst of
> > > requests, the system will automatically spin up more clients. We're pushing
> > > 10% of our production system to that Test server to see how it will handle
> > > it.
> >
> > To spin up another replica, Solr must copy all its index data from the
> > leader replica.  Not only can this take a long time if the index is big,
> > but it will put a lot of extra I/O load on the machine(s) with the
> > leader roles.  So performance will actually be WORSE before it gets
> > better when you spin up another replica, and if the index is big, that
> > condition will persist for quite a while.  Copying the index data will
> > be constrained by the speed of your network and by the speed of your
> > disks.  Often the disks are slower than the network, but that is not
> > always the case.
> >
> > Thanks,
> > Shawn
> >
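
The check discussed in this thread (look up a document by id, treat a null
document as "not found", and flag any response whose id differs from the one
requested) can be sketched without a running Solr. This is only an
illustration under stated assumptions: RtgCheck, Outcome, classify and run are
hypothetical names, and the lookup function is a stand-in for
client.getById(id), not SolrJ itself.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical harness: concurrent get-by-id lookups with id verification. */
final class RtgCheck {
  enum Outcome { OK, MISSING, WRONG_DOC }

  /** A null document means "not found"; otherwise the returned id must match the request. */
  static Outcome classify(String requestedId, Map<String, Object> doc) {
    if (doc == null) {
      return Outcome.MISSING;
    }
    return requestedId.equals(doc.get("id")) ? Outcome.OK : Outcome.WRONG_DOC;
  }

  /**
   * Starts `threads` workers, each issuing up to `perThread` lookups of random
   * ids in [0, maxId). All workers stop after the first MISSING or WRONG_DOC.
   * Returns the number of lookups that came back with the requested id.
   */
  static long run(Function<String, Map<String, Object>> lookup,
                  int threads, int perThread, int maxId) {
    AtomicBoolean allStop = new AtomicBoolean(false);
    AtomicLong okCount = new AtomicLong();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      final long seed = t; // fixed per-thread seed keeps runs repeatable
      workers[t] = new Thread(() -> {
        Random rand = new Random(seed);
        for (int i = 0; i < perThread && !allStop.get(); i++) {
          String id = Integer.toString(rand.nextInt(maxId));
          Outcome outcome = classify(id, lookup.apply(id));
          if (outcome == Outcome.OK) {
            okCount.incrementAndGet();
          } else {
            System.out.println("Lookup of " + id + " failed: " + outcome);
            allStop.set(true);
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      try {
        w.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return okCount.get();
  }
}
```

To point this at a real cluster, the lookup would wrap client.getById(id) and
convert its checked exceptions, which is left out here. Keeping MISSING
separate from WRONG_DOC matters for the case raised above, where a null
document can be a valid outcome while a document with the wrong id never is.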
