activemq-dev mailing list archives

From yw yw <wy96...@gmail.com>
Subject Re: Improve paging performance when there are lots of subscribers
Date Wed, 17 Jul 2019 02:40:38 GMT
I did consider the case where all pages are instantiated as PageReaders.
That's really a problem.

The pro of the PR is that every page is read only once to build a
PageReader, which is then shared by all the queues. The con is that many
PageReaders may be instantiated if consumers make slow or no progress in
several queues while progressing fast in others (I think that's the only
cause of the corner case, right?). This means too many open files and too
much memory.

The pro of the duplicated PageReader is that the number of PageReaders at
any moment is bounded by the number of queues.
The con is that each queue has to read the page once to build its own
PageReader if the page cache is evicted. I'm not sure how much this will
affect performance.
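
To make that trade-off concrete, the per-queue lifecycle would look
roughly like the sketch below. This is illustrative Java only, with
invented names, not the actual Artemis classes: each subscription keeps
at most one open reader and disposes of it when it moves to the next
page.

    import java.util.function.LongFunction;

    // Illustrative sketch only (invented names, not Artemis code): each
    // subscription keeps at most one open reader, so the number of open
    // readers is bounded by the number of queues, not the number of pages.
    final class SubscriptionPageState implements AutoCloseable {
        interface PageReaderHandle extends AutoCloseable { }

        private long currentPageId = -1;
        private PageReaderHandle currentReader;

        // Called when the subscription needs data from pageId and the
        // shared soft cache has been evicted.
        PageReaderHandle readerFor(long pageId,
                LongFunction<PageReaderHandle> openPage) throws Exception {
            if (pageId != currentPageId) {
                close();                                // dispose the reader for the old page
                currentReader = openPage.apply(pageId); // one page open/read per queue
                currentPageId = pageId;
            }
            return currentReader;
        }

        @Override
        public void close() throws Exception {
            if (currentReader != null) {
                currentReader.close();
                currentReader = null;
            }
        }
    }

So the steady-state cost is bounded by the number of queues, and the
price is the extra page read each time a queue crosses a page boundary
while the soft cache is evicted.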

The point is that we need the number of messages in the page, which is
used by PageCursorInfo and PageSubscription::internalGetNext, so we have
to read the page file anyway. How about we cache only the number of
messages in each page instead of the PageReader, and build a PageReader
in each queue? When we hit the corner case, only <long, int> pairs stay
permanently in memory, which I assume is smaller than the complete
PageCursorInfo data. This way we get the performance gain at a small
price.
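
In code, what I'm proposing is roughly the sketch below. Again this is
illustrative Java only, with invented names, not the actual Artemis API:
keep just a pageId -> messageCount map and let every queue build its own
short-lived PageReader from the page file.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Function;

    // Illustrative sketch only: cache just the number of messages per page.
    // Even in the corner case the permanent footprint is one <long, int>
    // entry per page instead of a whole PageReader / full PageCursorInfo.
    final class PageMessageCountCache {
        private final ConcurrentMap<Long, Integer> counts = new ConcurrentHashMap<>();
        private final Function<Long, Integer> countByScanningPage;

        PageMessageCountCache(Function<Long, Integer> countByScanningPage) {
            // countByScanningPage reads the page file once and counts records
            this.countByScanningPage = countByScanningPage;
        }

        int messageCount(long pageId) {
            // The page is scanned at most once; afterwards every queue that
            // needs the count for its cursor bookkeeping hits the map.
            return counts.computeIfAbsent(pageId, countByScanningPage);
        }

        void pageDeleted(long pageId) {
            counts.remove(pageId);  // drop the entry once the page is deleted
        }
    }

Payload-wise that is a long plus an int per page (boxing and map overhead
on top), which should stay well below keeping full PageCursorInfo data
alive.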

Clebert Suconic <clebert.suconic@gmail.com> wrote on Tue, Jul 16, 2019 at 10:18 PM:

> I just came back after a well-deserved 2-week break and I was looking
> at this, and I can say it's well done. Nice job! It's a lot simpler!
>
> However, there's one question now, which is probably a further
> improvement: shouldn't the pageReader be instantiated at the
> PageSubscription?
>
> That means, if there's no page cache because the page has been
> evicted, the Subscription would then create a new Page/PageReader
> pair and dispose of it when it's done (meaning, moved to a different
> page).
>
> As you are solving the case with many subscriptions, wouldn't you hit
> a corner case where all Pages are instantiated as PageReaders?
>
>
> I feel like it would be better to eventually duplicate a PageReader
> and close it when done.
>
>
> Or did you already consider that possibility and still think it's best
> to keep this cache of PageReaders?
>
> On Sat, Jul 13, 2019 at 12:15 AM <michael.andre.pearce@me.com.invalid>
> wrote:
> >
> > Could a squashed PR be sent?
> >
> > On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw" <wy96fyw@gmail.com>
> > wrote:
> >
> > Hi,
> >
> > I have finished the work on the new implementation (tests and
> > configuration not yet done) as suggested by Franz.
> >
> > I put the fileOffset in the PagePosition and added a new class,
> > PageReader, which is a wrapper around the page and implements the
> > PageCache interface. The PageReader class is used to read the page
> > file if the cache has been evicted. For details, see
> > https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987
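
(A side note for anyone following along: the shape of that idea is
roughly the sketch below. This is illustrative Java only, with invented
names and a made-up record layout; it is a simplification, not the code
in that commit.)

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Rough illustration only (invented names, not the commit's code): the
    // position/reference carries a file offset, so a single message can be
    // fetched from the page file without reading the whole page back in.
    final class SinglePageFileReader implements AutoCloseable {
        private final RandomAccessFile file;

        SinglePageFileReader(String pageFilePath) throws IOException {
            this.file = new RandomAccessFile(pageFilePath, "r");
        }

        // Assumes a 4-byte length prefix before each encoded message record.
        byte[] readMessageAt(long fileOffset) throws IOException {
            file.seek(fileOffset);
            int length = file.readInt();
            byte[] body = new byte[length];
            file.readFully(body);
            return body;
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

The key point is that the offset stored with the position lets a queue
fetch exactly one message instead of loading the whole page.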
> >
> > I ran some tests; results are below:
> > 1. Running with a 51MB page size and 1 page cache in the case of 100
> > multicast queues.
> > https://filebin.net/wnyan7d2n1qgfsvg
> > 2. Running with a 5MB page size and 100 page caches in the case of 100
> > multicast queues.
> > https://filebin.net/re0989vz7ib1c5mc
> > 3. Running with a 51MB page size and 1 page cache in the case of 1 queue.
> > https://filebin.net/3qndct7f11qckrus
> >
> > The results look good, similar to the implementation in the PR. Most
> > importantly, the index cache data is gone, so there is no worry about
> > extra overhead :)
> >
> > yw yw wrote on Thu, Jul 4, 2019 at 5:38 PM:
> >
> > > Hi, Michael
> > >
> > > Thanks for the advice. For the current PR, we can use two arrays,
> > > one recording the message number and the other the corresponding
> > > offset, to optimize memory usage. For Franz's approach, we will also
> > > work on an early prototype implementation. After that, we will run
> > > some basic tests in different scenarios.
> > >
> > >  wrote on Tue, Jul 2, 2019 at 7:08 AM:
> > >
> > >> The point, though, is that an extra index cache layer is needed. The
> > >> overhead of that means the total paged capacity will be more limited,
> > >> as that overhead isn't just an extra int per reference. E.g. in the PR
> > >> the current impl isn't very memory optimised; could an int array be
> > >> used, or at worst an open primitive int-int hashmap?
> > >>
> > >> This is why I really prefer Franz's approach.
> > >>
> > >> Also, whatever we do, we need the new behaviour to be configurable, so
> > >> that a use case we haven't thought about won't be impacted. E.g. the
> > >> change should not be a surprise; it should be something you toggle on.
> > >>
> > >> On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw"  wrote:
> > >>
> > >> Hi,
> > >> We ran a test against your configuration: 5Mb10010Mb.
> > >> The current code: 7000 msg/s sent and 18000 msg/s received.
> > >> PR code: 8200 msg/s sent and 16000 msg/s received.
> > >> Like you said, performance improves for the current code by using a
> > >> much smaller page file and holding many more of them.
> > >>
> > >> I'm not sure what implications using a smaller page file would have:
> > >> producer performance may drop since switching files is more frequent,
> > >> and the number of file handles would increase?
> > >>
> > >> While our consumer in the test just echoes, with nothing to do after
> > >> receiving a message, a consumer in the real world may be busy doing
> > >> business. This means references and page caches reside in memory
> > >> longer and may be evicted more easily when producers are sending all
> > >> the time.
> > >>
> > >> Since we don't know how many subscribers there are, it is not a
> > >> scalable approach. We can't reduce the page file size without limit
> > >> to fit the number of subscribers. The code should accommodate all
> > >> kinds of configurations. We adjust configuration for trade-offs as
> > >> needed, not as a workaround IMO.
> > >> In our company, ~200 queues (60% of which belong to a few addresses)
> > >> are deployed in the broker. We can't set them all to e.g. 100 page
> > >> caches (too much memory), nor set different sizes per address pattern
> > >> (hard for operations). In the multi-tenant cluster we prefer
> > >> availability, and to avoid exhausting memory we set pageSize to 30MB,
> > >> max cache size to 1 and max size to 31MB. It's running well in one of
> > >> our clusters now :)
> > >>
> > >>  wrote on Sat, Jun 29, 2019 at 2:35 AM:
> > >>
> > >> > I think some of that is down to configuration. You could configure
> > >> > paging to have much smaller page files but hold many more of them.
> > >> > That way the reference sizes will be far smaller and pages will
> > >> > drop in and out less. E.g. if you expect 100 to be read, make it
> > >> > 100, but make the page sizes smaller so the overhead is far less.
> > >> >
> > >> > On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw"  wrote:
> > >> >
> > >> > "At last for one message we maybe read twice: first we read page and
> > >> create
> > >> > pagereference; second we requery message after its reference is
> > >> removed.  "
> > >> >
> > >> > I just realized it was wrong. One message maybe read many times.
> Think
> > >> of
> > >> > this: When #1~#2000 msg is delivered, need to depage #2001-#4000
> msg,
> > >> > reading the whole page; When #2001~#4000 msg is deliverd, need to
> depage
> > >> > #4001~#6000 msg, reading page again, etc.
> > >> >
> > >> > One message maybe read three times if we don't depage until all
> messages
> > >> > are delivered. For example, we have 3 pages p1, p2,p3 and message
m1
> > >> which
> > >> > is at top part of the p2. In our case(max-size-bytes=51MB, a little
> > >> bigger
> > >> > than page size), first depage round reads bottom half of p1 and top
> > >> part of
> > >> > p2; second depage round reads bottom half of p2 and top part of p3.
> > >> > Therforce p2 is read twice and m1 maybe read three times if
> requeryed.
> > >> >
> > >> > Be honest, i don't know how to fix the problem above with the
> > >> > decrentralized approch. The point is not how we rely on os cache,
> it's
> > >> that
> > >> > we do it the wrong way, shouldn't read whole page(50MB) just for
> ~2000
> > >> > messages. Also there is no need to save 51MB PagedReferenceImpl in
> > >> memory.
> > >> > When 100 queues occupy 5100MB memory, the message references are
> very
> > >> > likely to be removed.
> > >> >
> > >> >
> > >> > Francesco Nigro wrote on Thu, Jun 27, 2019 at 5:05 PM:
> > >> >
> > >> > > > which means the offset info is 100 times larger than the
> > >> > > > shared page index cache.
> > >> > >
> > >> > > I would check with the JOL plugin for exact numbers.
> > >> > > I see with it that we would have an increase of 4 bytes for each
> > >> > > PagedReferenceImpl, totally decentralized vs a centralized
> > >> > > approach (the cache). In the economy of a fully loaded broker, if
> > >> > > we care about scaling we need to understand whether the memory
> > >> > > tradeoff is important enough to choose one of the 2 approaches.
> > >> > > My point is that paging could be made totally based on the OS
> > >> > > page cache if GC would get in the middle, deleting any previous
> > >> > > mechanism of page caching... simplifying the process as it is.
> > >> > > Using a 2-level cache with such a centralized approach can work,
> > >> > > but it will add a level of complexity that IMO could be saved...
> > >> > > What do you think would be the benefit of the decentralized
> > >> > > solution compared with the one proposed in the PR?
> > >> > >
> > >> > >
> > >> > > On Thu, Jun 27, 2019 at 10:41 AM, yw yw wrote:
> > >> > >
> > >> > > > Sorry, I missed the PageReference part.
> > >> > > >
> > >> > > > The lifecycle of a PageReference is: depage (in
> > >> > > > intermediateMessageReferences) -> deliver (in
> > >> > > > messageReferences) -> waiting for ack (in deliveringRefs) ->
> > >> > > > removed. Every queue creates its own PageReference, which means
> > >> > > > the offset info is 100 times larger than the shared page index
> > >> > > > cache.
> > >> > > > If we keep 51MB of PageReference size in memory, as I said in
> > >> > > > the PR, "For multiple subscribers to the same address, just one
> > >> > > > executor is responsible for delivering, which means at any
> > >> > > > moment only one queue is delivering. Thus a queue may be
> > >> > > > stalled for a long time. We get queueMemorySize messages into
> > >> > > > memory, and when we deliver them after a long time, we probably
> > >> > > > need to query the message and read the page file again." In the
> > >> > > > end, one message may be read twice: first we read the page and
> > >> > > > create the page reference; second we requery the message after
> > >> > > > its reference is removed.
> > >> > > >
> > >> > > > With the shared page index cache design, each message needs to
> > >> > > > be read from file only once.
> > >> > > >
> > >> > > > Michael Pearce wrote on Thu, Jun 27, 2019 at 3:03 PM:
> > >> > > >
> > >> > > > > Hi
> > >> > > > >
> > >> > > > > First of all, I think this is an excellent effort and could
> > >> > > > > be a potentially massive positive change.
> > >> > > > >
> > >> > > > > Before making any change on such a scale, I do think we need
> > >> > > > > to ensure we have sufficient benchmarks on a number of
> > >> > > > > scenarios, not just one use case, and the benchmark tool used
> > >> > > > > needs to be openly available so that others can verify the
> > >> > > > > measurements and check them on their setups.
> > >> > > > >
> > >> > > > > Some additional scenarios I would want/need covered are:
> > >> > > > >
> > >> > > > > PageCache set to 5, and all consumers keeping up, but lagging
> > >> > > > > enough to be reading from the same 1st page cache; latency
> > >> > > > > and throughput need to be measured for all.
> > >> > > > > PageCache set to 5, and all consumers but one keeping up but
> > >> > > > > lagging enough to be reading from the same 1st page cache,
> > >> > > > > while the one is falling off the end, causing page cache
> > >> > > > > swapping; measure latency and throughput of those keeping up
> > >> > > > > in the 1st page cache, ignoring the one.
> > >> > > > >
> > >> > > > > Regarding the solution, some alternative approaches to
> > >> > > > > discuss:
> > >> > > > >
> > >> > > > > In your scenario, if I understand correctly, each subscriber
> > >> > > > > effectively has its own queue (1-to-1 mapping), not sharing.
> > >> > > > > You mention Kafka and say multiple consumers don't read
> > >> > > > > serially on the address, and this is true, but per-queue
> > >> > > > > processing through messages (dispatch) is still serial, even
> > >> > > > > with multiple shared consumers on a queue.
> > >> > > > >
> > >> > > > > What about keeping the existing mechanism but having a queue
> > >> > > > > hold a reference to the page cache that the queue is
> > >> > > > > currently on, kept from GC (e.g. not soft)? That way the page
> > >> > > > > cache isn't being swapped around when you have queues (in
> > >> > > > > your case subscribers) swapping page caches back and forth,
> > >> > > > > avoiding the constant re-read issue.
> > >> > > > >
> > >> > > > > Also I think Franz had an excellent idea: do away with the
> > >> > > > > page cache in its current form entirely, ensure the offset is
> > >> > > > > kept with the reference, and rely on OS caching keeping hot
> > >> > > > > blocks/data.
> > >> > > > >
> > >> > > > > Best
> > >> > > > > Michael
> > >> > > > >
> > >> > > > >
> > >> > > > > On Thu, 27 Jun 2019 at 05:13, yw yw  wrote:
> > >> > > > >
> > >> > > > > > Hi, folks
> > >> > > > > >
> > >> > > > > > This is the discussion about "ARTEMIS-2399 Fix performance
> > >> > > > > > degradation when there are a lot of subscribers".
> > >> > > > > >
> > >> > > > > > First, apologies that I didn't clarify our thoughts earlier.
> > >> > > > > >
> > >> > > > > > As noted in the Environment section, page-max-cache-size
> > >> > > > > > is set to 1, meaning at most one page is allowed in the
> > >> > > > > > softValueCache. We have tested with the default
> > >> > > > > > page-max-cache-size, which is 5; it just takes some time to
> > >> > > > > > see the performance degradation, since at the start the
> > >> > > > > > cursor positions of the 100 subscribers are similar and all
> > >> > > > > > message reads hit the softValueCache. But after some time
> > >> > > > > > the cursor positions diverge. Once these positions are
> > >> > > > > > spread over more than 5 pages, some pages are read back and
> > >> > > > > > forth. This can be seen in the trace log "adding pageCache
> > >> > > > > > pageNr=xxx into cursor = test-topic" in
> > >> > > > > > PageCursorProviderImpl, where some pages are read many
> > >> > > > > > times for the same subscriber. From that point on,
> > >> > > > > > performance starts to degrade. So we set
> > >> > > > > > page-max-cache-size to 1 here just to make the test run
> > >> > > > > > faster; it doesn't change the final result.
> > >> > > > > >
> > >> > > > > > The softValueCache is evicted if memory is really low, or
> > >> > > > > > when the map size reaches its capacity (default 5). In most
> > >> > > > > > cases the subscribers are doing tailing reads, which are
> > >> > > > > > served by the softValueCache (no need to touch disk), so we
> > >> > > > > > need to keep it. But when some subscribers fall behind,
> > >> > > > > > they need to read pages that are not in the softValueCache.
> > >> > > > > > After looking at the code, we found that one depage round
> > >> > > > > > follows at most MAX_SCHEDULED_RUNNERS deliver rounds in
> > >> > > > > > most situations, which is to say at most
> > >> > > > > > MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages are
> > >> > > > > > depaged next. If you set the QueueImpl logger to debug
> > >> > > > > > level, you will see logs like "Queue Memory Size after
> > >> > > > > > depage on queue=sub4 is 53478769 with maxSize = 52428800.
> > >> > > > > > Depaged 68 messages, pendingDelivery=1002,
> > >> > > > > > intermediateMessageReferences=23162, queueDelivering=0".
> > >> > > > > > So in order to depage fewer than 2000 messages, each
> > >> > > > > > subscriber has to read a whole page, which is unnecessary
> > >> > > > > > and wasteful. In our test, where one page (50MB) contains
> > >> > > > > > ~40000 messages, one subscriber may read the page
> > >> > > > > > 40000/2000 = 20 times to finish delivering it if the
> > >> > > > > > softValueCache is evicted. This drastically slows down the
> > >> > > > > > process and burdens the disk. So we add PageIndexCacheImpl
> > >> > > > > > and read one message at a time rather than all messages of
> > >> > > > > > a page. This way, for each subscriber, each page is read
> > >> > > > > > only once to finish delivering it.
> > >> > > > > >
> > >> > > > > > Having said that, the softValueCache is used for tailing
> > >> > > > > > reads. If it is evicted, it won't be reloaded, to prevent
> > >> > > > > > the issue illustrated above. Instead the pageIndexCache is
> > >> > > > > > used.
> > >> > > > > >
> > >> > > > > > Regarding implementation details, we noted that before
> > >> > > > > > delivering a page, a pageCursorInfo is constructed, which
> > >> > > > > > needs to read the whole page. We can take this opportunity
> > >> > > > > > to construct the pageIndexCache; it's very simple to code.
> > >> > > > > > We also thought about building an offset index file, and
> > >> > > > > > some concerns stemmed from the following:
> > >> > > > > >
> > >> > > > > >    1. When to write and sync the index file? Would it have
> > >> > > > > >    performance implications?
> > >> > > > > >    2. If we have an index file, we can construct the
> > >> > > > > >    pageCursorInfo through it (no need to read the page like
> > >> > > > > >    before), but we need to write the total message number
> > >> > > > > >    into it first. It seems a little weird putting this into
> > >> > > > > >    the index file.
> > >> > > > > >    3. If experiencing a hard crash, a recovery mechanism
> > >> > > > > >    would be needed to recover the page and page index
> > >> > > > > >    files, e.g. truncating them to the valid size. So how do
> > >> > > > > >    we know which files need to be sanity checked?
> > >> > > > > >    4. A variant binary search algorithm may be needed, see
> > >> > > > > >    https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
> > >> > > > > >    5. Unlike Kafka, where the user fetches lots of messages
> > >> > > > > >    at once and the broker just needs to look up the start
> > >> > > > > >    offset in the index file once, Artemis delivers messages
> > >> > > > > >    one by one, which means we have to look up the index
> > >> > > > > >    every time we deliver a message. Although the index file
> > >> > > > > >    is probably in the page cache, there are still chances
> > >> > > > > >    we miss the cache.
> > >> > > > > >    6. Compatibility with old files.
> > >> > > > > >
> > >> > > > > > To sum up, Kafka uses an mmapped index file and we use an
> > >> > > > > > index cache. Both are designed to find the physical file
> > >> > > > > > position from an offset (Kafka) or a message number
> > >> > > > > > (Artemis). We prefer the index cache because it's easy to
> > >> > > > > > understand and maintain.
> > >> > > > > >
> > >> > > > > > We also tested the single-subscriber case with the same
> > >> > > > > > setup.
> > >> > > > > > The original:
> > >> > > > > > consumer tps (11000 msg/s) and latency:
> > >> > > > > > [image: orig_single_subscriber.png]
> > >> > > > > > producer tps (30000 msg/s) and latency:
> > >> > > > > > [image: orig_single_producer.png]
> > >> > > > > > The PR:
> > >> > > > > > consumer tps (14000 msg/s) and latency:
> > >> > > > > > [image: pr_single_consumer.png]
> > >> > > > > > producer tps (30000 msg/s) and latency:
> > >> > > > > > [image: pr_single_producer.png]
> > >> > > > > > The result is similar, and even a little better, in the
> > >> > > > > > case of a single subscriber.
> > >> > > > > >
> > >> > > > > > We used our internal test platform, and I think JMeter can
> > >> > > > > > also be used to test it.
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> >
>
> --
> Clebert Suconic
>
