kudu-user mailing list archives

From "Quanlong Huang" <huang_quanl...@126.com>
Subject Re: Re: Why RowSet size is much smaller than flush_threshold_mb
Date Wed, 01 Aug 2018 13:28:28 GMT
Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to reply with
some follow-up questions but was assigned to focus on some other work... Now I'm back to
this.


The design docs are really helpful. Now I understand the flush and compaction paths. I think
we could add a link to these design docs on the Kudu documentation page, so users who want to
dig deeper can learn more about Kudu internals.


Thanks,
Quanlong

At 2018-06-15 23:41:17, "Todd Lipcon" <todd@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets,
not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is set by
--budgeted_compaction_target_rowset_size.


However, increasing this size isn't likely to decrease the number of compactions, because
each of these 32MB rowsets is non-overlapping. In other words, if your MRS contains rows A-Z,
the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap,
they will never need to be compacted with each other. The net result here is that compaction
becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there
is a lot of overlap.
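To make the rolling concrete, here's a toy Python sketch (my own illustration, not Kudu's
actual code; the constant just mirrors the flag's default) of how a sorted flush rolls into
disjoint RowSets:

    # Toy sketch, not Kudu's implementation: roll a sorted in-memory rowset
    # into ~32MB disk rowsets. Because the input is sorted, each output
    # rowset covers a disjoint slice of the key space.
    ROLL_THRESHOLD_BYTES = 32 * 1024 * 1024  # default roll size

    def roll_flush(sorted_rows, row_size_bytes):
        rowsets, current, current_bytes = [], [], 0
        for row in sorted_rows:
            current.append(row)
            current_bytes += row_size_bytes
            if current_bytes >= ROLL_THRESHOLD_BYTES:
                rowsets.append((current[0], current[-1]))  # (min key, max key)
                current, current_bytes = [], 0
        if current:
            rowsets.append((current[0], current[-1]))
        return rowsets  # e.g. [(A, C), (D, G), (H, P), (Q, Z)] -- no overlap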


You can read more about this in docs/design-docs/compaction-policy.md, in particular the
section "Limiting RowSet Sizes".


Hope that helps
-Todd


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wdberkeley@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing diskrowsets and rewrites
them. It's not a flush, which writes data in memory to disk, so I don't think flush_threshold_mb
is relevant. Rowset compaction is done to reduce the amount of overlap of rowsets in primary
key space, i.e. to reduce the number of rowsets that might need to be checked to enforce the
primary key constraint or to find a row. Having lots of rowset compaction indicates that rows
are being written in a somewhat random order w.r.t. the primary key order. Kudu will perform
much better as writes scale when rows are inserted roughly in increasing order per tablet.
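To see why insert order matters, here's a toy Python illustration (hypothetical key ranges,
not real Kudu data structures). The "overlap depth" is the number of rowsets a lookup or
uniqueness check may have to consult for a given key:

    # Toy illustration: each tuple is the (min key, max key) of one disk rowset.
    def max_overlap_depth(rowset_ranges):
        return max(
            sum(1 for lo, hi in rowset_ranges if lo <= key <= hi)
            for key in {k for r in rowset_ranges for k in r}
        )

    # Sequential inserts: each flush covers a fresh, disjoint key range.
    print(max_overlap_depth([(0, 9), (10, 19), (20, 29)]))  # -> 1
    # Random-order inserts: every flush spans most of the key space, so a
    # lookup may touch every rowset; compaction is what repairs this.
    print(max_overlap_depth([(0, 28), (1, 29), (2, 27)]))   # -> 3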


Also, because you are using the log block manager (the default and only one suitable for production
deployments), there isn't a 1-1 relationship between cfiles or diskrowsets and files on the
filesystem. Many cfiles and diskrowsets will be put together in a container file.
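As a toy model of that (not the real log block manager code), think of a container as one
big file that many logical blocks get appended into, each addressed by an (offset, length)
pair:

    # Toy model of a container file; the real log block manager also keeps
    # per-container metadata, punches holes for deleted blocks, etc.
    class Container:
        def __init__(self, path):
            self.path, self.next_offset, self.blocks = path, 0, {}

        def append_block(self, block_id, nbytes):
            self.blocks[block_id] = (self.next_offset, nbytes)
            self.next_offset += nbytes

    c = Container("/data/1/kudu/data/<container-id>.data")  # hypothetical path
    c.append_block("cfile-for-column-1", 4096)
    c.append_block("cfile-for-column-2", 8192)
    # Two cfile blocks, still just one data file on the filesystem.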


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)
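For reference, here's how those might look on a tablet server command line (the paths and
thread count are made-up examples, not recommendations):

    kudu-tserver \
      --fs_wal_dir=/disks/wal/kudu \
      --fs_data_dirs=/disks/1/kudu,/disks/2/kudu,/disks/3/kudu \
      --maintenance_manager_num_threads=4 \
      ...

Putting the WAL on a device the data dirs don't share tends to help, since compactions and
WAL appends then aren't competing for the same disk.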


The metrics from the compact row sets op indicate the time is spent in fdatasync
(fdatasync_us is ~9.6s of the 12.6s real time) and in reading (likely reading the original
rowsets). The overall compaction time is kinda long but not crazy long. What's the
performance you are seeing, and what is the performance you would like to see?


-Will


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <huang_quanlong@126.com> wrote:

Hi all,


I'm running kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet server, I find most
of the compactions are compacting small files (~40MB each). For example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb is set to the default value (1024). Wouldn't the flushed file size
be ~1GB?


I think increasing the initial RowSet size can reduce compactions and thus reduce the impact
on other ongoing operations. It may also improve flush performance. Is that right? If so,
how can I increase the RowSet size?


I'd be grateful if someone could clarify these points!


Thanks,
Quanlong

--

Todd Lipcon
Software Engineer, Cloudera