kudu-user mailing list archives

From "Quanlong Huang" <huang_quanl...@126.com>
Subject Re:Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb
Date Thu, 02 Aug 2018 11:16:40 GMT
No, I failed to tune other flags... That's why I started this thread...

I understand that whether to expose the design docs is a trade-off. Not exposing them keeps
the documentation clearer. The downside is that users may bother you guys more when they run
into problems, since they can't find the answers themselves. However, that's not a problem
since you guys are quite helpful :)


At 2018-08-02 10:18:00,"Todd Lipcon" <todd@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang <huang_quanlong@126.com> wrote:

In my experience, when I find that performance is below my expectations, I like to tune the flags
listed in https://kudu.apache.org/docs/configuration_reference.html, which requires a clear
understanding of Kudu internals. Maybe we can add the link there?

Any particular flags that you found you had to tune? I almost never advise tuning anything
other than the number of maintenance threads. If you have some good guidance on how tuning
those flags can improve performance, maybe we can consider changing the defaults or giving
some more prescriptive advice?

I'm a little nervous that saying "here are all the internals, and here are 100 config flags
to study" will scare users more than help them :)


At 2018-08-02 01:06:40,"Todd Lipcon" <todd@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <huang_quanlong@126.com> wrote:

Hi Todd and William,

I really appreciate your help, and sorry for my late reply. I was going to reply with
some follow-up questions but was assigned to focus on some other work... Now I'm back to this

The design docs are really helpful. Now I understand flush and compaction. I think we
can add a link to these design docs on the Kudu documentation page, so users who want to dig
deeper can learn more about Kudu internals.

Personally, since starting the project, I have had the philosophy that the user-facing documentation
should remain simple and not discuss internals too much. I found in some other open source
projects that there isn't a clear difference between user documentation and developer documentation,
and users can easily get confused by all of the internal details. Or, users may start to believe
that Kudu is very complex and they need to understand knapsack problem approximation algorithms
in order to operate it. So, normally we try to avoid exposing too much of the details.

That said, I think it is a good idea to add a small note in the documentation somewhere that
links to the design docs, maybe with some sentence explaining that understanding internals
is not necessary to operate Kudu, but that expert users may find the internal design useful
as a reference? I would be curious to hear what other users think about how best to make this

At 2018-06-15 23:41:17, "Todd Lipcon" <todd@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets,
not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is set by the
--budgeted_compaction_target_rowset_size flag.

However, increasing this size isn't likely to decrease the number of compactions, because
each of these 32MB rowsets is non-overlapping. In other words, if your MRS contains rows A-Z,
the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap,
they will never need to be compacted with each other. The net result, here, is that compaction
becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there
is a lot of overlap.
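
A toy sketch of the rolling behaviour described above, not Kudu's actual flush code: it assumes every row has the same fixed size, and `roll_flush` is a hypothetical helper that cuts a key-sorted MemRowSet into RowSets at a target size. Because the input is sorted by key, consecutive output ranges cannot overlap.

```python
from typing import List, Tuple

def roll_flush(sorted_keys: List[str], row_bytes: int,
               target_rowset_bytes: int) -> List[Tuple[str, str]]:
    """Split key-sorted rows into consecutive chunks of roughly the target size.

    Returns the (min_key, max_key) range of each output chunk.
    Hypothetical helper for illustration only; Kudu's real flush path differs.
    """
    rows_per_rowset = max(1, target_rowset_bytes // row_bytes)
    ranges = []
    for start in range(0, len(sorted_keys), rows_per_rowset):
        chunk = sorted_keys[start:start + rows_per_rowset]
        ranges.append((chunk[0], chunk[-1]))
    return ranges

# 26 rows keyed "A".."Z", 8 MB each (an assumed size), rolled every 32 MB.
MB = 1024 * 1024
keys = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
ranges = roll_flush(keys, row_bytes=8 * MB, target_rowset_bytes=32 * MB)

# Each range starts strictly after the previous one ends, so none overlap.
non_overlapping = all(prev[1] < cur[0] for prev, cur in zip(ranges, ranges[1:]))
print(ranges)
print("non-overlapping:", non_overlapping)
```

By the same rough arithmetic, and ignoring encoding and compression, a flush triggered at 1024 MB and rolled every 32 MB would yield on the order of 1024 / 32 = 32 RowSets rather than a single 1 GB one.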

You can read more about this in docs/design-docs/compaction-policy.md, in particular the section
"Limiting RowSet Sizes"

Hope that helps

On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wdberkeley@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing diskrowsets and rewrites
them. It's not a flush, which writes data in memory to disk, so I don't think the flush_threshold_mb
is relevant. Rowset compaction is done to reduce the amount of overlap of rowsets in primary
key space, i.e. reduce the number of rowsets that might need to be checked to enforce the
primary key constraint or find a row. Having lots of rowset compaction indicates that rows
are being written in a somewhat random order w.r.t the primary key order. Kudu will perform
much better as writes scale when rows are inserted roughly in increasing order per tablet.
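
To make the overlap point concrete, here is a small sketch under simplifying assumptions (it is not Kudu's compaction policy): each batch of inserted keys is treated as one rowset spanning its minimum to maximum key, and we count the largest number of rowsets that any single key falls into. `rowset_ranges` and `max_overlap` are hypothetical helpers introduced for illustration.

```python
import random
from typing import List, Tuple

def rowset_ranges(keys: List[int], batch: int) -> List[Tuple[int, int]]:
    """Treat each batch of inserted keys as one rowset spanning [min, max]."""
    return [(min(keys[i:i + batch]), max(keys[i:i + batch]))
            for i in range(0, len(keys), batch)]

def max_overlap(ranges: List[Tuple[int, int]]) -> int:
    """Largest number of rowsets whose key ranges cover a single key."""
    events = []
    for lo, hi in ranges:
        events.append((lo, 0))  # range opens (sorts before a close at the same key)
        events.append((hi, 1))  # range closes
    depth = best = 0
    for _, kind in sorted(events):
        depth += 1 if kind == 0 else -1
        best = max(best, depth)
    return best

N = 10_000
sequential = list(range(N))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

sequential_overlap = max_overlap(rowset_ranges(sequential, batch=1000))
random_overlap = max_overlap(rowset_ranges(shuffled, batch=1000))
print("sequential inserts:", sequential_overlap)
print("random inserts:   ", random_overlap)
```

In this model, in-order inserts leave every key covered by a single rowset, while shuffled inserts make nearly every rowset span the whole key space, which is the situation rowset compaction exists to clean up.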

Also, because you are using the log block manager (the default and only one suitable for production
deployments), there isn't a 1-1 relationship between cfiles or diskrowsets and files on the
filesystem. Many cfiles and diskrowsets will be put together in a container file.

Config parameters that might be relevant here:
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)

The metrics from the compact row sets op indicate that the time is spent in fdatasync and in reading
(likely reading the original rowsets). The overall compaction time is kinda long but not crazy
long. What's the performance you are seeing and what is the performance you would like to


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <huang_quanlong@126.com> wrote:

Hi all,

I'm running Kudu 1.6.0-cdh5.14.2. When looking into the tablet server logs, I find that most
of the compactions are compacting small files (~40MB each). For example:

I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T
< 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba):
real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7:
CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data
dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}
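
As a rough way to read the metrics line above, the sketch below parses a few of its fields (values copied verbatim from the log) and compares them with the reported wall-clock time. It assumes the `_us` counters are microseconds, as their names suggest; since the op started several threads, the counters may be summed across them, so the percentages are only indicative.

```python
import json

# Subset of the CompactRowSetsOp metrics from the log line above.
metrics_json = (
    '{"fdatasync":315,"fdatasync_us":9617174,'
    '"lbm_read_time_us":1288971,"lbm_write_time_us":122520}'
)
real_seconds = 12.628  # "real" time reported by maintenance_manager.cc

metrics = json.loads(metrics_json)
fdatasync_s = metrics["fdatasync_us"] / 1e6
read_s = metrics["lbm_read_time_us"] / 1e6
write_s = metrics["lbm_write_time_us"] / 1e6
avg_fdatasync_ms = metrics["fdatasync_us"] / metrics["fdatasync"] / 1000

print(f"fdatasync:    {fdatasync_s:.2f}s ({fdatasync_s / real_seconds:.0%} of wall time)")
print(f"block reads:  {read_s:.2f}s ({read_s / real_seconds:.0%})")
print(f"block writes: {write_s:.2f}s ({write_s / real_seconds:.0%})")
print(f"average fdatasync call: {avg_fdatasync_ms:.1f} ms")
```

On these numbers, about 9.6 s of the 12.6 s is attributed to fdatasync and about 1.3 s to block reads, with block writes a small remainder.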

The flush_threshold_mb is set to the default value (1024). Shouldn't the flushed file size
be ~1GB?

I think increasing the initial RowSet size can reduce compactions and so reduce the impact
on other ongoing operations. It may also improve flush performance. Is that right? If
so, how can I increase the RowSet size?

I'd be grateful if someone could clarify these points for me!



Todd Lipcon
Software Engineer, Cloudera

