carbondata-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacky Li <jacky.li...@qq.com>
Subject Improving show segment info
Date Sun, 16 Feb 2020 08:06:13 GMT
Hi community,

Currently for SHOW SEGMENT command, carbon will print:

+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|SegmentSequenceId   |Status   |Load Start Time        |Load End Time          |Merged To|File
Format|Data Size|Index Size|Path|
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+----+
|5  |Compacted|2020-02-15 22:19:06.327|2020-02-15 22:19:06.516|4.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
|4.1|Success  |2020-02-15 22:19:06.327|2020-02-15 22:19:06.659|NA       |COLUMNAR_V3|640.0B
  |425.0B    |NA  |
|4  |Compacted|2020-02-15 22:19:05.894|2020-02-15 22:19:06.188|4.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
|3  |Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.391|2.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
|2.1|Compacted|2020-02-15 22:19:05.151|2020-02-15 22:19:05.592|0.2      |COLUMNAR_V3|640.0B
  |425.0B    |NA  |
|2  |Compacted|2020-02-15 22:19:04.804|2020-02-15 22:19:05.023|2.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
|1  |Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.242|0.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
|0.2|Success  |2020-02-15 22:19:05.151|2020-02-15 22:19:05.79 |NA       |COLUMNAR_V3|658.0B
  |425.0B    |NA  |
|0.1|Compacted|2020-02-15 22:19:04.006|2020-02-15 22:19:04.704|0.2      |COLUMNAR_V3|640.0B
  |425.0B    |NA  |
|0  |Compacted|2020-02-15 22:19:02.335|2020-02-15 22:19:03.775|0.1      |COLUMNAR_V3|630.0B
  |421.0B    |NA  |
+---+---------+-----------------------+-----------------------+---------+-----------+---------+----------+——+

While it prints basic info of segments, it does not support following usecase:

UserCase 1. Show for partition table
For partition table, have no way to know about the which partition that each segment mapped
to. The partition-segment mapping is two way mapping, user may want to check which segments
is belong to certain partition.

==> Suggest to add partition information for each segment when showing segments


UserCase 2. Show load delay
Have no way to know about the event time in the segment. By event time, I mean the timestamp
column value in the record being loaded. In many cases, user wanted to know the load delay
which indicate whether the system loading is catching up with the incoming data.

==> For this requirement, one suggestion is that we can show the minmax of the event time
column in SHOW SEGMENTS command, then user can know about the difference of the Load Start
Time and Event Time in the record. This solution depends on Segment level minmax (PR3584),
and we may need user to specify the event_time column name in table property when creating
table or use alter table to add it for existing table.


UserCase 3. Show load speed
Currenlty user need to calculate the load speed by Load Start Time and Load End Time, it is
not so convenient for user

==> We can keep the Load Start Time and calculate the elapsed time and throughput(in MBps)
by SHOW SEGMENTS automatically.


UserCase 4. Show more information and give more control to user, like filtering and sorting
Besides above information, actually there is more information kept in segment metadata but
have not shown to the user. And sometimes user wanted to sort the output by certain field,
like sort by load time

==> To show more detail info of the segment, I can think of two ways: 
    1) add SHOW SEGMENTS EXTENDED option
    2) To avoid making the output table too wide to read, I prefer to add DESC SEGMENTS command
to show detail info for one segment by specifying the segment id by user.

And, to give more control for user, we can add some options for sorting and filter. 


In summary, to support all above use cases, I suggest to change the command like:

Example 1:
// show basic info
SHOW SEGMENTS ON table1
will output
+---+---------+-----------------------+-----------------------+------+-----------+
|ID |Status   |Load Start Time        |Event Time Min         |Spent |Throughput |
+---+---------+-----------------------+-----------------------+------+-----------+
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|3m    |28MBps     |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|28s   |30MBps     |
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|4m6s  |25MBps     |
+---+---------+-----------------------+-----------------------+------+-----------+

this is short enough so that it can fit in most of the console line width

Example 2:
// To show partition for each segment
SHOW SEGMENTS ON table1 PARTITION
output:
+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION           |
+---+---------+-----------------------+-----------------------+--------------------+
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth  |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth  |
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars   |
+---+---------+-----------------------+-----------------------+--------------------+


Example 3:
// To filter out one partition
SHOW SEGMENTS ON table1 OPTOINS (‘partition’=‘dt=30,planet=earth’)

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION           |
+---+---------+-----------------------+-----------------------+--------------------+
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth  |
+---+---------+-----------------------+-----------------------+——————————+


Example 4:
// To sort by load start time
SHOW SEGMENTS ON table1 OPTOINS (’sort_by’=‘load_start_time')

+---+---------+-----------------------+-----------------------+--------------------+
|ID |Status   |Load Start Time        |Event Time Min         |PARTITION           |
+---+---------+-----------------------+-----------------------+--------------------+
|0  |Success  |2020-02-15 22:20:25.894|2020-02-15 22:17:06.621|dt=34,planet=mars   |
|1  |Success  |2020-02-15 22:15:08.283|2020-02-15 22:11:06.432|dt=30,planet=earth  |
|2  |Success  |2020-02-15 22:10:00.327|2020-02-15 21:06:00.327|dt=29,planet=earth  |
+---+---------+-----------------------+-----------------------+--------------------+

The sort field can be: id, load_start_time, event_time_min, event_time_max, data_size, index_size,
and column_name_min or column_name_max, where column_name can be any literal string and exist
in the table schema.


Example 5:
// To show more detail info
DESC SEGMENT 2 ON table1

This will show all information for segment id 2.  Information including not only table status
file but also info in segment file.


I hope this suggestion can make segment info more visible to the user.
What do you think?


Regards,
Jacky
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message