Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 18 Jul 2017 22:15:00 +0000 (UTC)
From: "Enis Soztutar (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.13073570.1495228737000.276447.1500416100679@Atlassian.JIRA>
In-Reply-To: <JIRA.13073570.1495228737000@Atlassian.JIRA>
References: <JIRA.13073570.1495228737000@Atlassian.JIRA> <JIRA.13073570.1495228737833@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HBASE-18086) Create native client which creates
 load on selected cluster
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Tue, 18 Jul 2017 22:15:08 -0000


    [ https://issues.apache.org/jira/browse/HBASE-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092265#comment-16092265 ] 

Enis Soztutar commented on HBASE-18086:
---------------------------------------

bq. Updated patch v12 where random number generation is lifted outside the loop (it was observed that write performance suffered with random number generation inside the loop).
It does not make sense to me that random number generation would be this costly. I've looked at the folly code, and nothing there explains it. Can you please verify the total number of columns written in each case? You can also test by generating 1M or so random numbers in a loop and measuring the total time it takes end to end. We want each row to come with a different number of columns.
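
A rough sketch of the kind of measurement I mean. It uses {{std::mt19937}} from the standard library as a stand-in for the folly generator used in the patch, and the function names, range and seed are made up for illustration, so treat the numbers as a sanity check only:

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Generates `count` random column counts in [1, max_cols] and returns their
// sum, so the compiler cannot optimize the loop away.
// NOTE: std::mt19937 is an assumption here; the patch uses folly instead.
uint64_t SumRandomColumnCounts(uint64_t count, uint32_t max_cols, uint32_t seed) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<uint32_t> dist(1, max_cols);
  uint64_t total = 0;
  for (uint64_t i = 0; i < count; i++) {
    total += dist(gen);
  }
  return total;
}

// Returns the wall-clock time, in microseconds, spent generating `count`
// random numbers. The sum is written to *sum_out so it stays observable.
int64_t TimeRandomGenerationMicros(uint64_t count, uint64_t *sum_out) {
  auto start = std::chrono::steady_clock::now();
  *sum_out = SumRandomColumnCounts(count, 1000, 42);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Calling {{TimeRandomGenerationMicros(1000000, &sum)}} and comparing the result against the total run time of the write loop should show whether random number generation can plausibly account for the slowdown.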

- No use of {{new}} or {{delete}}. Always use smart pointers. 
{code}
+    std::thread *writer_threads = new std::thread[FLAGS_threads];
{code}
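
For the thread array, one option is to let a {{std::vector}} own the threads. This is only a sketch: {{RunWorkers}} and the lambda body are placeholders for whatever the writer threads actually do in the patch.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Starts `num_threads` workers, joins them all, and returns how many ran.
// The vector owns the std::thread objects, so no new[] / delete[] is needed.
int RunWorkers(int num_threads) {
  std::atomic<int> finished{0};
  std::vector<std::thread> writer_threads;
  writer_threads.reserve(num_threads);
  for (int i = 0; i < num_threads; i++) {
    // Placeholder for the real per-thread write work.
    writer_threads.emplace_back([&finished] { finished++; });
  }
  for (auto &t : writer_threads) {
    t.join();
  }
  return finished.load();
}
```

The function also returns its result directly rather than writing it through an output pointer.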

- These flags should have the same names as the ones in simple-client.cc: 
{code}
+DEFINE_int32(multi_get_size, 1, "number of gets in one multi-get");
+DEFINE_bool(skip_get, false, "skip get / scan");
+DEFINE_bool(skip_put, false, "skip put's");
{code} 
There are also report_num_rows, scans, multigets and conf flags that you should implement.

- These should be return values instead of pointers passed to the methods: 
{code}
bool *succeeded
{code}

- Instead of executing every Cell as a separate Put via Table::Put(), you should construct one Put object per row, add all of the Cells to it, then call Table::Put() once. 
{code}
for (uint64_t j = 0; j < rows; j++) {
+    std::string row = PrefixZero(width, iteration * rows + j);
+    for (auto family : families) {
+      table->Put(Put{row}.AddColumn(family, kNumColumn, std::to_string(n_cols)));
+      for (unsigned int k = 1; k <= n_cols; k++) {
+        table->Put(Put{row}.AddColumn(family, std::to_string(k), row));
+      }
+    }
{code}
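
A sketch of the single-Put shape. {{FakePut}} and {{FakeTable}} are hypothetical stand-ins that mimic only the calls visible in the patch; they are not the real client classes, and the value of {{kNumColumn}} is a placeholder:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Placeholder value; the patch defines its own kNumColumn.
const std::string kNumColumn = "num_columns";

// Hypothetical stand-in for the client's Put: it only collects cells.
class FakePut {
 public:
  explicit FakePut(std::string row) : row_(std::move(row)) {}
  FakePut &AddColumn(const std::string &family, const std::string &qualifier,
                     const std::string &value) {
    cells_.push_back(family + ":" + qualifier + "=" + value);
    return *this;
  }
  uint64_t NumCells() const { return cells_.size(); }

 private:
  std::string row_;
  std::vector<std::string> cells_;
};

// Hypothetical stand-in for the client's Table: it only counts calls and cells.
struct FakeTable {
  uint64_t put_calls = 0;
  uint64_t cells_written = 0;
  void Put(const FakePut &put) {
    put_calls++;
    cells_written += put.NumCells();
  }
};

// Builds one Put for the row, adds every cell for every family, and sends it
// with a single Put() call instead of one call per cell.
void WriteRow(FakeTable *table, const std::string &row,
              const std::vector<std::string> &families, uint32_t n_cols) {
  FakePut put{row};
  for (const auto &family : families) {
    put.AddColumn(family, kNumColumn, std::to_string(n_cols));
    for (uint32_t k = 1; k <= n_cols; k++) {
      put.AddColumn(family, std::to_string(k), row);
    }
  }
  table->Put(put);
}
```

With two families and three columns, this issues one Put() carrying eight cells, where the loop in the patch would issue eight separate calls.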

- Instead of this method: 
{code}
+std::string PrefixZero(int total_width, int num) {
{code}
you can probably do something like this (from scanner-test.cc): 
{code}
std::string Row(uint32_t i, int width) {
  std::ostringstream s;
  s.fill('0');
  s.width(width);
  s << i;
  return "row" + s.str();
}
{code}

- Scans and gets should validate the obtained Result using the same logic, no? I think you should extract that into a function and use it from both. 
- The way we do multi-gets will result in all of the multi-get requests going to the same region. Instead, I think it is better to have the multi-gets scattered across most of the regions, so that we have a high likelihood of testing server failure handling, etc. when chaos monkey is run with this. I argued the same in my comments above. I think we can do something like hash-like striping of the row key space among threads, rather than range-based striping. That should give us the ability to do multi-gets across all the regions in one {{Table::Get(std::vector)}} call. 
 - We don't have multi-put functionality right now, but when that is added, we should do a follow-up patch to add multi-put functionality here. 
- These should default to {{load_test_table}} and {{f}} respectively. 
{code}
+DEFINE_string(table, "t", "What table to do the reads and writes with");
+DEFINE_string(families, "d", "comma separated list of column family names");
{code}
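
To illustrate the striping suggested for multi-gets above, here is a minimal sketch that contrasts range-based ownership with a simple modulo scheme. The function names are made up, and modulo is just one possible "hash-like" assignment. How many regions a single multi-get actually touches still depends on the batch size, the thread count and the region boundaries.

```cpp
#include <cstdint>
#include <vector>

// Range-based striping: each thread owns one contiguous block of row indices,
// so consecutive rows in a multi-get tend to land in the same region.
std::vector<uint64_t> RangeStripedRows(uint64_t total_rows, uint32_t num_threads,
                                       uint32_t thread_id) {
  uint64_t per_thread = total_rows / num_threads;
  std::vector<uint64_t> rows;
  for (uint64_t j = 0; j < per_thread; j++) {
    rows.push_back(thread_id * per_thread + j);
  }
  return rows;
}

// Modulo striping: each thread owns every num_threads-th row index, so its
// rows are spread over the whole key space.
std::vector<uint64_t> ModuloStripedRows(uint64_t total_rows, uint32_t num_threads,
                                        uint32_t thread_id) {
  std::vector<uint64_t> rows;
  for (uint64_t r = thread_id; r < total_rows; r += num_threads) {
    rows.push_back(r);
  }
  return rows;
}
```

For 12 rows and 3 threads, thread 1 owns rows 4 to 7 under range striping and rows 1, 4, 7 and 10 under modulo striping.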

> Create native client which creates load on selected cluster
> -----------------------------------------------------------
>
>                 Key: HBASE-18086
>                 URL: https://issues.apache.org/jira/browse/HBASE-18086
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 18086.v11.txt, 18086.v12.txt, 18086.v14.txt, 18086.v1.txt, 18086.v3.txt, 18086.v4.txt, 18086.v5.txt, 18086.v6.txt, 18086.v7.txt, 18086.v8.txt
>
>
> This task is to create a client which uses multiple threads to conduct Puts followed by Gets against selected cluster.
> Default is to run the tool against local cluster.
> This would give us some idea on the characteristics of native client in terms of handling high load.

