Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id D1B44200B7A for ; Mon, 5 Sep 2016 16:28:48 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id D0301160AD2; Mon, 5 Sep 2016 14:28:48 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id F12DA160AB0 for ; Mon, 5 Sep 2016 16:28:47 +0200 (CEST) Received: (qmail 44157 invoked by uid 500); 5 Sep 2016 14:28:47 -0000 Mailing-List: contact dev-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@impala.incubator.apache.org Delivered-To: mailing list dev@impala.incubator.apache.org Received: (qmail 43979 invoked by uid 99); 5 Sep 2016 14:28:46 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Sep 2016 14:28:46 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 76305C0115 for ; Mon, 5 Sep 2016 14:28:46 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id Eq732EZGXqk5 for ; Mon, 5 Sep 2016 14:28:44 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id 780A75FC38 for ; Mon, 5 Sep 2016 14:28:44 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id u85ESh0V000760; Mon, 5 Sep 2016 14:28:43 GMT Message-Id: <201609051428.u85ESh0V000760@ip-10-146-233-104.ec2.internal> Date: Mon, 5 Sep 2016 14:28:43 +0000 From: "Zoltan Ivanfi (Code Review)" To: impala-cr@cloudera.com, dev@impala.incubator.apache.org CC: Lars Volker , Matthew Jacobs Reply-To: zi@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-3973=3A_add_position_and_occurrence_to_instr=28=29=0A?= X-Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd X-Gerrit-ChangeURL: X-Gerrit-Commit: 326450f5b8a30dad13004ef7c8146261fd04d5a7 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.2 archived-at: Mon, 05 Sep 2016 14:28:49 -0000 Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3973: add position and occurrence to instr() ...................................................................... Patch Set 6: (17 comments) http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: PS5, Line 289: StringValue str_sv = StringValue::FromStringVal(str); : StringValue substr_sv = StringValue::FromStringVal(substr); : StringSearch search(&substr_sv); : // Hive returns positions starting from 1. : return IntVal(search.Search(&str_sv) + 1); : } > I think it makes sense to unify this by using the new method. If there's bu Done PS5, Line 329: n + start_pos > I think we could just remove the comment. The next line by itself explains Done http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: PS6, Line 311: search_start_pos < 0 > remove- this is not possible since we know it's not 0 from l302 and we know Indeed, sorry - in an earlier version of my code this check was done in a place common to the regular and reverse search. PS6, Line 311: { : return IntVal(0); > this will be short enough to return on the same line Done PS6, Line 318: if (match_pos_in_substring == -1) { : return IntVal(0); : } > 1 line Done PS6, Line 322: search_start_pos > I think you also need to make sure search_start_pos is < haystack.len befor I don't think it's necessary. I check the starting position before the loop, so we start with a non-negative length for the Substring call. Then in each iteration of the loop, one of two things can happen: 1. There is no match in the substring, in which case the function returns immediately. 2. There is a match, in which case the new starting position is one char after the match position. Because the match position must be inside the string, match_pos < haystack.len, therefore search_start_pos = match_pos + 1 <= haystack.len (actually it will be strictly less, because the length of the match must be > 0). PS6, Line 330: search_start_pos >= str.len > This isn't possible, str.len = haystack.len, and we know we added something Indeed, sorry - in an earlier version of my code this check was done in a place common to the regular and reverse search. PS6, Line 330: if (search_start_pos < 0 || search_start_pos >= str.len) { : return IntVal(0); : } > remove We still need the < 0 check. PS6, Line 336: search_start_pos + needle.len > why do you add the needle len? Because the position of the match is the index of its leftmost char, but we need to include the match itself in the substring where we search. For example, if we are searching for "234" in "0123456" and we want to find matches where the match index is <= 2, the substring "012" is not enough, we have to search in "01234". PS6, Line 338: if (match_pos == -1) { : return IntVal(0); : } > 1 line Done http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions.h File be/src/exprs/string-functions.h: PS5, Line 64: context > nit: remove to match the surrounding code. Done http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/runtime/string-search-test.cc File be/src/runtime/string-search-test.cc: Line 28: return StringValue(const_cast(str), len); > You're right. Unfortunately the style guide wasn't clear in this regard. Wh Done PS5, Line 69: ECT_EQ(-1, TestSearch("abca > I see, let's fix it here then at least. Please rephrase to "Test searching Done PS5, Line 70: > Ok, makes sense to me. Can you add this as a comment to the factory methods Done Line 74: EXPECT_EQ(-1, TestSearch("abcdabaaaaa", "abaaaaaa", 4, 3)); > Same here Done Line 81: EXPECT_EQ(2, TestSearch("abcdede", "cde")); > How about "Tests with needle at the end of the haystack." then? Doesn't mak Done http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/runtime/string-search.h File be/src/runtime/string-search.h: PS6, Line 102: for (int i = 0; i < mlast; ++i) { : BloomAdd(pattern_->ptr[i]); : if (pattern_->ptr[i] == pattern_->ptr[mlast]) : skip_ = mlast - i - 1; : } : BloomAdd(pattern_->ptr[mlast]); > I don't think this will always work for RSearch, see https://hg.python.org/ Nice catch, thanks. I think we may be able to do better than the Python implementation here. That one iterates in reverse indeed, and for this reason they moved this part immediately before the search, since the direction is not known during construction. However, since the search may be reused for searching repeatedly, it would be better to build the bloom filter only once. The contents of the bloom filter do not depend on the order of adding elements to it. The only difference between iterating forwards vs. backwards is the value of skip_. It will always depend on the last position (in the order of iterating) where the pattern contains the same char as the last char of the pattern (again, in the order of iterating). This means that in a single loop we can populate the bloom filter and save two different skip_ value - one for searching forwards and one for searching backwards. I will create a test that catches this error then modify the code the way I described above. -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes