impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zoltan Ivanfi (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Date Mon, 05 Sep 2016 14:28:43 GMT
Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-3973: add position and occurrence to instr()
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 289:   StringValue str_sv = StringValue::FromStringVal(str);
             :   StringValue substr_sv = StringValue::FromStringVal(substr);
             :   StringSearch search(&substr_sv);
             :   // Hive returns positions starting from 1.
             :   return IntVal(search.Search(&str_sv) + 1);
             : }
> I think it makes sense to unify this by using the new method. If there's bu
Done


PS5, Line 329: n + start_pos
> I think we could just remove the comment. The next line by itself explains 
Done


http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS6, Line 311: search_start_pos < 0
> remove- this is not possible since we know it's not 0 from l302 and we know
Indeed, sorry - in an earlier version of my code this check was done in a place common to
the regular and reverse search.


PS6, Line 311: {
             :       return IntVal(0);
> this will be short enough to return on the same line
Done


PS6, Line 318:       if (match_pos_in_substring == -1) {
             :         return IntVal(0);
             :       }
> 1 line
Done


PS6, Line 322: search_start_pos
> I think you also need to make sure search_start_pos is < haystack.len befor
I don't think it's necessary. I check the starting position before the loop, so we start with
a non-negative length for the Substring call. Then in each iteration of the loop, one of two
things can happen:
1. There is no match in the substring, in which case the function returns immediately.
2. There is a match, in which case the new starting position is one char after the match position.
Because the match position must be inside the string, match_pos < haystack.len, therefore
search_start_pos = match_pos + 1 <= haystack.len (actually it will be strictly less, because
the length of the match must be > 0).


PS6, Line 330:  search_start_pos >= str.len
> This isn't possible, str.len = haystack.len, and we know we added something
Indeed, sorry - in an earlier version of my code this check was done in a place common to
the regular and reverse search.


PS6, Line 330:     if (search_start_pos < 0 || search_start_pos >= str.len) {
             :       return IntVal(0);
             :     }
> remove
We still need the < 0 check.


PS6, Line 336: search_start_pos + needle.len
> why do you add the needle len?
Because the position of the match is the index of its leftmost char, but we need to include
the match itself in the substring where we search. For example, if we are searching for "234"
in "0123456" and we want to find matches where the match index is <= 2, the substring "012"
is not enough, we have to search in "01234".


PS6, Line 338:       if (match_pos == -1) {
             :         return IntVal(0);
             :       }
> 1 line
Done


http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions.h
File be/src/exprs/string-functions.h:

PS5, Line 64: context
> nit: remove to match the surrounding code.
Done


http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/runtime/string-search-test.cc
File be/src/runtime/string-search-test.cc:

Line 28:   return StringValue(const_cast<char*>(str), len);
> You're right. Unfortunately the style guide wasn't clear in this regard. Wh
Done


PS5, Line 69: ECT_EQ(-1, TestSearch("abca
> I see, let's fix it here then at least. Please rephrase to "Test searching 
Done


PS5, Line 70: 
> Ok, makes sense to me. Can you add this as a comment to the factory methods
Done


Line 74:   EXPECT_EQ(-1, TestSearch("abcdabaaaaa", "abaaaaaa", 4, 3));
> Same here
Done


Line 81:   EXPECT_EQ(2, TestSearch("abcdede", "cde"));
> How about "Tests with needle at the end of the haystack." then? Doesn't mak
Done


http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/runtime/string-search.h
File be/src/runtime/string-search.h:

PS6, Line 102:     for (int i = 0; i < mlast; ++i) {
             :       BloomAdd(pattern_->ptr[i]);
             :       if (pattern_->ptr[i] == pattern_->ptr[mlast])
             :         skip_ = mlast - i - 1;
             :     }
             :     BloomAdd(pattern_->ptr[mlast]);
> I don't think this will always work for RSearch, see https://hg.python.org/
Nice catch, thanks. I think we may be able to do better than the Python implementation here.
That one iterates in reverse indeed, and for this reason they moved this part immediately
before the search, since the direction is not known during construction. However, since the
search may be reused for searching repeatedly, it would be better to build the bloom filter
only once.

The contents of the bloom filter do not depend on the order of adding elements to it. The
only difference between iterating forwards vs. backwards is the value of skip_. It will always
depend on the last position (in the order of iterating) where the pattern contains the same
char as the last char of the pattern (again, in the order of iterating). This means that in
a single loop we can populate the bloom filter and save two different skip_ value - one for
searching forwards and one for searching backwards.

I will create a test that catches this error then modify the code the way I described above.


-- 
To view, visit http://gerrit.cloudera.org:8080/4094
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zoltan Ivanfi <zi@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message