hbase-user mailing list archives

From Henry Hung <YTHu...@winbond.com>
Subject RE: RegexStringComparator problem: Why pattern "u" has the same result as ".*u.*" ?
Date Mon, 16 Jun 2014 03:50:59 GMT
I found the problem: inside RegexStringComparator, compareTo() uses find() rather than
matches():

  public int compareTo(byte[] value, int offset, int length) {
    // Use find() for subsequence match instead of matches() (full sequence
    // match) to adhere to the principle of least surprise.
    String tmp;
    if (length < value.length / 2) {
      // See HBASE-9428. Make a copy of the relevant part of the byte[],
      // or the JDK will copy the entire byte[] during String decode
      tmp = new String(Arrays.copyOfRange(value, offset, offset + length), charset);
    } else {
      tmp = new String(value, offset, length, charset);
    }
    return pattern.matcher(tmp).find() ? 0 : 1;
  }


I used a simple program to test the difference between matches() and find():

String s = "hung";
Pattern p = Pattern.compile("u", Pattern.DOTALL);
Matcher m = p.matcher(s);
System.out.println(m.matches()); // prints false
System.out.println(m.find());    // prints true

p = Pattern.compile(".*u.*", Pattern.DOTALL);
m = p.matcher(s);
System.out.println(m.matches()); // prints true
// Prints false, but only because the successful matches() call above already
// consumed the whole input; after m.reset(), find() returns true here too.
System.out.println(m.find());
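
A minimal sketch of the same comparison, using a fresh Matcher for every call so that no result depends on the call before it. The class and method names are made up for illustration and are not part of HBase:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of HBase.
final class FindVsMatches {
    // Full-sequence match on a fresh Matcher.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    // Subsequence match on a fresh Matcher, the check that compareTo() performs.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "hung";
        for (String regex : new String[] {"u", ".*u.*", "^u$"}) {
            System.out.println(regex
                + " matches()=" + matchesWhole(regex, s)
                + " find()=" + findsAnywhere(regex, s));
        }
    }
}
```

With a fresh Matcher, both "u" and ".*u.*" report find() as true for "hung", which is why the two expressions select the same rows under the comparator. Of the three expressions, only "^u$" makes find() reject it.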

matches() is what I need right now, and to me it is the more reasonable choice, but
I don't know how to switch to it without modifying the source code.

@Ted:
What you suggest is correct, but it is rather counterintuitive for our user base: we are
accustomed to searching with an expression such as "abc.*" to find values with the prefix
"abc", rather than having to write "^abc.*" explicitly.
If I can't change the RegexStringComparator compareTo() method from find() to matches(),
then I suppose I can implement a hard fix by adding "^" at the beginning of the search keyword.
Thanks for your quick responses.
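
A sketch of that workaround, assuming the expression is rewritten on the client before it is handed to RegexStringComparator. The helper name anchor() is hypothetical, and it anchors both ends (\A ... \z) rather than only adding "^", so that find() behaves like matches():

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of HBase.
final class AnchoredRegex {
    // Wrap the user's expression in a non-capturing group between \A and \z,
    // so that find() succeeds only when the whole input matches.
    static String anchor(String userRegex) {
        return "\\A(?:" + userRegex + ")\\z";
    }

    public static void main(String[] args) {
        Pattern prefix = Pattern.compile(anchor("abc.*"), Pattern.DOTALL);
        System.out.println(prefix.matcher("abcdef").find()); // value starts with "abc"
        System.out.println(prefix.matcher("xabc").find());   // "abc" is not a prefix here

        Pattern exact = Pattern.compile(anchor("u"), Pattern.DOTALL);
        System.out.println(exact.matcher("hung").find());    // "u" is only a substring
    }
}
```

The anchored string could then be passed to the RegexStringComparator constructor in place of the raw expression. The non-capturing group keeps an alternation such as "abc|def" anchored as a whole instead of anchoring only its first and last branches.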

Best regards,
Henry

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Monday, June 16, 2014 11:32 AM
To: user@hbase.apache.org
Subject: Re: RegexStringComparator problem: Why pattern "u" has the same result as ".*u.*" ?

"u" is part of "hung", producing a match.

Do you want to find strings whose value is exactly "u" (not a substring)?
In that case you can specify "^u$".

Cheers


On Sun, Jun 15, 2014 at 8:20 PM, Henry Hung <YTHung1@winbond.com> wrote:

>
> I have this data set and the value I want to test is "cf:c" = "hung":
>
> hbase(main):001:0> scan 'TEST'
> ROW                                     COLUMN+CELL
> \x00\x00\x00\x03abc\x00\x00\x00\x02     column=cf:a, timestamp=1402649511909, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x02     column=cf:b, timestamp=1402649511909, value=\x00\x00\x00\x02
> \x00\x00\x00\x03abc\x00\x00\x00\x02     column=cf:c, timestamp=1402649511909, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x02     column=cf:d, timestamp=1402649511909, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x03abc\x00\x00\x00\x03     column=cf:a, timestamp=1402649610557, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x03     column=cf:b, timestamp=1402649610557, value=\x00\x00\x00\x03
> \x00\x00\x00\x03abc\x00\x00\x00\x03     column=cf:c, timestamp=1402649610557, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x03     column=cf:d, timestamp=1402649610557, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x03abc\x00\x00\x00\x04     column=cf:a, timestamp=1402650015602, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x04     column=cf:b, timestamp=1402650015602, value=\x00\x00\x00\x04
> \x00\x00\x00\x03abc\x00\x00\x00\x04     column=cf:c, timestamp=1402650015602, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x04     column=cf:d, timestamp=1402650015602, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x05henry\x00\x00\x00\x06   column=cf:a, timestamp=1402886404698, value=henry
> \x00\x00\x00\x05henry\x00\x00\x00\x06   column=cf:b, timestamp=1402886404698, value=\x00\x00\x00\x06
> \x00\x00\x00\x05henry\x00\x00\x00\x06   column=cf:c, timestamp=1402886404698, value=hung
> \x00\x00\x00\x05henry\x00\x00\x00\x06   column=cf:d, timestamp=1402886404698, value=\x00\x00\x01F\xA2\x8A\xBD\xA0
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01  column=cf:a, timestamp=1402650022755, value=abcdef
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01  column=cf:b, timestamp=1402650022755, value=\x00\x00\x00\x01
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01  column=cf:c, timestamp=1402650022755, value=def
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01  column=cf:d, timestamp=1402650022755, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02  column=cf:a, timestamp=1402650025763, value=abcdef
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02  column=cf:b, timestamp=1402650025763, value=\x00\x00\x00\x02
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02  column=cf:c, timestamp=1402650025763, value=def
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02  column=cf:d, timestamp=1402650025763, value=\x00\x00\x01F\x93\x81s\xA8
> 6 row(s) in 0.1090 seconds
>
>
> I wrote a small program to test it:
>
> HTable conn = new HTable(HBaseConfiguration.create(), "TEST");
> try {
>     Scan scan = new Scan();
>     RegexStringComparator comp = new RegexStringComparator("u");
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>         Bytes.toBytes("cf"), Bytes.toBytes("c"), CompareOp.EQUAL, comp);
>     FilterList filters = new FilterList(Operator.MUST_PASS_ALL);
>     filters.addFilter(filter);
>     scan.setFilter(filters);
>     ResultScanner rs = conn.getScanner(scan);
>     try {
>         Result r = rs.next();
>         System.out.println(Bytes.toString(
>             r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c"))));
>     } finally {
>         rs.close();
>     }
> } finally {
>     conn.close();
> }
>
> Because I use the regex "u" as the value comparator, I expected no row to
> match, so the program should throw a NullPointerException.
> But when I execute it, the result is "hung".
>
> My question is: why does the SingleColumnValueFilter not abide by the regex
> comparator? Or why does the regex comparator "u" behave the same as ".*u.*"?
>
> Best regards,
> Henry Hung
>
> ________________________________
> The privileged confidential information contained in this email is
> intended for use only by the addressees as indicated by the original
> sender of this email. If you are not the addressee indicated in this
> email or are not responsible for delivery of the email to such a
> person, please kindly reply to the sender indicating this fact and
> delete all copies of it from your computer and network server
> immediately. Your cooperation is highly appreciated. It is advised
> that any unauthorized use of confidential information of Winbond is
> strictly prohibited; and any information in this email irrelevant to
> the official business of Winbond shall be deemed as neither given nor endorsed by Winbond.
>
