hbase-issues mailing list archives

From "Benoit Sigoure (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2323) filter.RegexStringComparator does not work with certain bytes
Date Sun, 14 Mar 2010 09:09:27 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2323:
----------------------------------

    Description: 
I'm trying to use {{RegexStringComparator}} in conjunction with {{RowFilter}}.  One of my
row keys contained the byte 0xA, which turns out to be the ASCII code for the newline character
(\n).  When the row key is converted to a string in order to use the regexp facility of the
Java standard library, it becomes a string containing two lines and my regexp does not match.

I believe the solution is to compile the regexp with the {{DOTALL}} flag.  Luckily, this flag
can be "passed" by the client by prefixing the regexp with {{(?s)}} so people working with
an older version of HBase can work around this issue without having to upgrade.
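
The failure and the {{(?s)}} workaround can be reproduced with the JDK alone, since by default {{.}} does not match line terminators.  The sketch below is only an illustration: it assumes the comparator decodes the row key as UTF-8 and tests it with {{Matcher.find()}}, and the class and method names are hypothetical (no HBase classes are involved).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of HBase.
class DotallWorkaround {

    // Assumption: mimics the comparator by decoding the key as UTF-8
    // and searching the resulting string with Matcher.find().
    static boolean keyMatches(String regex, byte[] rowKey) {
        String key = new String(rowKey, StandardCharsets.UTF_8);
        return Pattern.compile(regex).matcher(key).find();
    }

    public static void main(String[] args) {
        byte[] rowKey = { 'a', 0x0A, 'b' };  // 0x0A is '\n'

        // '.' does not match the newline by default, so this prints false.
        System.out.println(keyMatches("^a.b$", rowKey));

        // The embedded (?s) flag enables DOTALL, so this prints true.
        System.out.println(keyMatches("(?s)^a.b$", rowKey));
    }
}
```

Because the flag travels inside the pattern string, the client needs no server-side change to use it.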


Second problem: one of my row keys contained the sequence {{0x00 0x00 0x9D}} ({{0x9D}} = -99
when stored in a Java {{byte}}), but in {{compareTo}} the row key is transformed into a {{String}}
using {{Bytes.toString}}, which simply assumes that the byte array is a UTF-8 encoded string.
 Java "cleverly" substituted the 0x9D byte with 0x3F (character '?').  In my case, I want
to use the ISO-8859-1 encoding, as it preserves every byte when the byte array is converted to
a {{String}} and back to a byte array, unlike UTF-8 or ASCII.  Should we add a new method
to {{RegexStringComparator}} to allow users to specify their own {{Charset}} instance?
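
The round-trip claim can be checked with the JDK alone.  The sketch below is a hedged illustration, not a proposed patch: the class and both methods are hypothetical, and {{keyMatches}} only assumes how a comparator with a caller-supplied {{Charset}} might decode the key.  It checks whether the bytes survive the round trip, not which substitute character the decoder produces.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of HBase.
class CharsetRoundTrip {

    // True if decoding and then re-encoding with the charset yields the same bytes.
    static boolean roundTrips(byte[] raw, Charset charset) {
        byte[] back = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, back);
    }

    // Assumption: a comparator accepting a Charset might decode the key this way.
    static boolean keyMatches(String regex, byte[] rowKey, Charset charset) {
        String key = new String(rowKey, charset);
        return Pattern.compile(regex).matcher(key).find();
    }

    public static void main(String[] args) {
        byte[] rowKey = { 0x00, 0x00, (byte) 0x9D };

        // A lone 0x9D is not valid UTF-8, so the byte is lost: prints false.
        System.out.println(roundTrips(rowKey, StandardCharsets.UTF_8));

        // ISO-8859-1 maps each byte to exactly one char: prints true.
        System.out.println(roundTrips(rowKey, StandardCharsets.ISO_8859_1));

        // With ISO-8859-1 the regexp can address the raw byte values: prints true.
        String regex = "(?s)^\\x00\\x00\\x9D$";
        System.out.println(keyMatches(regex, rowKey, StandardCharsets.ISO_8859_1));
    }
}
```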

  was:
I'm trying to use {{RegexStringComparator}} in conjunction with {{RowFilter}}.  One of my
row keys contained the byte 0xA, which turns out to be the ASCII code for the newline character
(\n).  When the row key is converted to a string in order to use the regexp facility of the
Java standard library, it becomes a string containing two lines and my regexp does not match.

I believe the solution is to compile the regexp with the {{DOTALL}} flag.  Luckily, this flag
can be "passed" by the client by prefixing the regexp with {{(?s)}} so people working with
an older version of HBase can work around this issue without having to upgrade.

        Summary: filter.RegexStringComparator does not work with certain bytes  (was: filter.RegexStringComparator
does not work in presence of the byte 0xA)

> filter.RegexStringComparator does not work with certain bytes
> -------------------------------------------------------------
>
>                 Key: HBASE-2323
>                 URL: https://issues.apache.org/jira/browse/HBASE-2323
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: filters
>    Affects Versions: 0.20.3
>            Reporter: Benoit Sigoure
>            Assignee: Benoit Sigoure

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

