hadoop-common-issues mailing list archives

From "BELUGA BEHR (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-14525) org.apache.hadoop.io.Text Truncate
Date Tue, 13 Jun 2017 17:31:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HADOOP-14525:
---------------------------------
    Description: 
For Apache Hive, VARCHAR fields are much slower than STRING fields when a precision (a cap on string length) is specified.  Keep in mind that this precision is counted in characters (Unicode code points), not in UTF-8 bytes.

The general procedure is (a Java sketch of this path follows the list):

# Load an entire byte buffer into a {{Text}} object
# Convert it to a {{String}}
# Count off the first N character code points
# Substring the {{String}} at the correct position
# Convert the {{String}} back into a byte array and populate the {{Text}} object
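
For illustration, a minimal sketch of this copy-heavy path against the existing {{Text}} API; the class and method names ({{VarcharPrecision}}, {{enforcePrecision}}) are made up for the example and are not part of Hadoop or Hive:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;

public final class VarcharPrecision {

  /**
   * Truncates {@code text} to at most {@code maxChars} Unicode code points,
   * following the copy-heavy steps listed above.
   */
  public static void enforcePrecision(Text text, int maxChars) {
    // Steps 1-2: decode the Text's UTF-8 bytes into a String (first copy).
    String s = text.toString();
    // Step 3: count code points, not chars, so surrogate pairs count once.
    if (s.codePointCount(0, s.length()) <= maxChars) {
      return; // within precision; nothing to do
    }
    // Step 4: find the char index of the maxChars-th code point and substring.
    int end = s.offsetByCodePoints(0, maxChars);
    // Step 5: re-encode to UTF-8 and overwrite the Text (second copy).
    text.set(s.substring(0, end).getBytes(StandardCharsets.UTF_8));
  }
}
{code}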

It would be great if the {{Text}} object could offer a truncate/substring method based on character count that did not require copying data around.  Along the same lines, a {{getCharacterLength()}} method may also be useful to determine if the precision has been exceeded.
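
A copy-free version could walk the backing byte array directly, since each UTF-8 lead byte encodes its sequence's length.  The sketch below only illustrates the idea, assuming well-formed UTF-8 input; {{characterLength}} and {{truncate}} are hypothetical stand-ins for the proposed methods, written here as static helpers:

{code:java}
import org.apache.hadoop.io.Text;

public final class TextTruncateSketch {

  /** Length in bytes of the UTF-8 sequence starting with lead byte {@code b}. */
  private static int seqLength(byte b) {
    int i = b & 0xFF;
    if (i < 0x80) return 1; // 0xxxxxxx: ASCII
    if (i < 0xE0) return 2; // 110xxxxx
    if (i < 0xF0) return 3; // 1110xxxx
    return 4;               // 11110xxx
  }

  /** Stand-in for getCharacterLength(): code points in {@code text}, no decoding. */
  public static int characterLength(Text text) {
    byte[] bytes = text.getBytes(); // backing array, valid up to getLength()
    int chars = 0;
    for (int pos = 0; pos < text.getLength(); pos += seqLength(bytes[pos])) {
      chars++;
    }
    return chars;
  }

  /** Stand-in for truncate(): keep at most {@code maxChars} code points. */
  public static void truncate(Text text, int maxChars) {
    byte[] bytes = text.getBytes();
    int pos = 0;
    int chars = 0;
    while (pos < text.getLength() && chars < maxChars) {
      pos += seqLength(bytes[pos]);
      chars++;
    }
    if (pos < text.getLength()) {
      // Shrink to the byte boundary of the last kept code point; re-setting
      // from the same backing array avoids the decode/encode round trip.
      text.set(bytes, 0, pos);
    }
  }
}
{code}

With something like {{characterLength()}}, the precision check becomes a single pass over the bytes, and {{truncate()}} never leaves the byte domain at all.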

  was:
For Apache Hive, VARCHAR fields are much slower than STRING fields when a precision (a cap on string length) is specified.  Keep in mind that this precision is counted in characters (Unicode code points), not in UTF-8 bytes.

The general procedure is:

# Load an entire byte buffer into a {{Text}} object
# Convert it to a {{String}}
# Count off the first N character code points
# Substring the {{String}} at the correct position
# Convert the {{String}} back into a byte array and populate the {{Text}} object

It would be great if the {{Text}} object could offer a truncate/substring method based on character count that did not require copying data around.


> org.apache.hadoop.io.Text Truncate
> ----------------------------------
>
>                 Key: HADOOP-14525
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14525
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 2.8.1
>            Reporter: BELUGA BEHR



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

