hadoop-hdfs-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7337) Configurable and pluggable Erasure Codec and schema
Date Fri, 13 Mar 2015 02:38:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359825#comment-14359825 ]

Kai Zheng commented on HDFS-7337:

Thanks [~zhz] for the review and thoughts.
bq.ErasureCodec is like a factory or an utility class, which creates ErasureCoder and BlockGrouper
based on ECSchema
{{ErasureCodec}} would be the high-level construct in the framework that covers all the potential
erasure-code-specific aspects, including but not necessarily limited to {{ErasureCoder}} and
{{BlockGrouper}}, so that a new code can be implemented and deployed as a whole. All
the underlying code-specific logic is hooked in via the codec and is accessible only through
the codec. I understand there will be more to think about, but this is generally one of the
major goals of the framework.
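As a rough illustration of that layering (not taken from the attached prototype patches), the codec could be the single entry point that hands out a coder and a block grouper. All method names and signatures below, and the toy XOR codec, are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and signatures are illustrative, not the HDFS-7337 API.
interface ErasureCoder {
    // Computes parity units from equally sized data units.
    byte[][] encode(byte[][] dataUnits);
}

interface BlockGrouper {
    // Splits block IDs into groups of data blocks that are coded together.
    List<List<Long>> group(List<Long> blockIds);
}

// The codec is the single entry point to all code-specific logic.
interface ErasureCodec {
    String getName();
    ErasureCoder createCoder(int numDataUnits, int numParityUnits);
    BlockGrouper createBlockGrouper(int numDataUnits);
}

// Toy codec with a single XOR parity unit, only to exercise the interfaces.
class XorCodec implements ErasureCodec {
    public String getName() {
        return "xor";
    }

    public ErasureCoder createCoder(int numDataUnits, int numParityUnits) {
        if (numDataUnits < 1 || numParityUnits != 1) {
            throw new IllegalArgumentException("XOR needs >= 1 data unit and exactly 1 parity unit");
        }
        return dataUnits -> {
            if (dataUnits.length != numDataUnits) {
                throw new IllegalArgumentException("Expected " + numDataUnits + " data units");
            }
            byte[] parity = new byte[dataUnits[0].length];
            for (byte[] unit : dataUnits) {
                for (int i = 0; i < parity.length; i++) {
                    parity[i] ^= unit[i];
                }
            }
            return new byte[][] { parity };
        };
    }

    public BlockGrouper createBlockGrouper(int numDataUnits) {
        return blockIds -> {
            List<List<Long>> groups = new ArrayList<>();
            for (int i = 0; i < blockIds.size(); i += numDataUnits) {
                int end = Math.min(i + numDataUnits, blockIds.size());
                groups.add(new ArrayList<>(blockIds.subList(i, end)));
            }
            return groups;
        };
    }
}
```

With this shape, a new code ships as one {{ErasureCodec}} implementation, and callers never touch a coder or grouper except through it.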
bq.I think we can leverage the pattern of BlockStoragePolicySuite
It's a good pattern. {{ErasureCodec}} follows another good pattern, {{CompressionCodec}}.

bq. Something like:...your illustration codes...
I understand we need to hard-code a default schema for the system. What we have discussed
and been doing is to allow EC schemas to be predefined in an external file (currently XML,
as we regularly do in the project). For easy reference, a unique schema name (string) and a codec
name (string) are used. Do you have any concerns about this approach?
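To make the idea of predefined schemas concrete, here is a minimal sketch of loading schema definitions from XML, each identified by a schema name and referring to a codec by name. The element names, attribute names, option keys, and the sample schemas are all invented for illustration; the actual file format is not specified in this comment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical schema holder; field names are assumptions for illustration.
class ECSchemaSketch {
    final String schemaName;
    final String codecName;
    final Map<String, String> options = new LinkedHashMap<>();

    ECSchemaSketch(String schemaName, String codecName) {
        this.schemaName = schemaName;
        this.codecName = codecName;
    }
}

class SchemaLoaderSketch {
    // Invented sample content; the real schema file layout may differ.
    static final String SAMPLE_XML =
        "<schemas>"
        + "<schema name='RS-6-3' codec='rs'>"
        + "<option name='numDataUnits' value='6'/>"
        + "<option name='numParityUnits' value='3'/>"
        + "</schema>"
        + "<schema name='XOR-2-1' codec='xor'>"
        + "<option name='numDataUnits' value='2'/>"
        + "<option name='numParityUnits' value='1'/>"
        + "</schema>"
        + "</schemas>";

    // Parses schema definitions from XML text into a map keyed by schema name.
    static Map<String, ECSchemaSketch> load(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid schema XML", e);
        }
        Map<String, ECSchemaSketch> schemas = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("schema");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            ECSchemaSketch schema =
                new ECSchemaSketch(e.getAttribute("name"), e.getAttribute("codec"));
            NodeList opts = e.getElementsByTagName("option");
            for (int j = 0; j < opts.getLength(); j++) {
                Element o = (Element) opts.item(j);
                schema.options.put(o.getAttribute("name"), o.getAttribute("value"));
            }
            schemas.put(schema.schemaName, schema);
        }
        return schemas;
    }
}
```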
bq.Then NN can just pass around the schema ID when communicating with DN and client, which
is much smaller than an ErasureCodec object.
Yes, similarly, the idea is to pass around the schema NAME between any pair among NN, DN, and
client. It is not meant to pass an ErasureCodec object. Is there a confusing sentence I need to
clarify in the doc? All the {{ErasureCodec}}s are loaded through core-site configuration or
service locators and kept in a map with the codec name as the key. Given the codec name, a codec
is fetched from the map. The codec object doesn't need to be passed around; the codec name does.
I guess you mean the schema object. In the f2f meetup discussion with [~jingzhao], we mentioned
that we may need to pass around the schema object. If we don't want to hard-code all the schemas,
then I guess we need to pass the schema object.
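The name-keyed lookup described above can be sketched as a small registry. This is only an illustration under assumptions: the class and method names are hypothetical, and {{java.util.ServiceLoader}} is used as one possible service-locator mechanism, which this comment does not confirm as the chosen one.

```java
import java.util.Map;
import java.util.ServiceLoader;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a codec; only the name matters for lookup here.
interface NamedCodec {
    String getName();
}

class CodecRegistrySketch {
    private final Map<String, NamedCodec> codecs = new ConcurrentHashMap<>();

    // Registers a codec under its name; duplicate names are rejected.
    void register(NamedCodec codec) {
        if (codecs.putIfAbsent(codec.getName(), codec) != null) {
            throw new IllegalStateException("Duplicate codec name: " + codec.getName());
        }
    }

    // One possible service-locator path (an assumption): picks up any
    // NamedCodec providers declared under META-INF/services on the classpath.
    void registerDiscovered() {
        for (NamedCodec codec : ServiceLoader.load(NamedCodec.class)) {
            register(codec);
        }
    }

    // Resolves a codec from the name that was passed around instead of the object.
    NamedCodec get(String codecName) {
        NamedCodec codec = codecs.get(codecName);
        if (codec == null) {
            throw new IllegalArgumentException("Unknown codec: " + codecName);
        }
        return codec;
    }
}
```

Each node would hold its own registry, so only the short name string has to travel between NN, DN, and client; the receiver resolves it locally.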

> Configurable and pluggable Erasure Codec and schema
> ---------------------------------------------------
>                 Key: HDFS-7337
>                 URL: https://issues.apache.org/jira/browse/HDFS-7337
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Zhe Zhang
>            Assignee: Kai Zheng
>         Attachments: HDFS-7337-prototype-v1.patch, HDFS-7337-prototype-v2.zip, HDFS-7337-prototype-v3.zip,
PluggableErasureCodec-v2.pdf, PluggableErasureCodec.pdf
> According to HDFS-7285 and the design, this JIRA considers supporting multiple erasure codecs
via a pluggable approach. It allows multiple codec schemas to be defined and configured with different
coding algorithms and parameters. The resulting codec schemas can be used and specified
via a command tool for different file folders. While designing and implementing such a pluggable
framework, we will also implement a concrete codec by default (Reed Solomon) to prove that the
framework is useful and workable. A separate JIRA could be opened for the RS codec implementation.
> Note HDFS-7353 will focus on the very low-level codec API and implementation to make
concrete vendor libraries transparent to the upper layer. This JIRA focuses on the high-level
aspects that interact with configuration, schemas, etc.
