hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7240) Object store in HDFS
Date Thu, 02 Nov 2017 17:59:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236271#comment-16236271

Steve Loughran commented on HDFS-7240:

Anu, I've got some more comments. Given the size of this JIRA, & no of watchers, I'm going
to suggest a low-level "merge HDFS-7240 in JIRA"where we can discuss the low level code details,
and attach a .patch of the entire thing for Jenkins/yetus to handle. This is how we've done
the s3guard work and it helps split code issues from more strategic things like Constantin.

For large scale tests, make sure you have test which scale and the test runner timeout designed
to support multi hour tests. AWS s3 now supports multi-TB files through multipart uploads;
we do run multi-GB long-haul uploads as part of the release process as follows: run

Now, more client side comments. I think this is my client side done, and its more at the HTTP
REST API level than anything else

h3. OzoneFileSystem

* implement getDefaultBlockSize(); add a config option to let people set it. add a sensible
default like 64 or 128 MB.
* you could implement {{copyFromLocalFile}} and {{copyToLocalFile}} trivially using bucket.putKey(dst,
path) & bucket.getKey(). This lines you up for HADOOP-14766, which is a high performance
upload from the local FS to a store

h3. org.apache.hadoop.ozone.web.ozShell.Shell

# {{dispatch}} always returns "1", so doesn't differentiate success (0) from any other error.
# run(String[] args) doesn't check for `parseHandler` returning null; will NPE on parse error,
when normally you'd want to print a usage command.
# Shell bucket options should return some explicit usage erorr

Have a look at what we've done in {{org.apache.hadoop.fs.s3a.s3guard.S3GuardTool}} w.r.t raising
* Any tool (for you: handler) can raise a UnknownOptionException which triggers the usage()
command, no stack trace printed. 
* Any `ExitUtils.ExitException` sets the exit code of the 
* we use `ExitUtil.terminate(status, text)` to exit the shell in main(); which can be turned
off, so that tests
can invoke the full CLI, verify failure modes

Tests: I don't see any CLI tests. Look at {{TestS3GuardCLI}} to see how we do CLI testing,
lookin gfor those ExitExceptions & asserting on the return value,
as well as grabbing all the string output (which can be printed to a string as well as stdout)
& checking that.

h3. OzoneRestClient

Need to plan for a common failure mode being wrong endpoint, possibly returning plaintext
or HTML error messages. Causes include: client config, proxy things. Also failures where the
response is cut off partway through the read. Content-length is your friend here.

* All HTTP requests MUST verify content-type of response. Otherwise a GET of an http page
will return 200 & trigger a parse failure, when its probably "you just got the the URL
* Error handling should handle content: text/plain or text/html and build a string from it.
Why? Jetty &c can raise their own exceptions and they will return text.
* Probably good to print the URL being used here on HTTP failures, as it helps debug foundational
config issues of "Wrong endpoint"
* Anything downloading to a string before parsing should look for a Content-Length header,
and, if set (1) verify that it's within a sensible range & (2) use it to size the download.
Actually, EntityUtils.toString(entity) does that, but it limits the size of a response to
4KB for that reason. Are you confident that all responses parsed that way will be <= 4KB
long? I'd write tests there.
* You can parse JSON straight off an InputStream: do that for any operation which can return
large amounts of data.
* why does {{setEndPointURI(URI)}} not just use {{Preconditions.checkArgument}} and throw
an InvalidArgumentException?
* for consistency {{setEndPoint(String clusterFQDN)}}  needs a {{Preconditions.checkArgument(Strings.isNullOrEmpty(clusterFQDN))}}.
* delete(). Allow for the full set of delete responses (200, 204). Less critical against your
own endpoint, but still best practise.
* putKey. It's very inefficient to generate the checksum before the upload, as it means 1x
scan of the data before the upload. This won't scale efficiently to multiGB uploads. What
about: calculate during the put and have it returned by the endpoint; client to verify the
resulting value. This does mean that a corrupted PUT will overwrite the dest data.
* If you do use the checksum, verify that the result of an invalid checksum on PUT returns
an exception which is handled in OzoneBucket.executePutKey(), that is: 400. 422 or whatever
is sent back by the web engine. Of course, invalid checksum must trigger a failure.
* listVolumes: what's the max # of volumes? Can all volumes be returned in a single 4KB payload
* getKey: verify content type; require content-length & use when reading file. Otherwise
you won't pick up broken GET
* Add a LOG.debug() before every HTTP request for debugging of live systems.

I see handling pooled connections is on the TODO list. Be good to have, obviously. It will
make a significant difference on metadata ops, List etc. The big perf killer here is getKeyInfo,
which is executed 1 or 2 times on OzoneFileSystem.getFileStatus(), and that is used throughout
the hadoop stack on the assumption it's a low cost operation. Here it involves at worst setting
up two HTTP connections. At least here they are not HTTPS connections, which are much more

h3. Ozone Bucket

* OzoneBucket.executeListKeys: again, needs to move to direct JSON parse to handle response
> 4GB
* OzoneBucket.listKeys: javadocs to indicate "number of entries to return" to distinguish
from response length in chars
* OzoneBucket.executeGetKeyInfo. 

h3. REST protocol itself

* I'd have expected KeyInfo to be a HEAD with the fields returned as different headers to
parse. That way the same headers can be returned on a GET, and its consistent with the RESTy
way. Many of the keyinfo fields are the standard HTTP headers anyway (Content-Length, checksum,
last modified)
* And its potentially useful to be able to support custom headers in future.

h3. Testing with TestOzoneRpcClient

* Add a rule for timeouts; critical here
* Parameterize it for both REST and RPC protocols, so guaranteeing equivalent behaviour &
identify regressions.
* TestOzoneRpcClient to use LambdaTestUtils.intercept() to catch the IOE and check the message.
It will (correctly) fail when an operation doesn't raise an exception, and include the output
of the operation instead. {{LambdaTestUtils.intercept(IOException.class, "VOLUME_ALREADY_EXISTS",
() -> Store.createVolume(volumeName))}}

Tests to add 

* Add a test where the OzoneRestClient is pointed at a MiniDFSCluster RPC endpoint, e,g http://localhost:8088/
to verify it fails meaningfully
* Add a test where it's pointed at the miniDFS cluster web UI, again, expect failure''
* Create a volume with an odd name like '%20' or '&'; see what happens.
* Create lots of volumes with long names. List them. 
* Create lots of long keys, verify that listing works

> Object store in HDFS
> --------------------
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>            Priority: Major
>         Attachments: HDFS-7240.001.patch, HDFS-7240.002.patch, HDFS-7240.003.patch, HDFS-7240.003.patch,
HDFS-7240.004.patch, Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, ozone_user_v0.pdf
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a generic storage
layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the
storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode storage, but
independent of namespace metadata.
> I will soon update with a detailed design document.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

View raw message