jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "JCR Binary Usecase" by IanBoston
Date Thu, 15 Sep 2016 08:53:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "JCR Binary Usecase" page has been changed by IanBoston:

Added comments to some of the use cases.

  For this feature to work, the web layer, such as Sling, needs to know the path to the binary.
Note that the path is not disclosed to the client. 
- To an extent this feature is similar to UC1 however here the scope is more broader 
+ To an extent this feature is similar to UC1, however here the scope is broader.
+ NB Although mod_xsendfile in Apache needs a path on a file system local to the Apache
instance to retrieve the binary, that path is only a pointer. It does not need to be the same
path that Oak uses internally, provided it can be presented as a path on the file system.
Other variants of X-Sendfile (https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/)
allow that pointer to be resolved to an HTTP location for streaming. 
  === UC9 - S3 datastore in a cluster ===
@@ -120, +122 @@

  How to ensure Oak GC doesn't delete the binary too early. One solution is that if the native
library reads the file (or knows it will need to read the file soon), it updates the last
modified time. This should already work. Another solution might be to add the file name to
a text file (read log), but it would probably be more complicated, and probably wouldn't improve
performance much.
+ '''Ian'''
+ This use case needs to cover file systems and other storage mechanisms such as S3. Controlling
access is outside what Oak can enforce and depends on the deployment team.
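The "update the last modified time" solution described above can be sketched for the file-system case. This is a minimal illustration, not Oak's actual DataStore code; the class and method names are hypothetical:

```java
import java.io.File;
import java.io.IOException;

// Sketch: a consumer that reads a blob directly (or knows it will soon)
// refreshes the file's last-modified time so DataStore garbage collection
// treats the blob as recently used and does not delete it too early.
// File layout and GC window are assumptions, not Oak internals.
public class TouchBlob {

    // Returns true if the timestamp was successfully updated.
    static boolean touch(File blob) {
        return blob.setLastModified(System.currentTimeMillis());
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("blob", ".bin");
        // Make the file look a day old, then refresh it.
        f.setLastModified(System.currentTimeMillis() - 86_400_000L);
        long before = f.lastModified();
        touch(f);
        System.out.println("refreshed: " + (f.lastModified() > before));
        f.delete();
    }
}
```

As the text notes, this should already work with the existing `FileDataStore`, since GC decisions are based on the file timestamps.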
  === UC3 - Text Extraction without temporary File with Tika ===
@@ -130, +135 @@

  Similar to Tika, we could extend the binary and add the missing / required features (for
example get a `ByteBuffer`, which might be backed by a memory mapped file).
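The "extended binary" idea above — a `ByteBuffer` backed by a memory-mapped file — can be sketched with standard NIO. The method name is hypothetical, not an existing Oak API:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: expose a file-backed blob as a read-only memory-mapped
// ByteBuffer, so a parser like Tika can read it without copying the
// content to a temporary file first.
public class MappedBinary {

    static MappedByteBuffer asByteBuffer(Path blobFile) throws IOException {
        try (FileChannel ch = FileChannel.open(blobFile, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```

This only works when the underlying DataStore is file-backed; for a remote store like S3 a different mechanism would be needed.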
+ '''Ian'''
+ Transfers should drill down to the underlying stream to see if it supports NIO, and use
it if present. For example, the file-system DataStore supports NIO, as does S3, and the Jetty
stream also supports NIO, so it should be possible for a servlet to get hold of both streams
and connect the channels. This requires that the streams are available directly, which in turn
requires the rest of the implementation to be efficient enough not to need local caching and
copies of the files. There are a number of issues in Sling that need addressing first, some of
which are being worked on: streaming uploads, streamed downloads to IE11, etc. I don't think
adding NIO capabilities to streams that don't natively support NIO is the right solution; it
would only hide a more fundamental issue. The biggest issue (imvho) is that a JCR Binary doesn't
provide an OutputStream and an InputStream directly connected to the raw underlying storage,
blocking the client from performing the zero-cost transfers available to most other stacks. 
  === UC5 - Transferring the file to `FileDataStore` with minimal overhead ===
@@ -148, +156 @@

  The Oak `BlobStore` chunks binaries, so that chunks could be shared among binaries. Random
writes could then copy just the references to the chunks if possible. That would make random
write relatively efficient, but still binaries would be immutable. We would need to add the
required API. Please note this only works when using the `BlobStore`, not the current `FileDataStore`
and `S3DataStore` as is (at least a wrapper around the `FileDataStore` / `S3DataStore` would
be needed). This includes efficiently cutting away some bytes in the middle of a binary, or
inserting some bytes. Typical file systems don't support this case efficiently, however with
the BlobStore it is possible.
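The chunk-sharing idea above can be sketched as copy-on-write over a list of chunk references. The chunk ids here are stand-ins for the `BlobStore`'s content hashes; this is an illustration, not the real Oak API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a binary is a list of chunk references. A "random write" that
// replaces one chunk copies only the reference list and swaps a single
// entry; every other chunk stays shared between the old and new binary,
// so both remain immutable.
public class ChunkedBinary {

    final List<String> chunkIds;

    ChunkedBinary(List<String> chunkIds) {
        this.chunkIds = chunkIds;
    }

    // Copy-on-write: the returned binary shares all chunks except the
    // one at the given index.
    ChunkedBinary withChunk(int index, String newChunkId) {
        List<String> copy = new ArrayList<>(chunkIds);
        copy.set(index, newChunkId);
        return new ChunkedBinary(copy);
    }
}
```

Inserting or cutting bytes mid-binary would, in the same spirit, splice the reference list rather than rewrite the content, which is why the `BlobStore` can support this where plain file systems cannot.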
+ === UC8 - X-SendFile ===
+ '''Ian'''
+ The aim of X-Sendfile is to offload the streaming of large binaries from an expensive server
capable of performing complex authorization to a less expensive farm of servers capable of
streaming data to a large number of clients. The X-Sendfile header provides a location where
the upstream proxy can find the response. That location has to be resolvable by the upstream
server, it may contain authZ for the response, and it must not divulge the structure of the
store or its neighbouring resources. What can be achieved depends on the implementation of the
upstream server's X-Sendfile capability. The Apache mod_xsendfile module only supports mapping
the location to the filesystem, so the DS would have to be mounted. Other X-Sendfile implementations,
like nginx's X-Accel-Redirect, which created the concept, support mapping the location through
to any URI location, including HTTP locations. This would allow the X-Accel-Redirect location
to be mapped through to an HTTP location capable of serving >C10K requests, all streaming.
In AWS, signed URLs are supported, so if an S3 store needed to be exposed, the X-Accel-Redirect
location could be an S3 bucket location fronted by an ELB configured to only allow access to
signed requests conforming to an ACL policy, that policy including token expiry. Other variants
of this are possible, including requiring signed URLs and hosting the content behind an elastic
farm of Node.js/Jetty or any C10K-capable server, each one validating the signature and token
on every request from the nginx front end. To achieve this, Oak or Sling would need to expose
the pointer to the binary and document a signing structure giving access to that binary. If
the identifier of the Binary is already exposed via JCR properties, this may already be possible,
with knowledge of the DS, without any changes to Oak.
+ Documentation on AWS signed URLs (CloudFront) is here: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-signed-urls.html
+ Documentation on nginx's original concept is here: https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/
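The "signing structure" mentioned above could look like the following sketch: the authorizing server emits an internal-redirect location carrying an HMAC over the blob pointer and an expiry, which the C10K front end validates on every request. The `/protected/` prefix, token layout, and method name are assumptions, not an Oak or Sling API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch: build a signed location suitable for an X-Accel-Redirect
// header. The front end recomputes the HMAC over pointer|expiry with
// the shared key and rejects mismatched or expired tokens.
public class SignedRedirect {

    static String sign(String blobPointer, long expiresEpochSec, byte[] key) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        String payload = blobPointer + "|" + expiresEpochSec;
        String sig = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        return "/protected/" + blobPointer + "?expires=" + expiresEpochSec + "&sig=" + sig;
    }
}
```

The same signature scheme works whether the protected location is served by nginx from disk or maps through to an S3 bucket behind a load balancer.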

  === UC9 - S3 datastore in a cluster ===
  Possible solutions are: (1) disable async upload; (2) use a shared local cache (NFS for
example), though it is not certain this works correctly. Other solutions (which would probably
need more work) could be to send / request the binary using the broadcasting cache mechanism.
+ '''Ian'''
+ This has been partially addressed with streaming uploads in Sling Engine 2.4.6 and Sling
Servlets Post 2.3.14. When async upload is disabled, session.save() connects the request
InputStream for the upload directly to the S3 OutputStream, performing the transfer with no
local disk IO, using a byte[]. As noted under UC4, this should be done over NIO wherever possible.
Downloads of the binary also need to be streamed in a similar way. Local disk IO is reported
to be an expensive commodity by those deploying Sling/AEM at scale.
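The byte[]-based transfer described above amounts to pumping the request stream straight into the store's stream with no intermediate file. A minimal sketch, with a hypothetical class name standing in for the Sling/Oak plumbing:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch: on save(), the upload's request InputStream is copied directly
// to the DataStore's OutputStream (e.g. an S3 upload stream) through a
// fixed byte[] buffer, so nothing touches local disk.
public class StreamingUpload {

    static long pump(InputStream request, OutputStream store) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = request.read(buf)) != -1) {
            store.write(buf, 0, n);
            total += n;
        }
        store.flush();
        return total;
    }
}
```

Replacing this user-space loop with channel-to-channel NIO, as suggested under UC4, is the natural next step where both ends expose channels.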
