jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "JCR Binary Usecase" by ChetanMehrotra
Date Fri, 27 May 2016 05:27:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "JCR Binary Usecase" page has been changed by ChetanMehrotra:

New page:
'''NOTE''' - Draft and work in progress

Below are a few usecases which have been seen in the past and which cannot be met with the
current Oak Binary support. All of them aim to improve performance by reducing IO where
possible. Feature-wise, the current stream-based approach supported by JCR Binary meets all
requirements. The objective of this document is to capture such usecases and then come up with
ways/solutions to meet them. Which of the requirements below we should try to meet is still to
be discussed; the objective here is to have some usecases to initiate the discussion.

Some implementation details per the current design:

 1. S3DataStore
  a. While performing any read, the stream from the S3Object is first copied to a local cache
file and then a FileInputStream is provided from that
  a. Due to the above, even if the code only needs to read the initial few bytes (say video
metadata), the whole file is first spooled to a file in the local cache and then a stream is
opened on that
 1. FileDataStore
  a. The files are stored in a directory structure like /xx/yy/zz/<contenthash> where
xx, yy, zz are the initial few letters of the hex-encoded content hash
  a. Upon writing, the stream is first written to a temporary file which is then renamed.
In case of an NFS-based DataStore this essentially means the file is written twice! This
design problem was solved with FileBlobStore in Oak, but that is not being used in production,
so it is something we need to live with
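The /xx/yy/zz/<contenthash> layout described in 2a can be sketched in a few lines. This is an
illustration only, not the actual Jackrabbit implementation; in particular the SHA-256 digest
used here is an assumption for the example, and the real DataStore may use a different digest
algorithm.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class DataStorePathLayout {

    /** Hex-encode a byte array. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Map a content hash to the /xx/yy/zz/<contenthash> layout described above,
     * splitting off the first three pairs of hex characters as directory levels.
     */
    static Path layoutPath(Path root, String contentHash) {
        return root.resolve(contentHash.substring(0, 2))
                   .resolve(contentHash.substring(2, 4))
                   .resolve(contentHash.substring(4, 6))
                   .resolve(contentHash);
    }

    public static void main(String[] args) throws Exception {
        // SHA-256 is illustrative only; the actual digest depends on the DataStore.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest("hello".getBytes("UTF-8"));
        System.out.println(layoutPath(Paths.get("/datastore"), hex(digest)));
    }
}
```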

Currently the JCR Binary interface only allows InputStream-based access to the binary content.
In certain deployments using a particular type of BlobStore, like FileDataStore or
S3DataStore, it is desirable that a more optimal path can be leveraged where possible.

=== UC1 - Image Rendition generation ===

''Need access to the absolute path of the file which backs a JCR Binary when using
FileDataStore, for processing by a native program''

DataStore - FileDataStore

There are deployments where lots of images get uploaded to the repository and some
conversions (rendition generation) are performed by OS-specific native executables. Such
programs work directly on a file handle.

Without this change we currently need to first spool the binary content to some temporary
location and then pass that to the other program. This adds unnecessary overhead which could
be avoided when a FileDataStore is being used, since there we could provide direct access to
the backing file.
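To make the overhead concrete, here is a minimal sketch of the spooling step that direct file
access would eliminate. In a real deployment the stream would come from
javax.jcr.Binary#getStream(), and the native tool invocation shown in the comment is
hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenditionSpooling {

    /**
     * Current approach: spool the binary stream to a temporary file so that a
     * native executable can be pointed at a real path. This extra copy is what
     * direct access to the backing FileDataStore file would avoid.
     */
    static Path spoolToTempFile(InputStream binaryStream) throws Exception {
        Path tmp = Files.createTempFile("rendition-src-", ".bin");
        try (InputStream in = binaryStream) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = spoolToTempFile(new ByteArrayInputStream("image-bytes".getBytes("UTF-8")));
        // new ProcessBuilder("convert", tmp.toString(), "out.png").start(); // hand off to native tool
        System.out.println(Files.size(tmp));
        Files.delete(tmp);
    }
}
```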

=== UC2 - Efficient replication across regions in S3 ===

''For binaryless replication across multiple regions with a non-shared DataStore, need access
to the ID of the S3 object backing the blob so that it can be efficiently copied to a bucket
in a different region via the S3 copy operation''

DataStore - S3DataStore

This is for setups running Oak with an S3DataStore. There we have a global deployment where
the author instance runs in one region and binary content is to be distributed to publish
instances running in different regions. The DataStore size is huge, say 100TB, and for
efficient operation we need to use binaryless replication. In most cases only a very small
subset of the binary content would need to be present in other regions. The current way to
support that (via a shared DataStore) would involve synchronizing the S3 bucket across all
such regions, which would increase the storage cost considerably.

Instead, the plan is to replicate the specific assets via the S3 copy operation. This would
ensure that big assets can be copied efficiently at the S3 level.
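As a sketch, such a server-side copy with the AWS SDK for Java (v1) could look like the
following. The bucket names, object key, and region are all hypothetical, and this assumes the
S3 object ID backing the blob can be resolved to a bucket/key pair; running it would require
valid AWS credentials.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CrossRegionBinaryCopy {
    public static void main(String[] args) {
        // Hypothetical names, for illustration only.
        String srcBucket = "author-datastore";      // bucket in the author region
        String dstBucket = "publish-datastore-eu";  // bucket in the publish region
        String key = "binary-object-key";           // S3 object ID backing the blob

        // Client pointed at the destination region; the copy happens server side
        // within S3, so the binary never flows through this JVM.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("eu-west-1")
                .build();
        s3.copyObject(srcBucket, key, dstBucket, key);
    }
}
```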

=== UC3 - Text Extraction without temporary File with Tika ===

''Avoid creation of temporary file where possible''

While performing text extraction with Tika, in many cases a temporary file is created because
many parsers need random access to the binary. So when using a BlobStore implementation where
the binary exists as a file, we can use a TikaInputStream backed by that file, which avoids
creating such a temporary file and thus speeds up text extraction.

Going forward, if we make use of [[https://issues.apache.org/jira/browse/TIKA-416|Out
of Process Text Extraction]] then this aspect would be useful there as well.
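A sketch of what this could look like, assuming Tika is on the classpath and the BlobStore can
expose the path of the backing file (how that path is resolved is left open here):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class FileBackedTextExtraction {

    /**
     * Extract text from a binary that is already available as a file.
     * TikaInputStream.get(Path) is file backed, so parsers that need random
     * access can use the file directly instead of spooling a temporary copy.
     */
    static String extractText(Path backingFile) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        try (TikaInputStream in = TikaInputStream.get(backingFile)) {
            new AutoDetectParser().parse(in, handler, new Metadata());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Path of the backing file, e.g. resolved from the BlobStore for a blob id.
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```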

=== UC4 - Spooling the binary content to socket output via NIO ===

''Enable use of NIO based zero copy file transfers''

DataStore - S3DataStore, FileDataStore

For some time [[https://github.com/eclipse/jetty.project/blob/a12fd9ea033678b51158949a886792b74b42d0a9/examples/embedded/src/main/java/org/eclipse/jetty/embedded/FastFileServer.java|Jetty
has support]] for doing async IO and performing [[http://www.ibm.com/developerworks/library/j-zerocopy/|zero
copy]] file transfers. This would allow transferring the file content to the HTTP socket
without it passing through the JVM, which should improve throughput.

The key aspect here is that, where possible, we should be able to avoid IO through the JVM.
Also have a look at the [[https://kafka.apache.org/08/design.html#maximizingefficiency|Kafka
design]], which tries to make use of the OS cache as much as possible and avoids IO via the
JVM where it can, thus providing much better throughput.
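A minimal sketch of the zero-copy primitive involved, using java.nio's
FileChannel#transferTo; in the Jetty case the target would be the HTTP response's socket
channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyTransfer {

    /**
     * Transfer a whole file to the target channel via FileChannel#transferTo.
     * Where the platform supports it, the kernel moves the bytes directly
     * (e.g. via sendfile) without copying them through JVM heap buffers.
     */
    static long transfer(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long transferred = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (transferred < size) {
                transferred += source.transferTo(transferred, size - transferred, target);
            }
            return transferred;
        }
    }
}
```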

=== UC5 - Transferring the file to FileDataStore with minimal overhead ===

''Need a way to construct a JCR Binary via a File reference where ownership of the File
instance is "transferred", say via rename, without spooling its content again''

DataStore - FileDataStore

In some deployments a customer would typically upload lots of files to an FTP folder, from
where the files are transferred to Oak. As mentioned in 2b above, with NAS-based storage this
would result in the file being copied twice. To avoid the extra overhead it would be helpful
if one could create the file directly on the NFS as per the FileDataStore structure (content
hash -> 3-level split) and then add the Binary via the ReferenceBinary approach.
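A sketch of the move-into-place step, not the actual Jackrabbit implementation: the digest is
computed over the already-uploaded file, the file is renamed into the hash-based layout, and
(as noted in the final comment) the hash would then be handed to the repository via the
ReferenceBinary approach. SHA-256 is an illustrative assumption; the real DataStore's digest
algorithm may differ.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class MoveIntoDataStore {

    /** Hex-encode a byte array. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Move an already-uploaded file into a FileDataStore-style layout by
     * content hash. On the same filesystem (e.g. the same NFS mount) the
     * move is a rename, so the content is not spooled a second time.
     */
    static Path moveIntoStore(Path uploaded, Path storeRoot) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256"); // illustrative digest
        try (InputStream in = new DigestInputStream(Files.newInputStream(uploaded), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* drain the stream to compute the digest */ }
        }
        String hash = hex(md.digest());
        Path target = storeRoot.resolve(hash.substring(0, 2))
                               .resolve(hash.substring(2, 4))
                               .resolve(hash.substring(4, 6))
                               .resolve(hash);
        Files.createDirectories(target.getParent());
        Files.move(uploaded, target, StandardCopyOption.ATOMIC_MOVE);
        // Next step (not shown): hand the hash to the repository as a
        // ReferenceBinary so no second copy of the content is made.
        return target;
    }
}
```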

=== UC6 - S3 import ===

This is somewhat similar to the previous case, but more around S3 support.

The usecase here is that a customer has lots of existing binaries which need to be imported
into an Oak repository. The binaries might already exist on S3 or on their existing systems.
S3 has lots of tooling to import large data sets efficiently, so it is faster to bulk upload
such binaries to an S3 bucket and then somehow transfer them to Oak for further management.

The problem though: how to efficiently get them into the S3DataStore, ideally without moving
them.

=== UC7 - Editing large files ===

Think of a video file exposed to the desktop via WebDAV. Desktop tools would do random writes
in that file. How can we cover this usecase without up/downloading the large file?
(Essentially: random write access in binaries.)
