jackrabbit-dev mailing list archives

From "Shashank Gupta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-3733) Asynchronous upload file to S3
Date Thu, 20 Feb 2014 10:59:20 GMT

    [ https://issues.apache.org/jira/browse/JCR-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906853#comment-13906853
] 

Shashank Gupta commented on JCR-3733:
-------------------------------------

h2. Specification
h3. S3DataStore Asynchronous Upload to S3
The current logic for adding a file record to the S3DataStore first adds the file to the local cache
and then uploads it to S3, all in a single synchronous step. This feature splits that logic into a
synchronous add to the local cache followed by an asynchronous upload of the file to S3. Until the
asynchronous upload completes, all data (inputstream, length and lastModified) for that file record
is fetched from the local cache.
The AWS SDK provides [upload progress listeners|http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/event/ProgressListener.html]
which deliver callbacks on the status of an in-progress upload.
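
For illustration, an asynchronous upload wired to such a listener could look roughly like this
(assuming a recent 1.x AWS SDK; the class name and the cache-related comments are a sketch, not the actual patch):
{code:java}
import java.io.File;

import com.amazonaws.event.ProgressEvent;
import com.amazonaws.event.ProgressListener;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class AsyncUploadSketch {
    // Start a non-blocking upload and react to its completion via callbacks.
    public static Upload startAsyncUpload(TransferManager tm, String bucket,
            String key, File file) {
        Upload upload = tm.upload(bucket, key, file);
        upload.addProgressListener(new ProgressListener() {
            @Override
            public void progressChanged(ProgressEvent event) {
                switch (event.getEventType()) {
                case TRANSFER_COMPLETED_EVENT:
                    // upload finished: e.g. drop the in-progress entry from the cache
                    break;
                case TRANSFER_FAILED_EVENT:
                    // upload failed: keep the local file so the record stays readable
                    break;
                default:
                    break; // intermediate byte-transfer events
                }
            }
        });
        return upload; // caller returns immediately; the S3 transfer continues in the background
    }
}
{code}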

h3. Flag to turn it off
The parameter asyncUploadLimit caps the number of concurrent asynchronous uploads to S3. Once this
limit is reached, the next upload to S3 is synchronous until one of the asynchronous uploads completes.
To disable this feature, set the asyncUploadLimit parameter to 0 in repository.xml. The default is 100.
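
For example, a repository.xml fragment could look like the following (S3 connection params elided;
asyncUploadLimit is the parameter discussed here):
{code:xml}
<DataStore class="org.apache.jackrabbit.aws.ext.ds.S3DataStore">
  <!-- ... S3 bucket/credential parameters elided ... -->
  <!-- 0 disables asynchronous uploads; the default is 100 -->
  <param name="asyncUploadLimit" value="0"/>
</DataStore>
{code}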

h3. Caution
# This feature should not be used in a clustered active-active Jackrabbit deployment, since a file
may not be fully uploaded to S3 before it is accessed on another node. In active-passive clustered
mode, this feature requires manually uploading any incomplete asynchronous uploads to S3 after a failover.
# When using this feature, it is strongly recommended NOT to delete any file from the local cache
manually, as the local cache may contain files whose uploads to S3 have not yet completed.

h3. Asynchronous Upload Cache
S3DataStore keeps an AsyncUploadCache which tracks in-progress asynchronous uploads. The class
contains two data structures: a \{@link Map<String, Long>\} of file path vs. lastModified holding
the in-progress asynchronous uploads, and a \{@link Set<String>\} of in-progress uploads that were
marked for delete while their asynchronous upload was still running. When an asynchronous upload is
initiated, an entry is added to this cache; when the asynchronous upload completes, the corresponding
entry is flushed. Any modification to this cache is immediately serialized to the filesystem inside
a synchronized block.
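
A minimal sketch of that shape (all names below are illustrative, not the actual class):
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AsyncUploadCacheSketch {
    // file path -> lastModified of uploads currently in flight
    private final Map<String, Long> inProgressUploads = new HashMap<String, Long>();
    // in-flight uploads that were marked for delete while still uploading
    private final Set<String> toBeDeleted = new HashSet<String>();

    public synchronized void add(String path, long lastModified) {
        inProgressUploads.put(path, lastModified);
        persist(); // every mutation is written through to the filesystem
    }

    public synchronized void remove(String path) {
        inProgressUploads.remove(path);
        toBeDeleted.remove(path);
        persist();
    }

    public synchronized boolean markForDelete(String path) {
        if (inProgressUploads.containsKey(path)) {
            toBeDeleted.add(path);
            persist();
            return true;
        }
        return false;
    }

    private void persist() {
        // serialize both structures to a file so pending uploads survive a restart
    }
}
{code}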

h3. Semantics of various DataStore and DataRecord APIs w.r.t. AsyncUploadCache
Prior to this feature, S3 was the single source of truth. For example,
DataStore#getRecordIfStored(DataIdentifier) returns a DataRecord if the dataIdentifier exists in S3
and null otherwise, regardless of whether the dataIdentifier exists in the local cache. With this
feature, S3 remains the source of truth for completed uploads, while AsyncUploadCache is the source
of truth for in-progress asynchronous uploads.

h4. DataRecord DataStore#addRecord(InputStream)
Checks whether an asynchronous upload can be started for the inputstream, based on asyncUploadLimit
and the current local cache size. If the local cache advises proceeding with an asynchronous upload,
this method adds an entry to AsyncUploadCache and starts the asynchronous upload; otherwise it
proceeds with a synchronous upload to S3. If an asynchronous upload is already in progress for that
dataIdentifier, it just updates lastModified in AsyncUploadCache. Once the asynchronous upload
completes, the callback removes the corresponding entry from AsyncUploadCache.
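
Roughly, the decision flow (illustrative pseudo-Java; all helper names are assumptions):
{code:java}
import java.io.File;
import java.io.InputStream;

// Illustrative control flow only; not the actual S3DataStore code.
public abstract class AddRecordSketch {
    protected int asyncUploadLimit = 100;

    public Object addRecord(InputStream stream) throws Exception {
        File cached = copyToLocalCache(stream);      // synchronous step
        String path = pathFor(cached);
        if (hasInProgressUpload(path)) {
            updateLastModified(path, System.currentTimeMillis());
        } else if (activeAsyncUploads() < asyncUploadLimit && cacheCanHold(cached)) {
            trackUpload(path, System.currentTimeMillis());
            startAsyncUpload(cached);                // returns immediately
        } else {
            uploadSynchronously(cached);             // limit reached: fall back
        }
        return recordFor(path);                      // stands in for DataRecord
    }

    protected abstract File copyToLocalCache(InputStream in) throws Exception;
    protected abstract String pathFor(File f);
    protected abstract boolean hasInProgressUpload(String path);
    protected abstract void updateLastModified(String path, long time);
    protected abstract int activeAsyncUploads();
    protected abstract boolean cacheCanHold(File f);
    protected abstract void trackUpload(String path, long time);
    protected abstract void startAsyncUpload(File f);
    protected abstract void uploadSynchronously(File f) throws Exception;
    protected abstract Object recordFor(String path);
}
{code}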

 
h4. DataRecord DataStore#getRecordIfStored(DataIdentifier)
Returns a DataRecord if an in-progress asynchronous upload exists in AsyncUploadCache or a record
exists in S3 for the dataIdentifier. If minModified > 0, the timestamp is updated in AsyncUploadCache
and S3.
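
Sketch of the lookup order (names are assumptions):
{code:java}
// Illustrative: AsyncUploadCache is consulted before S3.
public abstract class GetRecordSketch {
    public Object getRecordIfStored(String identifier) {
        if (hasInProgressUpload(identifier)) {
            return recordFromLocalCache(identifier); // backed by the local file
        }
        if (existsInS3(identifier)) {
            return recordFromS3(identifier);
        }
        return null; // not stored anywhere
    }

    protected abstract boolean hasInProgressUpload(String identifier);
    protected abstract Object recordFromLocalCache(String identifier);
    protected abstract boolean existsInS3(String identifier);
    protected abstract Object recordFromS3(String identifier);
}
{code}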
 
h4. MultiDataStoreAware#deleteRecord(DataIdentifier)
For in-progress uploads, this method adds the identifier to the "toBeDeleted" set in AsyncUploadCache.
When the asynchronous upload completes and invokes the callback, the callback checks whether the
in-progress upload was marked for delete; if so, it invokes deleteRecord to actually delete the record.
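
Sketch of the deferred-delete handshake (names are assumptions):
{code:java}
// Illustrative: deleting a record whose upload is still in flight only marks it;
// the upload-completion callback performs the real delete.
public abstract class DeleteRecordSketch {
    public void deleteRecord(String identifier) {
        if (markForDeleteIfInProgress(identifier)) {
            return; // actual delete deferred to the upload callback
        }
        deleteFromS3AndLocalCache(identifier);
    }

    // invoked by the upload progress listener on completion
    public void onUploadComplete(String identifier) {
        if (isMarkedForDelete(identifier)) {
            deleteFromS3AndLocalCache(identifier);
        }
        stopTracking(identifier);
    }

    protected abstract boolean markForDeleteIfInProgress(String identifier);
    protected abstract boolean isMarkedForDelete(String identifier);
    protected abstract void deleteFromS3AndLocalCache(String identifier);
    protected abstract void stopTracking(String identifier);
}
{code}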
 
h4. DataStore#deleteAllOlderThan(long min)
Deletes records older than min from S3. Since AsyncUploadCache maintains a map of in-progress
asynchronous uploads vs. lastModified, it marks for delete those asynchronous uploads whose
lastModified < min. When an asynchronous upload completes and invokes the callback, the callback
checks whether the in-progress upload was marked for delete; if so, it invokes deleteRecord to
actually delete the record.
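
Sketch of the mark-older-than scan over the in-progress map (names are assumptions):
{code:java}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative: mark in-flight uploads older than 'min' for deferred deletion.
public class DeleteOlderThanSketch {
    public static Set<String> markOlderThan(long min,
            Map<String, Long> inProgressUploads, Set<String> toBeDeleted) {
        Set<String> marked = new HashSet<String>();
        for (Map.Entry<String, Long> e : inProgressUploads.entrySet()) {
            if (e.getValue() < min) {        // lastModified < min
                toBeDeleted.add(e.getKey());
                marked.add(e.getKey());
            }
        }
        return marked; // callbacks delete these once their uploads complete
    }
}
{code}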
 
h4. Iterator<DataIdentifier> DataStore#getAllIdentifiers()
Returns all identifiers in S3, plus the in-progress upload identifiers from AsyncUploadCache, minus
the identifiers in AsyncUploadCache's "toBeDeleted" set.
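
Sketch (plain string sets stand in for the identifier collections):
{code:java}
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Illustrative: union S3 identifiers with in-progress uploads, minus pending deletes.
public class GetAllIdentifiersSketch {
    public static Iterator<String> allIdentifiers(Set<String> s3Ids,
            Set<String> inProgress, Set<String> toBeDeleted) {
        Set<String> ids = new HashSet<String>(s3Ids);
        ids.addAll(inProgress);
        ids.removeAll(toBeDeleted);
        return ids.iterator();
    }
}
{code}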

h4. long DataRecord#getLength()
If the file exists in the local cache, the length is retrieved from there; otherwise it is retrieved
from S3.

h4. DataRecord#getLastModified()
If the record is an in-progress upload, lastModified is retrieved from AsyncUploadCache; otherwise
it is retrieved from S3.
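
Both accessors follow the same fallback pattern; an illustrative sketch (names are assumptions):
{code:java}
import java.io.File;

// Illustrative: prefer local data while the upload is pending, otherwise ask S3.
public abstract class RecordMetadataSketch {
    public long getLength(String identifier) {
        File local = localCacheFile(identifier);
        return local.exists() ? local.length() : lengthFromS3(identifier);
    }

    public long getLastModified(String identifier) {
        Long pending = lastModifiedFromAsyncCache(identifier);
        return pending != null ? pending : lastModifiedFromS3(identifier);
    }

    protected abstract File localCacheFile(String identifier);
    protected abstract long lengthFromS3(String identifier);
    protected abstract Long lastModifiedFromAsyncCache(String identifier);
    protected abstract long lastModifiedFromS3(String identifier);
}
{code}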

h3. Behavior of Local Cache Purge
The local cache has a size limit; when the current size of the cache exceeds that limit, the cache
goes into auto-purge mode to evict older entries and reclaim space. During purging, the local cache
makes sure it does not delete any file with an in-progress asynchronous upload.
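
Sketch of the purge loop with the in-progress guard (names are assumptions):
{code:java}
import java.io.File;
import java.util.List;

// Illustrative: evict least-recently-used files until under the target size,
// but never a file whose asynchronous upload is still pending.
public abstract class CachePurgeSketch {
    public void purge(List<File> leastRecentlyUsedFirst, long targetSize) {
        long size = currentCacheSize();
        for (File f : leastRecentlyUsedFirst) {
            if (size <= targetSize) {
                break;
            }
            if (hasInProgressUpload(f)) {
                continue; // deleting it would lose data not yet in S3
            }
            size -= f.length();
            f.delete();
        }
    }

    protected abstract long currentCacheSize();
    protected abstract boolean hasInProgressUpload(File f);
}
{code}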

h3. DataStore initialization behavior w.r.t. AsyncUploadCache
It is possible that asynchronous uploads are still in progress when the server shuts down. When an
asynchronous upload is added to AsyncUploadCache, it is immediately persisted to a file on the
filesystem. During S3DataStore's initialization, it checks for any incomplete asynchronous uploads
and uploads them concurrently in multiple threads. It throws a RepositoryException if the file for
such an asynchronous upload is not found in the local cache; as far as the code is concerned, this
can only happen when somebody has removed files from the local cache manually. If there is such an
exception and the user wants to proceed despite the inconsistency, set the parameter
contOnAsyncUploadFailure to true in repository.xml; this ignores all missing files and resets
AsyncUploadCache.
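
Sketch of the startup recovery (helper names are assumptions; contOnAsyncUploadFailure and the
RepositoryException behavior are the ones described above):
{code:java}
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.jcr.RepositoryException;

// Illustrative: re-upload whatever the persisted AsyncUploadCache still lists as pending.
public abstract class StartupRecoverySketch {
    public void recoverPendingUploads(boolean contOnAsyncUploadFailure)
            throws RepositoryException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (String path : loadPersistedPendingUploads()) {
            final File f = new File(localCacheDirectory(), path);
            if (!f.exists()) {
                if (contOnAsyncUploadFailure) {
                    forgetPendingUpload(path); // accept the inconsistency and move on
                    continue;
                }
                throw new RepositoryException(
                        "Missing local cache file for pending upload: " + path);
            }
            pool.submit(new Runnable() {
                public void run() {
                    uploadToS3(f); // each pending upload runs in its own thread
                }
            });
        }
        pool.shutdown();
    }

    protected abstract Iterable<String> loadPersistedPendingUploads();
    protected abstract File localCacheDirectory();
    protected abstract void forgetPendingUpload(String path);
    protected abstract void uploadToS3(File f);
}
{code}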


> Asynchronous upload file to S3
> ------------------------------
>
>                 Key: JCR-3733
>                 URL: https://issues.apache.org/jira/browse/JCR-3733
>             Project: Jackrabbit Content Repository
>          Issue Type: Sub-task
>          Components: jackrabbit-core
>            Reporter: Shashank Gupta
>             Fix For: 2.7.5
>
>
> S3DataStore Asynchronous Upload to S3
> The current logic for adding a file record to the S3DataStore first adds the file to the local
> cache and then uploads it to S3 in a single synchronous step. This feature splits that logic into
> a synchronous add to the local cache followed by an asynchronous upload of the file to S3. Until
> the asynchronous upload completes, all data (inputstream, length and lastModified) for that file
> record is fetched from the local cache.
> The AWS SDK provides upload progress listeners which deliver callbacks on the status of an
> in-progress upload.
> As of now, a customer has reported that the write performance of an EBS-based DataStore is 3x
> better than the S3 DataStore.
> With this feature, the objective is to achieve comparable write performance for the S3 DataStore.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
