jackrabbit-dev mailing list archives

From "Thomas Mueller (JIRA)" <j...@apache.org>
Subject [jira] Updated: (JCR-926) Global data store for binaries
Date Thu, 28 Jun 2007 10:45:26 GMT

     [ https://issues.apache.org/jira/browse/JCR-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Mueller updated JCR-926:
-------------------------------

    Attachment: dataStore.patch

Hi,

This is a refactoring patch for GlobalDataStore. The patch introduces DataStore (almost) wherever
it is required, but the behavior is not yet changed (the data store is disabled). This patch
may break backwards compatibility.

NodeImpl.internalCopyPropertyFrom: Never used, removed.

ItemStateBinding.readState and writeState: Never used, removed.

Deprecated classes org.apache.jackrabbit.core.state.PMContext and org.apache.jackrabbit.core.state.util.Serializer:
Removed. Adding a parameter would break backwards compatibility anyway.

The parameter 'DataStore store' was added to many constructors and methods. I don't like it.
Would there be a better way to do it? Idea: create a new class 'RepositoryContext' with getNodeTypeRegistry(),
maybe getNamespaceResolver(), getNamespaceRegistry(), and getDataStore(). Pass this object
where appropriate.
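
A minimal sketch of what such a 'RepositoryContext' could look like. The three
interfaces below are empty placeholders standing in for the real Jackrabbit types, and
the constructor and field layout are assumptions made for illustration only; nothing
like this is part of the patch.

```java
import java.util.Objects;

// Placeholder types (assumptions): stand-ins for the real Jackrabbit classes.
interface NodeTypeRegistry {}
interface NamespaceRegistry {}
interface DataStore {}

/** Hypothetical holder for repository-wide services, passed as a single argument. */
final class RepositoryContext {
    private final NodeTypeRegistry nodeTypeRegistry;
    private final NamespaceRegistry namespaceRegistry;
    private final DataStore dataStore;

    RepositoryContext(NodeTypeRegistry nodeTypeRegistry,
                      NamespaceRegistry namespaceRegistry,
                      DataStore dataStore) {
        // Fail early so that callers never receive a half-initialized context.
        this.nodeTypeRegistry = Objects.requireNonNull(nodeTypeRegistry, "nodeTypeRegistry");
        this.namespaceRegistry = Objects.requireNonNull(namespaceRegistry, "namespaceRegistry");
        this.dataStore = Objects.requireNonNull(dataStore, "dataStore");
    }

    NodeTypeRegistry getNodeTypeRegistry() {
        return nodeTypeRegistry;
    }

    NamespaceRegistry getNamespaceRegistry() {
        return namespaceRegistry;
    }

    DataStore getDataStore() {
        return dataStore;
    }
}
```

With an object like this, a later addition (for example getNamespaceResolver()) would
only touch the context class instead of every constructor and method signature.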

Sometimes BLOBs are used only for a short time. I renamed the method create(InputStream in)
to createTemporary.

BLOBFileValue is now an abstract class. The original implementation was renamed to 'BLOBFileValueOld'.
This is only a temporary class (until DataStore is done). There is also BLOBFileValueMemory
for very small binary properties (a few hundred bytes), but it is currently not used.

The DataStore parameter is still missing in InternalValue.valueOf (this method is never called
for BINARY types); this will be changed.

InternalValue: BOOLEAN_TRUE and BOOLEAN_FALSE are fixed now.



A few notes about the FileDataStore implementation:

I didn't change Jukka's implementation so far, but I have a few ideas:

Currently all files are stored in the same directory. However, this is a problem for Windows
XP (and maybe other file systems). I would limit the number of files in the data store root
directory to 1024. Afterwards, create subdirectories data1024-2047, data2048-3071,... with
1024 files each. When required, FileDataStore reads the directory list. If faster, one index
file per directory could be created.
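
The directory scheme above can be sketched as follows, assuming every stored file is
assigned a sequential, non-negative number. The class and method names are hypothetical,
and returning an empty string for the root directory is a choice made for this sketch.

```java
/** Hypothetical helper that maps a sequential file number to its directory. */
final class DataStoreLayout {
    static final int FILES_PER_DIRECTORY = 1024;

    private DataStoreLayout() {
    }

    /**
     * Returns the subdirectory name for the given file number,
     * or an empty string if the file belongs in the data store root directory.
     */
    static String directoryFor(long fileNumber) {
        if (fileNumber < 0) {
            throw new IllegalArgumentException("fileNumber must not be negative: " + fileNumber);
        }
        // First file number of the block of 1024 that contains fileNumber.
        long first = (fileNumber / FILES_PER_DIRECTORY) * FILES_PER_DIRECTORY;
        if (first == 0) {
            return ""; // files 0..1023 stay in the root directory
        }
        long last = first + FILES_PER_DIRECTORY - 1;
        return "data" + first + "-" + last;
    }
}
```

For example, directoryFor(1500) yields "data1024-2047" and directoryFor(2048) yields
"data2048-3071".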

The file name is currently the SHA-1 digest. I suggest using SHA-256 (unless it is a lot
slower or not available on some systems). Yes, you can call me paranoid. SHA-1 could be broken
in a few years.
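
A small sketch of how the digest could be computed with the standard
java.security.MessageDigest API, so that SHA-1 and SHA-256 can be compared on a given
system. The class and method names are hypothetical and this is not the FileDataStore
code; wrapping the checked exceptions is a simplification for the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that turns a stream into a lower-case hex digest. */
final class DigestNames {
    private DigestNames() {
    }

    static String hexDigest(InputStream in, String algorithm) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Reported unchecked so the caller can fall back to another algorithm.
            throw new IllegalArgumentException("Digest not available: " + algorithm, e);
        }
        try {
            // Read in chunks so that large binaries are never held in memory at once.
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Calling hexDigest(stream, "SHA-256") gives a 64-character name, compared to 40
characters for "SHA-1", so file names would become longer with this change.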

As the file name, I would use: <id>-<digest>.data. As the DataIdentifier, use
<id>-<digest>. This would speed up finding files when reading, as (id / 1024)
is the directory (direct lookup). Also, this would allow bundling data files in tar files.
Tar file support would be priority 2. I would only bundle very small (< 4 KB) files in
tar files anyway. Priority 3 would be compression (for text data mainly).
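
The naming scheme could be sketched like this, assuming <id> is a non-negative decimal
number and <digest> is a hex string without dashes. The class name and its methods are
hypothetical and only illustrate the direct lookup; they are not the DataIdentifier
class from the patch.

```java
/** Hypothetical value class for an identifier of the form <id>-<digest>. */
final class DataFileName {
    static final int FILES_PER_DIRECTORY = 1024;

    private final long id;
    private final String digest;

    DataFileName(long id, String digest) {
        if (id < 0 || digest == null || digest.isEmpty()) {
            throw new IllegalArgumentException("Invalid id or digest");
        }
        this.id = id;
        this.digest = digest;
    }

    /** Parses "<id>-<digest>"; the first dash separates the two parts. */
    static DataFileName parse(String identifier) {
        int dash = identifier.indexOf('-');
        if (dash <= 0 || dash == identifier.length() - 1) {
            throw new IllegalArgumentException("Not of the form <id>-<digest>: " + identifier);
        }
        long id = Long.parseLong(identifier.substring(0, dash));
        return new DataFileName(id, identifier.substring(dash + 1));
    }

    String identifier() {
        return id + "-" + digest;
    }

    String fileName() {
        return identifier() + ".data";
    }

    /** Direct lookup: the directory index follows from the id alone. */
    long directoryIndex() {
        return id / FILES_PER_DIRECTORY;
    }
}
```

For the identifier "1500-ab12", fileName() gives "1500-ab12.data" and directoryIndex()
gives 1, so no directory listing is needed to locate the file.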

There is no garbage collection at this time. This still needs to be implemented.

Thomas


> Global data store for binaries
> ------------------------------
>
>                 Key: JCR-926
>                 URL: https://issues.apache.org/jira/browse/JCR-926
>             Project: Jackrabbit
>          Issue Type: New Feature
>          Components: core
>            Reporter: Jukka Zitting
>         Attachments: dataStore.patch, DataStore.patch, DataStore2.patch, internalValue.patch, ReadWhileSaveTest.patch
>
>
> There are three main problems with the way Jackrabbit currently handles large binary
> values:
> 1) Persisting a large binary value blocks access to the persistence layer for extended
> amounts of time (see JCR-314)
> 2) At least two copies of binary streams are made when saving them through the JCR API:
> one in the transient space, and one when persisting the value
> 3) Versioning and copy operations on nodes or subtrees that contain large binary values
> can quickly end up consuming excessive amounts of storage space.
> To solve these issues (and to get other nice benefits), I propose that we implement a
> global "data store" concept in the repository. A data store is an append-only set of binary
> values that uses short identifiers to identify and access the stored binary values. The data
> store would trivially fit the requirements of transient space and transaction handling due
> to the append-only nature. An explicit mark-and-sweep garbage collection process could be
> added to avoid concerns about storing garbage values.
> See the recent NGP value record discussion, especially [1], for more background on this
> idea.
> [1] http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200705.mbox/%3c510143ac0705120919k37d48dc1jc7474b23c9f02cbd@mail.gmail.com%3e

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

