jackrabbit-oak-issues mailing list archives

From "Francesco Mari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-4833) Document storage format changes
Date Tue, 13 Dec 2016 12:23:58 GMT

    [ https://issues.apache.org/jira/browse/OAK-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745023#comment-15745023
] 

Francesco Mari commented on OAK-4833:
-------------------------------------

For this list of changes to make sense to an external reader - e.g. a new contributor - we
need at least three pieces of information.
# An initial design document for oak-segment, covering the binary format. This should answer
the question "Where does the Segment Store come from?".
# A list of changes made to the binary format. This should answer the question "How did we
arrive here?"
# An up-to-date design document describing the current status of the Segment Store. This should
answer the question "How does it work?"
I think we have 1 thanks to our current design documents, and we will have 3 as part of OAK-4648,
so this issue is about achieving 2. Is my understanding correct?


> Document storage format changes
> -------------------------------
>
>                 Key: OAK-4833
>                 URL: https://issues.apache.org/jira/browse/OAK-4833
>             Project: Jackrabbit Oak
>          Issue Type: Technical task
>          Components: doc, segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: documentation
>             Fix For: 1.6, 1.5.17
>
>
> This issue serves as a collection of all changes to the storage format introduced with
 Oak Segment Tar and their impact. Once sufficiently stabilised, this information should serve
as the basis for the documentation in {{oak-doc}}.
> || Change || Rationale || Impact || Migration || Since || Issues ||
> |Generation in segment header |Required to unequivocally determine the generation of
a segment during cleanup. Segment retention time is given in number of generations (2 by default).
|No performance or space impact expected |offline |0.0.2 |OAK-3348 | 
> |Stable id for node states |Required to efficiently determine equality of node states.
This can be seen as an intermediate step to decoupling the address of records from their identity.
The next step is to introduce logical record ids (OAK-4659). |Node states increase by the
size of one record id (3 bytes / 20 bytes after OAK-4631). On top of that there is an additional
block record à 18 bytes per node state. |offline |0.0.2 |OAK-3348
> |Binary index in tar files |Avoid traversing the repository to collect the gc roots for
DSGC. Fetch them from an index instead. |Additional index entry per tar file. Adds a couple
of bytes per external binary to each tar file. Exact size to be determined. [~frm] could you
help with this? OAK-4740 is a regression wrt. resiliency caused by this change (and the
fact that the blob store might return blob ids longer than 2k chars).  |offline |0.0.4 |OAK-4101
> |Simplified record ids |Preparation and precondition for logical record ids (OAK-4659).
At the same time the simplest possible fix for OAK-2896. The latter leads to degeneration
of segment sizes, which in turn has adverse effects on overall performance, resource utilisation
and memory requirements. Without this fix OAK-2498 would need to be fixed in a different way
that would require other changes in the storage format. I started to regard this issue as
removing a premature optimisation (which caused OAK-2498). OTOH with OAK-4844 we should also
start looking into mitigations and what those would mean to size vs. simplicity vs. performance.
 |Record ids grow from 3 bytes to 18 bytes when serialised into records. Impact on repositories
to be assessed but can be anywhere from almost none to x6. OAK-4812 is a performance regression
caused by this change. Its overall impact is yet to be assessed. |offline |0.0.10 |OAK-4631,
OAK-4844
> |Storage format versioning |In order to be able to further evolve the storage format
with minimal impact on existing deployments we need to carefully version the various storage
entities (segments, tar files, etc.). |No performance or space impact expected |offline |0.0.2/
0.0.10 |OAK-4232, OAK-4683, OAK-4295
> |Logical record ids |We need to separate addresses of records from their identity to
be able to further scale the TarMK. OAK-3348 (the online compaction misery) can be seen as
a symptom of failing to understand this earlier. The stable ids introduced with OAK-3348 are
a first step in this direction. However, this is not sufficient to implement features such as
background compaction (OAK-4756), partial compaction (OAK-3349) or incremental compaction
(OAK-3350).  |A small size overhead per segment for the logical id table. Further impact to
be evaluated ([~frm], please add your assessment here). |offline |0.0.14 (planned) |OAK-4659
> |External index for segments |Avoid recreating tar files if indexes are corrupt/missing.
Just recreate the indexes. |Faster startup after a crash. Overall less disk space usage as
no unnecessary backup files are created. |online |not yet planned |OAK-4649
> |In-place journal |Reduce complexity by in-lining the journal log. Fewer files, fewer chances
to break something. Also the granularity of the log would increase as flushing of the persisted
head would not be required any more. Resilience would improve as the roll-back functionality
could operate at a finer granularity. |No more journal.log. Better resiliency. Significant
risk of regression of OAK-4291 if not implemented properly. Most likely a significant refactoring
of some parts of the code is required before we can proceed with this issue.  |online |not
yet planned |OAK-4103
> |Root record types |With the information currently available from the segment headers
we cannot collect statistics about segment usage on repositories of non-trivial sizes. This
fix would allow us to build more scalable tools in that respect.  |None expected wrt. performance
and size under normal operation. |offline |0.0.14 (planned) (waiting for OAK-4659 as implementation
depends on how we progress there) |OAK-2498
> Misc ideas currently on the back burner:
> * SegmentMK: Arch segments (OAK-1905)
> * Extension headers for segments (no issue yet)
> * More memory efficient serialisation of values (e.g. boolean) (no issue yet)
> * Protocol Buffer for serialising records (no issue yet)
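
The "almost none to x6" range quoted for the simplified record ids row can be illustrated with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that a record consists of some payload bytes plus a number of serialised record ids, and it ignores headers, alignment and any other overhead. The 3 and 18 byte sizes come from the table above; the function and parameter names are hypothetical and are not part of Oak.

```python
# Serialised record id sizes taken from the "Simplified record ids" row.
OLD_ID_BYTES = 3   # before OAK-4631
NEW_ID_BYTES = 18  # after OAK-4631


def growth_factor(payload_bytes: int, id_count: int) -> float:
    """Ratio of new to old record size under the simplified model.

    Assumption (not from Oak): record size = payload + id_count * id size.
    """
    old_size = payload_bytes + OLD_ID_BYTES * id_count
    new_size = payload_bytes + NEW_ID_BYTES * id_count
    if old_size == 0:
        raise ValueError("record must contain payload or at least one record id")
    return new_size / old_size


# A record made up only of record ids grows by the full factor of six.
print(growth_factor(payload_bytes=0, id_count=10))    # 6.0
# A record without any record ids does not grow at all.
print(growth_factor(payload_bytes=1000, id_count=0))  # 1.0
```

Under this model every real record falls somewhere between those two extremes, depending on how much of it is taken up by record ids.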



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
