Mailing-List: contact commits-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki <wikidiffs@apache.org>
To: Apache Wiki <wikidiffs@apache.org>
Date: Wed, 04 Sep 2013 00:41:57 -0000
Message-ID: <20130904004157.72705.64084@eos.apache.org>
Subject: 
 =?utf-8?q?=5BJackrabbit_Wiki=5D_Trivial_Update_of_=22JackrabbitFileVaultF?=
 =?utf-8?q?S=22_by_TobiasBocanegra?=
Auto-Submitted: auto-generated

The "JackrabbitFileVaultFS" page has been changed by TobiasBocanegra:
https://wiki.apache.org/jackrabbit/JackrabbitFileVaultFS

New page:
'''''work in progress'''''
----
<<TableOfContents(4)>>

== Introduction ==
Various applications need a simple mapping from a JCR repository to a filesystem, for example source management tools, file server bindings, and import/export tools. If a JCR repository consisted only of `nt:file` and `nt:folder` nodes, this would be easy. But as soon as other node types are used (even one as simple as an extension of `nt:file`), the mapping to the filesystem is no longer trivial. The idea is to provide a general, all-purpose mechanism to export to and import from a standard (java.io based) filesystem.

The !VaultFs is designed to provide a general filesystem abstraction of a JCR repository. It provides the following features:

 intuitive mapping:: An `nt:file` should just map to a simple file, an `nt:folder` to a directory. More complex node types should map to a `nodename.xml` and a possible `nodename` folder that contains the child nodes, or be aggregated into a complete or partial serialization.

 universal API:: The API should be suitable for all filesystem based applications like WebDAV, CIFS, SCM integration, FileVault, etc.

 extendable:: A plugin mechanism should allow extending the mapping layer with further conversions, filters and aggregators.

== Overview ==
<img style="float:right" src="%topic.attachments%/vaultfs.png" />

!VaultFs consists mainly of two layers that map the repository's nodes to !VaultFs files. The '''Aggregate Node Tree''', which is managed by the ''aggregate manager'', represents a hierarchical view of the content aggregates. Each aggregate is addressed by a path and allows access to its artifacts. The artifacts nodes are built using ''aggregators'' that define which repository items belong to an aggregate and what artifacts they produce. For each artifact a ''serializer'' is defined that is used to export and import the respective content.

On top of the aggregate tree is the '''Vault File System''', which accesses the aggregates and exposes them as a tree of ''vault files''. They can be used to export and import the actual repository content. The mapping from aggregates and their artifacts to vault files is done in an intuitive way so that clients (and users) can deal with them in a natural, filesystem-like fashion.

{{%topic.attachments%/vault_sample.png|Example Tree}}

== Aggregate Manager ==
The aggregate manager is configured with a set of aggregators and serializers. Once the manager is mounted on a JCR repository, it exposes a tree of aggregates. They are collected using an aggregator that matches the respective repository node. For example, the ''nt:file aggregator'' produces an artifacts node that allows no further child nodes and provides (usually) one primary artifact (which represents the content of the file).

=== Artifacts ===
An artifact is one aspect or part of a content aggregation. The following artifact types exist:
 * Directory Artifacts
 * File Artifacts
 * Primary Artifacts
 * Binary Artifacts

'''Directory''' artifacts represent the folder aspect of an aggregate. For example, a pure `nt:folder` would produce an aggregate with just one sole directory artifact.

'''File''' artifacts represent file aggregates. Since `nt:file` handling is very special, there is a dedicated type for it.

'''Primary''' artifacts represent the main aggregate. This usually contains all nodes and properties that belong to the aggregate and cannot be expressed by another type.

'''Binary''' artifacts represent binary content that is not included in the primary or file artifacts. This is suitable, for example, for binary properties that were not included in an XML serialization. This keeps the serializations leaner and more efficient.
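
As a rough illustration of the four artifact types, the following Python sketch models them as a small data structure. The names `ArtifactType`, `Artifact`, `relative_path` and `pure_folder_artifacts` are hypothetical and are not taken from any !VaultFs API; only the four types and the pure `nt:folder` case come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class ArtifactType(Enum):
    DIRECTORY = "directory"  # folder aspect of an aggregate
    FILE = "file"            # special type for nt:file handling
    PRIMARY = "primary"      # main serialization of the aggregate
    BINARY = "binary"        # binary content kept out of the other artifacts


@dataclass(frozen=True)
class Artifact:
    type: ArtifactType
    relative_path: str  # hypothetical field: path relative to the parent directory


def pure_folder_artifacts(name: str) -> list:
    """A pure nt:folder yields an aggregate with one sole directory artifact."""
    return [Artifact(ArtifactType.DIRECTORY, name)]


folder = pure_folder_artifacts("docs")
print([(a.type.value, a.relative_path) for a in folder])
```

An enum keeps the set of types closed, which matches the fixed list given above; a real implementation may well model this differently.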

== Content Aggregation ==
A subtree of nodes is aggregated semantically into one entity, the aggregate. An aggregate mainly consists of a path and a set of artifacts, and may have child aggregates.

How content aggregation works is defined by a set of '''filters''' with corresponding '''aggregators'''. Viewed recursively, an export works as follows:
 # traverse the repository starting at the root node
 # for each node, check which filter matches
 # execute the respective aggregator and create a new aggregate
 # if the aggregator allows child nodes, descend into the excluded nodes
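
The recursive steps above can be sketched in Python as follows. This is only a minimal model under stated assumptions: `Node`, `Aggregate`, `TypeAggregator`, `GenericAggregator`, `matches` and `excluded` are hypothetical names, the filter is reduced to a primary type check, and the generic aggregator simply excludes all direct children. The sample tree is the `apps` structure from the overlapping example in the Generic aggregates section.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    primary_type: str
    children: list = field(default_factory=list)


@dataclass
class Aggregate:
    path: str
    kind: str
    children: list = field(default_factory=list)


class TypeAggregator:
    """Covers the whole subtree of nodes with a given primary type."""

    def __init__(self, kind, node_type):
        self.kind = kind
        self.node_type = node_type

    def matches(self, node):  # the "filter" part
        return node.primary_type == self.node_type

    def excluded(self, node):  # nothing is left over for other aggregators
        return []


class GenericAggregator:
    """Fallback: covers only the node itself and excludes all children."""

    kind = "generic"

    def matches(self, node):
        return True

    def excluded(self, node):
        return list(node.children)


def aggregate(node, parent_path, aggregators):
    path = parent_path + "/" + node.name
    # step 2: the first aggregator whose filter matches wins
    aggregator = next(a for a in aggregators if a.matches(node))
    # step 3: create the aggregate
    result = Aggregate(path, aggregator.kind)
    # step 4: descend into the nodes the aggregator did not cover
    for child in aggregator.excluded(node):
        result.children.append(aggregate(child, path, aggregators))
    return result


def flatten(agg):
    yield (agg.path, agg.kind)
    for child in agg.children:
        yield from flatten(child)


tree = Node("apps", "nt:unstructured", [
    Node("example", "nt:unstructured", [
        Node("components", "nt:unstructured", [
            Node("image", "cq:Component", [
                Node("dialog", "cq:Dialog"),
                Node("default.jsp", "nt:file"),
            ]),
        ]),
    ]),
])
aggregators = [
    TypeAggregator("full coverage", "cq:Dialog"),
    TypeAggregator("file", "nt:file"),
    GenericAggregator(),
]
for path, kind in flatten(aggregate(tree, "", aggregators)):
    print(path, "->", kind)
```

The order of the aggregator list matters in this sketch, because the generic aggregator matches every node and therefore has to come last.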

=== Aggregates ===
An aggregate is a tree of repository items that belong together and are mapped to (a set of) artifacts. The artifacts represent filesystem resources. The aggregate type is defined by the aggregator type and not primarily by the content, i.e. the selected aggregator must return stable coverage information that does not depend on the actual content.

Four types of aggregates can be identified.

==== Full coverage aggregates ====
They aggregate an entire subtree, for example the complete serialization of an `nt:nodeType` node or a ''dialog definition''. They are very simple to deal with, since the root node of the aggregate is usually serialized into one filesystem file.

The following repository structure:
{{{
+ nodetypes [nt:unstructured]
  + nt1 [nt:nodeType]
    + jcr:propertyDefinition [nt:propertyDefinition]
    + jcr:propertyDefinition [nt:propertyDefinition]
    + jcr:childNodeDefinition [nt:childNodeDefinition]
  + nt2 [nt:nodeType]
    ...
}}}
could be mapped to:
{{{
`- nodetypes
   |- nt1.cnd
   `- nt2.cnd
}}}

==== Generic aggregates ====
Generic aggregates cover only a part of a content subtree, hence they do not have full coverage. They always consist of at least a primary artifact and a directory artifact. Examples are the aggregation of a `cq:Page` structure or of `nt:unstructured` nodes.

The following repository structure:
{{{
+ en [cq:Page]
  + jcr:content [cq:Content]
  + about [cq:Page]
    + jcr:content [cq:Content]
      + header [cq:Content]
        + image.jpg
  + solutions [cq:Page]
    + jcr:content [cq:Content]
}}}

is mapped to:
{{{
`- en
   |- .content.xml
   |- about
   |  |- _jcr_content
   |  |  `- header
   |  |     `- image.jpg
   |  `- .content.xml
   `- solutions
      `- .content.xml
}}}

The example above just excludes some direct child nodes of the aggregate root from the aggregation (with the exception of the `image.jpg` node). But this could be more complicated.

Overlapping example:
{{{
+ apps [nt:unstructured]
  + example [nt:unstructured]
    + components [nt:unstructured]
      + image [cq:Component]
        + dialog [cq:Dialog]
          ...
        + default.jsp [nt:file]
}}}

is mapped to:
{{{
`- apps
   |- .content.xml
   `- example
      |- .content.xml
      `- components
         |- .content.xml
         `- image
            |- .content.xml
            |- dialog.xml
            `- default.jsp
}}}

This example has 6 aggregates:
 # the generic aggregate for `apps`
 # the generic aggregate for `example`
 # the generic aggregate for `components`
 # the generic aggregate for `image`
 # the `default.jsp` file aggregate
 # the `dialog.xml` full coverage aggregate

==== Simple File aggregates ====
Since files (`nt:file` nodes and their extensions) are common, they are treated differently in aggregation. The simplest mapping is to create a filesystem file for each `nt:file`. Unfortunately, there is some information in a default `nt:file` that cannot be preserved in the filesystem, namely:
 * `jcr:created` property
 * `jcr:content/jcr:uuid` property
 * `jcr:content/jcr:encoding` property
 * `jcr:content/jcr:mimeType` property

So, to achieve a complete serialization, an extra artifact is needed to store this information. But to keep the mapping lean, those properties are not part of the file aggregate but are 'delegated' to its parent aggregate.

Example:
{{{
+ foo [nt:folder]
  + example.jsp [nt:file]
    - jcr:created ...
    + jcr:content [nt:resource]
      - jcr:data
      - jcr:lastModified
      - jcr:mimeType
}}}

is mapped to:
{{{
`- foo
   |- .content.xml
   `- example.jsp
}}}

The `.content.xml` will include the properties that are not handled by `example.jsp`.
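
The delegation can be pictured with a short Python sketch that splits the properties of an `nt:file` into those that stay with the file artifact and those handed to the parent aggregate. The function name `split_file_properties`, the flat dictionary keyed by relative property path, and the sample values are all assumptions made for illustration; only the list of delegated properties comes from the text above.

```python
# Properties the text lists as not preservable in a plain filesystem file.
DELEGATED = {
    "jcr:created",
    "jcr:content/jcr:uuid",
    "jcr:content/jcr:encoding",
    "jcr:content/jcr:mimeType",
}


def split_file_properties(properties: dict):
    """Return (kept with the file artifact, delegated to the parent aggregate)."""
    in_file = {k: v for k, v in properties.items() if k not in DELEGATED}
    delegated = {k: v for k, v in properties.items() if k in DELEGATED}
    return in_file, delegated


# Placeholder values, for illustration only.
props = {
    "jcr:created": "<creation date>",
    "jcr:content/jcr:data": b"<file content>",
    "jcr:content/jcr:lastModified": "<modification date>",
    "jcr:content/jcr:mimeType": "text/plain",
}
in_file, delegated = split_file_properties(props)
print(sorted(in_file))
print(sorted(delegated))
```

In this model the delegated dictionary is what would end up in the parent's `.content.xml`.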

==== Extended File aggregates ====
When `nt:file` nodes are extended, either by primary or mixin type, the primary artifact remains the generic serialization of the resource. Additional information needs to be serialized to an extra artifact.

Example:
{{{
+ sample.jpg [dam:file]
  - jcr:created
  + jcr:content [dam:resource]
    - jcr:lastModified
    + dam:thumbnails [nt:folder]
      + 90.jpg [nt:file]
      + 120.jpg [nt:file]
}}}

is mapped to:
{{{
|- sample.jpg
`- sample.jpg.dir
   |- .content.xml
   `- _jcr_content
      `- _dam_thumbnails
         |- 90.jpg
         `- 120.jpg
}}}

==== Folder aggregates ====
Pure `nt:folder` aggregates result in one directory and, in most cases, an additional `.content.xml`.

==== Binary Properties ====
There is some special handling for binary properties other than `jcr:data` in a `jcr:content` node.

Example (although this is probably very rare):
{{{
+ foo [nt:unstructured]
  + bar [nt:unstructured]
    + 0001 [nt:unstructured]
      - data1 (binary)
      - data2 (binary)
    + 0002 [nt:unstructured]
      - data1 (binary)
      - data2 (binary)
}}}

is mapped to:
{{{
`- foo
   |- .content.xml
   `- bar
      |- 0001
      |  |- data1.bin
      |  `- data2.bin
      `- 0002
         |- data1.bin
         `- data2.bin
}}}

==== Resource Nodes ====
There are some cases where `nt:resource`-like structures are used that are not held below an `nt:file` node:
{{{
+ foo [nt:unstructured]
  + cq:content [nt:resource]
    - jcr:mimeType "image/jpg"
    - jcr:data
    - jcr:lastModified
}}}

This is mapped to:
{{{
`- foo
   |- .content.xml
   `- _cq_content.jpg
}}}
Here the MIME type and modification date can be recorded in the primary artifact. Possible other properties such as `jcr:uuid` would go to the parent aggregate.

==== Filename escaping ====
Not all characters in a JCR name are allowed in filesystem names, so some need escaping. The normal case is to use URL encoding, i.e. a `%` followed by the hexadecimal code of the character. But this looks ugly, especially for the colon `:`; for example, `cq:content` would become `cq%3acontent`. So for the namespace prefix there is a special escaping that encloses the prefix in underscores, e.g. `cq:content` becomes `_cq_content`. Names that already match this pattern are escaped with an additional leading underscore, e.g. `_test_image.jpg` becomes `__test_image.jpg`.

More examples:

||'''node name'''||'''file name'''||
|| `test.jpg`          || `test.jpg`                     ||
|| `cq:content`        || `_cq_content`                 ||
|| `test_image.jpg`    || `test_image.jpg`               ||
|| `_testimage.jpg`    || `_testimage.jpg`              ||
|| `_test_image.jpg`   || `__test_image.jpg`            ||
|| `cq:test:image.jpg` || `_cq_test%3aimage.jpg` ^1^    ||

'''^1^''' This is a very rare case, which justifies the ugly `%3a` escaping.
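
The escaping rules and the table above can be reproduced with a short Python sketch. It is not the actual implementation: the function names `escape_name` and `_percent_escape` are hypothetical, and the set of characters that get percent-escaped is an assumption, since the text does not list it.

```python
# Characters percent-escaped in this sketch. The exact set is an assumption;
# the text only says that not all JCR name characters are allowed.
UNSAFE = set('%/\\:*?"<>|')


def _percent_escape(text: str) -> str:
    # '%' followed by the lowercase hex code of the character, e.g. ':' -> '%3a'
    return "".join(
        "%{:02x}".format(ord(ch)) if ch in UNSAFE else ch for ch in text
    )


def escape_name(name: str) -> str:
    prefix, colon, local = name.partition(":")
    if colon and prefix:
        # namespace prefix: cq:content -> _cq_content; any further colon
        # in the local part falls back to %3a
        return "_" + _percent_escape(prefix) + "_" + _percent_escape(local)
    if name.startswith("_") and "_" in name[1:]:
        # already looks like an escaped prefix: add one more underscore
        return "_" + _percent_escape(name)
    return _percent_escape(name)


for name in ["test.jpg", "cq:content", "test_image.jpg", "_testimage.jpg",
             "_test_image.jpg", "cq:test:image.jpg"]:
    print(name, "->", escape_name(name))
```

The sketch only covers the export direction; reversing the mapping, and prefixes that themselves contain underscores, would need additional rules that the text does not spell out.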

== Serialization ==
The serialization of the artifacts is defined by the ''serializer'' that is provided by the aggregator. Currently only three kinds of serialization are used: a direct data serialization for the contents of file or binary artifacts, a ''CND'' serialization for `nt:nodeType` nodes, and an enhanced ''docview'' serialization for the rest. The ''docview'' serialization that is used allows multi-value properties (and might be enhanced with better property type support).

== Deserialization ==
Although only three serialization types are used for exporting, importing is a bit different. The importer analyzes the provided input sources and distinguishes the following serialization types:
 * generic XML
 * docview XML
 * sysview XML
 * generic data
Depending on the configuration, those input sources can be handled differently. Currently they are imported as follows:

'''generic XML''' produces an `nt:file` whose `jcr:content` is the deserialization of the XML document (when importing into CRX, the `crx:XmlDocument` node types and friends are used).

'''docview XML''' is more or less imported directly below the respective import root.

'''sysview XML''' is more or less imported directly below the respective import root.

'''generic data''' produces an `nt:file` having the data as `nt:resource` content.
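
How an importer might tell the four input types apart can be sketched in Python as follows. This is a guess at a plausible detection strategy, not the importer's actual logic: the function name `detect_serialization` is hypothetical, and recognizing docview XML by a `jcr:root` document element is an assumption. The system view namespace and its `sv:node` element are defined by the JCR specification.

```python
import xml.etree.ElementTree as ET

# System view documents use sv:node elements in this namespace (JCR spec).
SV_NODE = "{http://www.jcp.org/jcr/sv/1.0}node"
# Assumption: docview exports use a jcr:root element as document element.
JCR_ROOT = "{http://www.jcp.org/jcr/1.0}root"


def detect_serialization(data: bytes) -> str:
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # not well-formed XML: treat as opaque data
        return "generic data"
    if root.tag == SV_NODE:
        return "sysview XML"
    if root.tag == JCR_ROOT:
        return "docview XML"
    return "generic XML"


samples = [
    b'<sv:node xmlns:sv="http://www.jcp.org/jcr/sv/1.0" sv:name="foo"/>',
    b'<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" jcr:primaryType="nt:unstructured"/>',
    b"<html><body/></html>",
    b"\x00\x01 not xml",
]
for sample in samples:
    print(detect_serialization(sample))
```

Parsing the whole input just to inspect the document element is wasteful for large files; a streaming parser that stops after the first start tag would be the more realistic choice.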

== Vault File System Layer ==
The !VaultFs layer provides a mapping from the aggregate tree to a file system. The goal is to keep the number of files small and the layout as natural as possible, with a minimum of extra files.

== Terminology ==
 !VaultFs:: The File Vault Filesystem. Provides a file-like abstraction of a JCR repository.
 !VaultFile:: A !VaultFs entity that represents a file-like abstraction of a (partial) repository node tree.
 Aggregate:: Represents an addressable collection of artifacts.
 Aggregator:: Interface that defines the methods for building content aggregates.
 Artifact:: Representation of a content aggregate. An aggregator can provide several artifacts. An artifact is mapped to either a file or a directory and can be of one of the following types:
  * primary
  * file
  * binary
  * directory

 Serializer:: Interface that defines the methods for serializing an artifact.
 Artifact handler:: Interface that defines methods for deserializing artifacts.