Author: chetanm
Date: Mon Jul 17 07:23:28 2017
New Revision: 1802100

URL: http://svn.apache.org/viewvc?rev=1802100&view=rev
Log: OAK-6081 - Indexing tooling via oak-run

Added:
    jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html   (with props)
    jackrabbit/site/live/oak/docs/query/oak-run-indexing.html   (with props)
    jackrabbit/site/live/oak/docs/query/pre-extract-text.html   (with props)

Added: jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html (added)
+++ jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,301 @@
Oak Run NodeStore Connection

@since Oak 1.7.1

This page provides details about the various options supported by some of the oak-run commands for connecting to a NodeStore repository. Unless documented otherwise, most of these commands connect in read-only mode.
These options are supported by the following commands (see OAK-6210):

  • console
  • index
  • tika
Depending on your setup, you will need to configure the NodeStore and BlobStore in use for the commands to work. Some commands may not require the BlobStore details. Check the help for the specific oak-run command to see whether access to the BlobStore is required.
NodeStore

SegmentNodeStore

To connect to a SegmentNodeStore, specify the path to the folder used by the SegmentNodeStore for storing the repository content:
java -jar oak-run <command> /path/to/segmentstore

DocumentNodeStore - Mongo
To connect to Mongo, specify the MongoURI:

java -jar oak-run <command> mongodb://server:port
It supports some other options, like cache size and cache distribution. Refer to the help output via -h to see the supported options.
DocumentNodeStore - RDB

«TBD»
BlobStore

FileDataStore

Specify the path to the directory used by the FileDataStore via the --fds-path option:
java -jar oak-run <command> /path/to/segmentstore --fds-path=/path/to/fds

S3DataStore

Specify the path to the config file containing the connection details for the S3 bucket via the --s3ds option:
java -jar oak-run <command> /path/to/segmentstore --s3ds=/path/to/S3DataStore.config

The file should be a valid config file, matching the S3DataStore configuration of an OSGi setup for the pid org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config.
Do change the path property to a location appropriate for the system from which the command is being run. If you are running the command on the same setup where the Oak application is running, ensure that path is set to a different location.
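A minimal sketch of such a config file, assuming the commonly used S3DataStore properties (the bucket, region, credentials and cache path shown are placeholders, not a definitive set):

accessKey="..."
secretKey="..."
s3Bucket="my-oak-bucket"
s3Region="us-east-1"
path="/path/to/local/s3ds/cache"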

+
+
+
+
Propchange: jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html
------------------------------------------------------------------------------
    svn:eol-style = native

Added: jackrabbit/site/live/oak/docs/query/oak-run-indexing.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/oak-run-indexing.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/query/oak-run-indexing.html (added)
+++ jackrabbit/site/live/oak/docs/query/oak-run-indexing.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,395 @@
Oak Run Indexing

@since Oak 1.7.0

Work in progress. Not to be used on production setups.

With Oak 1.7 we have added some tooling as part of the oak-run index command. Below are details about the various operations supported by this command.

The index command supports connecting to different NodeStores via various options, which are documented here. The examples below assume a setup consisting of a SegmentNodeStore and a FileDataStore. Depending on your setup, use the appropriate connection options.

By default the tool generates its output files in the directory indexing-result, which is referred to as the output directory.

Unless specified otherwise, all operations connect to the repository in read-only mode.
Common Options

All the commands support the following common options:

  1. --index-paths - Comma-separated list of index paths for which the selected operations need to be performed. If not specified, the operation is performed against all the indexes.
Also refer to the help output via -h for some other options.

Generate Index Info
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-info

Generates a report consisting of various stats related to the indexes present in the given repository. By default the generated report is stored in <output dir>/index-info.txt.

Supported for all index types.

Dump Index Definitions
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-definitions
The --index-definitions operation dumps the index definitions in JSON format to the file <output dir>/index-definitions.json. The JSON file contains the index definitions keyed by index path.
Supported for all index types.

Dump Index Data
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-dump

The --index-dump operation dumps the index content to the output directory. The output directory contains one folder per index. Each folder has a properties file index-details.txt which contains the indexPath.

Supported only for Lucene indexes.

Index Consistency Check
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --index-consistency-check
The --index-consistency-check operation performs a consistency check against various indexes. It supports two levels:

  • Level 1 - Specified as --index-consistency-check=1. Performs a basic check to determine if all blobs referred to in the index are valid.
  • Level 2 - Specified as --index-consistency-check=2. Performs a more thorough check to determine if all index files are valid and no corruption has happened. This check is slower.
It generates a report in <output dir>/index-consistency-check-report.txt.
Supported only for Lucene indexes.

Reindex
The reindex operation supports two modes of indexing:

  • Out-of-band indexing - Here oak-run connects to the repository in read-only mode. It requires certain manual steps.
  • Online indexing - Here oak-run connects to the repository in --read-write mode.

Supported only for Lucene indexes.

If the indexes being reindexed have fulltext indexing enabled, refer to Tika Setup for steps on how to adapt the command to include Tika support for text extraction.
A - Out-of-band indexing

Out-of-band indexing has the following phases:

  1. Get a checkpoint issued.
  2. Perform indexing with a read-only connection to the NodeStore, up to the checkpoint state.
  3. Import the generated indexes.
  4. Complete the incremental indexing from the checkpoint state to the current head.
Step 1 - Text PreExtraction

If the index being reindexed involves a fulltext index and the repository has binary content, then it is recommended to first perform text pre-extraction. This ensures that the costly text-extraction work is done prior to the actual indexing, so that the actual indexing does not perform text extraction in its critical path.
Step 2 - Create Checkpoint

Go to the CheckpointMBean and create a checkpoint with a lifetime of 1 month. «TBD»
Step 3 - Perform Reindex

In this step we perform the actual indexing via oak-run, where it connects to the repository in read-only mode:
java -jar oak-run*.jar index --fds-path=/path/to/datastore /path/to/segmentstore/ --reindex --index-paths=/oak:index/indexName
Here the following options can be used (a combined example follows the list):

  • --pre-extracted-text-dir - Directory path containing the pre-extracted text generated in step 1.
  • --index-paths - This command requires an explicit set of index paths which need to be indexed.
  • --checkpoint - The checkpoint up to which the index is updated when indexing in read-only mode. For testing purposes it can be set to ‘head’ to indicate that the head state should be used.
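A combined invocation might look like the following (the index path and directories are illustrative, and --checkpoint=head is only suitable for testing, as noted above):

java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --checkpoint=head --pre-extracted-text-dir=/path/to/extracted-text --fds-path=/path/to/datastore /path/to/segmentstore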
Step 4 - Import the index

As a last step we need to import the index back into the repository. This can be done in one of the following ways.

4.1 - Via oak-run

In this mode we import the index using oak-run:
java -jar oak-run*.jar index --index-import --read-write --index-import-dir=<index dir> /path/to/segmentstore
Here “index dir” is the directory containing the index files created in step 3. Check the logs from the previous command for the directory path.

This mode should only be used when the repository is from Oak version 1.7+, as oak-run connects to the repository in read-write mode.

4.2 - Via IndexerMBean

In this mode we import the index using JMX. Look for the IndexerMBean and then import the index directory using its importIndex operation.

4.3 - Via script

TODO - Provide a way to import the data on older setups using a script.
B - Online indexing

Online indexing automates some of the manual steps which are required for out-of-band indexing.

This mode should only be used when the repository is from Oak version 1.7+, as oak-run connects to the repository in read-write mode.

Step 1 - Text PreExtraction

This is the same as in out-of-band indexing.
Step 2 - Perform reindexing

In this step we configure oak-run to connect to the repository in read-write mode and let it perform all the other steps, i.e. checkpoint creation, indexing and import:
java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write /path/to/segmentstore

Tika Setup
If the indexes being reindexed have fulltext indexing enabled, you need to include the Tika library in the classpath. This is required even if pre-extraction is used, to ensure that any new binary added after pre-extraction can be indexed.

First download the tika-app jar from the Tika downloads page. You should be able to use version 1.15 with the Oak 1.7.4 jar.

Then modify the index command as below. The rest of the arguments remain the same as documented before:
java -cp oak-run.jar:tika-app-1.15.jar org.apache.jackrabbit.oak.run.Main index
Propchange: jackrabbit/site/live/oak/docs/query/oak-run-indexing.html
------------------------------------------------------------------------------
    svn:eol-style = native

Added: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/pre-extract-text.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/query/pre-extract-text.html (added)
+++ jackrabbit/site/live/oak/docs/query/pre-extract-text.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,338 @@
Pre-Extracting Text from Binaries

@since Oak 1.0.18, 1.2.3

Lucene indexing is performed in single-threaded mode. Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. For incremental indexing this mostly works fine, but when performing a reindex, or when creating the index for the first time after a migration, it increases the indexing time considerably. To speed up such cases, Oak supports pre-extracting text from binaries to avoid extracting text at indexing time. This feature consists of two broad steps:

  1. Extract and store the text from binaries using the oak-run tooling.
  2. Configure the Oak runtime to use the extracted text at indexing time via the PreExtractedTextProvider.

For more details on this feature refer to OAK-2892.
A - Oak Run Pre-Extraction Command

The oak-run tool provides a tika command which supports traversing the repository and extracting text from the binary properties.

Step 1 - oak-run Setup

Download the following jars:

  • oak-run 1.7.4
Refer to oak-run setup for details about connecting to different types of NodeStore. The examples below assume a setup consisting of a SegmentNodeStore and a FileDataStore. Depending on your setup, use the appropriate connection options.

You can use the current oak-run version to perform text extraction for older Oak setups, i.e. it is fine to use oak-run from the 1.7.x branch to connect to Oak repositories from version 1.0.x or later. The oak-run tooling connects to the repository in read-only mode and is hence safe to use with older versions.

The generated extracted-text directory can then be used with the older setup.
Step 2 - Generate the csv file

As the first step you need to generate a csv file containing details about the binary properties. This file is generated using the tika command from oak-run. In this step oak-run connects to the repository in read-only mode.

To generate the csv file, use the --generate action:
    java -jar oak-run.jar tika \
    --fds-path /path/to/datastore \
    /path/to/segmentstore --data-file oak-binary-stats.csv --generate
If connecting to S3, this command can take a long time, because checking the binary id currently triggers a download of the actual binary content, which we do not require. To speed this up we can use the fake DataStore support of oak-run:
    java -jar oak-run.jar tika \
    --fake-ds-path=temp \
    /path/to/segmentstore --data-file oak-binary-stats.csv --generate
This generates a csv file with content like below:
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/activities/jcr:content/folderThumbnail/jcr:content"
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/snowboarding/jcr:content/folderThumbnail/jcr:content"
...
By default it scans the whole repository. If you need to restrict it to look only under a certain path, specify that path via the --path option.
Step 3 - Perform the text extraction

Once the csv file is generated, we need to perform the text extraction. To do that, download the tika-app jar from the Tika downloads page. You should be able to use version 1.15 with the Oak 1.7.4 jar.

To perform the text extraction, use the --extract action:
    java -cp oak-run.jar:tika-app-1.15.jar \
    org.apache.jackrabbit.oak.run.Main tika \
    --data-file binary-stats.csv \
    --store-path ./store \
    --fds-path /path/to/datastore extract
This command does not require access to the NodeStore and only requires access to the BlobStore. So configure the BlobStore which is in use, like FileDataStore or S3DataStore. The above command performs text extraction using multiple threads and stores the extracted text in the directory specified by --store-path.

Currently the extracted text files are stored as one file per blob, in the same format as used by the FileDataStore. In addition, it creates 2 files:

  • blobs_error.txt - File containing the blobIds for which text extraction ended in error.
  • blobs_empty.txt - File containing the blobIds for which no text was extracted.
This phase is incremental, i.e. if it is run multiple times with the same --store-path, it avoids extracting text from previously processed binaries.

Further, the extract phase only needs access to the BlobStore and does not require access to the NodeStore. So it can be run from a different machine (possibly a more powerful one, to allow the use of multiple cores) to speed up text extraction. One can also split the csv into multiple chunks, process them on different machines and then merge the stores later. Just ensure that at merge time the blobs*.txt files are also merged.

Note that we need to launch the command with -cp instead of -jar, as we need to include classes outside of the oak-run jar, like tika-app. Also ensure that oak-run comes first in the classpath. This is required because some old classes are packaged in tika-app.
B - PreExtractedTextProvider

In this step we configure Oak to make use of the pre-extracted text for the indexing. Depending on how indexing is performed, you configure the PreExtractedTextProvider either in OSGi or in the oak-run index command.

Oak application

@since Oak 1.0.18, 1.2.3

For this, look for the OSGi config for Apache Jackrabbit Oak DataStore PreExtractedTextProvider:
![OSGi Configuration](pre-extracted-text-osgi.png)

Once the PreExtractedTextProvider is configured, then upon reindexing the Lucene indexer will make use of it to check whether text needs to be extracted or not. Check the TextExtractionStatsMBean for various statistics around text extraction and also to validate whether the PreExtractedTextProvider is being used.
Oak Run Indexing
Configure the directory storing the pre-extracted text via the --pre-extracted-text-dir option of the index command. See oak-run indexing.
Propchange: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
------------------------------------------------------------------------------
    svn:eol-style = native