jackrabbit-commits mailing list archives

From chet...@apache.org
Subject svn commit: r1802100 - in /jackrabbit/site/live/oak/docs: features/oak-run-nodestore-connection-options.html query/oak-run-indexing.html query/pre-extract-text.html
Date Mon, 17 Jul 2017 07:23:29 GMT
Author: chetanm
Date: Mon Jul 17 07:23:28 2017
New Revision: 1802100

URL: http://svn.apache.org/viewvc?rev=1802100&view=rev
Log:
OAK-6081 - Indexing tooling via oak-run

Added:
    jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html   (with props)
    jackrabbit/site/live/oak/docs/query/oak-run-indexing.html   (with props)
    jackrabbit/site/live/oak/docs/query/pre-extract-text.html   (with props)

Added: jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html (added)
+++ jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,301 @@
+<!DOCTYPE html>
+<!--
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-03 
+ | Rendered using Apache Maven Fluido Skin 1.6
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta name="Date-Revision-yyyymmdd" content="20170703" />
+    <meta http-equiv="Content-Language" content="en" />
+    <title>Jackrabbit Oak &#x2013; Oak Run NodeStore Connection</title>
+    <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
+    <link rel="stylesheet" href="../css/site.css" />
+    <link rel="stylesheet" href="../css/print.css" media="print" />
+      <script type="text/javascript" src="../js/apache-maven-fluido-1.6.min.js"></script>
+      </head>
+    <body class="topBarEnabled">
+                  <a href="https://github.com/apache/jackrabbit-oak">
+      <img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
+        src="https://s3.amazonaws.com/github/ribbons/forkme_right_red_aa0000.png"
+        alt="Fork me on GitHub">
+    </a>
+      <div id="topbar" class="navbar navbar-fixed-top ">
+      <div class="navbar-inner">
+        <div class="container-fluid">
+        <a data-target=".nav-collapse" data-toggle="collapse" class="btn btn-navbar">
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+        </a>
+<a class="brand" href="../"  title="Oak logo"><img src="../oak_logo.png" alt="Oak logo" />
+</a>
+            <ul class="nav">
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../index.html" title="Jackrabbit Oak">Jackrabbit Oak</a></li>
+            <li><a href="../license.html" title="License">License</a></li>
+            <li><a href="../downloads.html" title="Downloads">Downloads</a></li>
+            <li><a href="../articles.html" title="Articles">Articles</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Concepts and Architecture <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../architecture/overview.html" title="Overview">Overview</a></li>
+            <li><a href="../architecture/nodestate.html" title="The Node State Model">The Node State Model</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Main APIs <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="http://www.day.com/specs/jcr/2.0/index.html" title="JCR API">JCR API</a></li>
+            <li><a href="../oak_api/overview.html" title="Oak API">Oak API</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Features and Plugins <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li class="dropdown-submenu">
+<a href="../nodestore/overview.html" title="Node Storage">Node Storage</a>
+              <ul class="dropdown-menu">
+                  <li><a href="../nodestore/documentmk.html" title="Document NodeStore">Document NodeStore</a></li>
+                  <li><a href="../nodestore/segment/overview.html" title="Segment NodeStore">Segment NodeStore</a></li>
+              </ul>
+            </li>
+            <li><a href="../plugins/blobstore.html" title="Blob Storage">Blob Storage</a></li>
+            <li class="dropdown-submenu">
+<a href="../query/query.html" title="Query">Query</a>
+              <ul class="dropdown-menu">
+                  <li><a href="../query/query-engine.html" title="Query Engine">Query Engine</a></li>
+                  <li><a href="../query/query-troubleshooting.html" title="Troubleshooting">Troubleshooting</a></li>
+                  <li><a href="../query/indexing.html" title="Indexing">Indexing</a></li>
+                  <li><a href="../query/lucene.html" title="Lucene Index">Lucene Index</a></li>
+                  <li><a href="../query/property-index.html" title="Property Index">Property Index</a></li>
+                  <li><a href="../query/solr.html" title="Solr Index">Solr Index</a></li>
+              </ul>
+            </li>
+            <li><a href="../security/overview.html" title="Security">Security</a></li>
+            <li><a href="../features/atomic-counter.html" title="Atomic Counter">Atomic Counter</a></li>
+            <li><a href="../features/observation.html" title="Observation">Observation</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Using Oak <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../use_getting_started.html" title="Getting Started">Getting Started</a></li>
+            <li><a href="../construct.html" title="Repository Construction">Repository Construction</a></li>
+            <li><a href="../osgi_config.html" title="Configuring Oak">Configuring Oak</a></li>
+            <li><a href="../command_line.html" title="Command Line Tools">Command Line Tools</a></li>
+            <li><a href="../migration.html" title="Migration">Migration</a></li>
+            <li><a href="../differences.html" title="Differences to Jackrabbit 2">Differences to Jackrabbit 2</a></li>
+            <li><a href="../known_issues.html" title="Known Issues">Known Issues</a></li>
+            <li><a href="../dos_and_donts.html" title="Dos and Don'ts">Dos and Don'ts</a></li>
+            <li><a href="../coldstandby/coldstandby.html" title="Cold Standby">Cold Standby</a></li>
+            <li><a href="../FAQ.html" title="FAQ">FAQ</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developing Oak <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../dev_getting_started.html" title="Getting Started">Getting Started</a></li>
+            <li><a href="../participating.html" title="Participating">Participating</a></li>
+            <li><a href="../developing-with-git.html" title="Developing with Git">Developing with Git</a></li>
+            <li><a href="../diagnostic-builds.html" title="Cutting diagnostic builds">Cutting diagnostic builds</a></li>
+            <li><a href="../attribution.html" title="Attribution">Attribution</a></li>
+            <li><a href="../release-schedule.html" title="Release Schedule">Release Schedule</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Links <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="http://jackrabbit.apache.org/oak" title="Apache Jackrabbit Oak">Apache Jackrabbit Oak</a></li>
+            <li><a href="http://jackrabbit.apache.org/" title="Apache Jackrabbit">Apache Jackrabbit</a></li>
+        </ul>
+      </li>
+              </ul>
+            </div>
+        </div>
+      </div>
+    </div>
+    <div class="container-fluid">
+      <div id="banner">
+        <div class="pull-left"><div id="bannerLeft"><h2>Oak Documentation</h2>
+</div>
+</div>
+        <div class="pull-right"></div>
+        <div class="clear"><hr/></div>
+      </div>
+
+      <div id="breadcrumbs">
+        <ul class="breadcrumb">
+        <li id="publishDate">Last Published: 2017-07-03<span class="divider">|</span>
+</li>
+          <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
+        </ul>
+      </div>
+      <div class="row-fluid">
+        <div id="leftColumn" class="span2">
+          <div class="well sidebar-nav">
+<ul class="nav nav-list">
+          <li class="nav-header">Overview</li>
+    <li><a href="../index.html" title="Jackrabbit Oak"><span class="none"></span>Jackrabbit Oak</a>  </li>
+    <li><a href="../license.html" title="License"><span class="none"></span>License</a>  </li>
+    <li><a href="../downloads.html" title="Downloads"><span class="none"></span>Downloads</a>  </li>
+    <li><a href="../articles.html" title="Articles"><span class="none"></span>Articles</a>  </li>
+          <li class="nav-header">Concepts and Architecture</li>
+    <li><a href="../architecture/overview.html" title="Overview"><span class="none"></span>Overview</a>  </li>
+    <li><a href="../architecture/nodestate.html" title="The Node State Model"><span class="none"></span>The Node State Model</a>  </li>
+          <li class="nav-header">Main APIs</li>
+    <li><a href="http://www.day.com/specs/jcr/2.0/index.html" class="externalLink" title="JCR API"><span class="none"></span>JCR API</a>  </li>
+    <li><a href="../oak_api/overview.html" title="Oak API"><span class="none"></span>Oak API</a>  </li>
+          <li class="nav-header">Features and Plugins</li>
+    <li><a href="../nodestore/overview.html" title="Node Storage"><span class="icon-chevron-down"></span>Node Storage</a>
+      <ul class="nav nav-list">
+    <li><a href="../nodestore/documentmk.html" title="Document NodeStore"><span class="icon-chevron-down"></span>Document NodeStore</a>
+      <ul class="nav nav-list">
+    <li><a href="../nodestore/document/node-bundling.html" title="Node Bundling"><span class="none"></span>Node Bundling</a>  </li>
+    <li><a href="../nodestore/document/secondary-store.html" title="Secondary Store"><span class="none"></span>Secondary Store</a>  </li>
+    <li><a href="../nodestore/persistent-cache.html" title="Persistent Cache"><span class="none"></span>Persistent Cache</a>  </li>
+    <li><a href="../clustering.html" title="Clustering"><span class="none"></span>Clustering</a>  </li>
+      </ul>
+  </li>
+    <li><a href="../nodestore/segment/overview.html" title="Segment NodeStore"><span class="none"></span>Segment NodeStore</a>  </li>
+      </ul>
+  </li>
+    <li><a href="../plugins/blobstore.html" title="Blob Storage"><span class="none"></span>Blob Storage</a>  </li>
+    <li><a href="../query/query.html" title="Query"><span class="icon-chevron-down"></span>Query</a>
+      <ul class="nav nav-list">
+    <li><a href="../query/query-engine.html" title="Query Engine"><span class="none"></span>Query Engine</a>  </li>
+    <li><a href="../query/query-troubleshooting.html" title="Troubleshooting"><span class="none"></span>Troubleshooting</a>  </li>
+    <li><a href="../query/indexing.html" title="Indexing"><span class="none"></span>Indexing</a>  </li>
+    <li><a href="../query/lucene.html" title="Lucene Index"><span class="none"></span>Lucene Index</a>  </li>
+    <li><a href="../query/property-index.html" title="Property Index"><span class="none"></span>Property Index</a>  </li>
+    <li><a href="../query/solr.html" title="Solr Index"><span class="none"></span>Solr Index</a>  </li>
+      </ul>
+  </li>
+    <li><a href="../security/overview.html" title="Security"><span class="none"></span>Security</a>  </li>
+    <li><a href="../features/atomic-counter.html" title="Atomic Counter"><span class="none"></span>Atomic Counter</a>  </li>
+    <li><a href="../features/observation.html" title="Observation"><span class="none"></span>Observation</a>  </li>
+          <li class="nav-header">Using Oak</li>
+    <li><a href="../use_getting_started.html" title="Getting Started"><span class="none"></span>Getting Started</a>  </li>
+    <li><a href="../construct.html" title="Repository Construction"><span class="none"></span>Repository Construction</a>  </li>
+    <li><a href="../osgi_config.html" title="Configuring Oak"><span class="none"></span>Configuring Oak</a>  </li>
+    <li><a href="../command_line.html" title="Command Line Tools"><span class="none"></span>Command Line Tools</a>  </li>
+    <li><a href="../migration.html" title="Migration"><span class="none"></span>Migration</a>  </li>
+    <li><a href="../differences.html" title="Differences to Jackrabbit 2"><span class="none"></span>Differences to Jackrabbit 2</a>  </li>
+    <li><a href="../known_issues.html" title="Known Issues"><span class="none"></span>Known Issues</a>  </li>
+    <li><a href="../dos_and_donts.html" title="Dos and Don'ts"><span class="none"></span>Dos and Don'ts</a>  </li>
+    <li><a href="../coldstandby/coldstandby.html" title="Cold Standby"><span class="none"></span>Cold Standby</a>  </li>
+    <li><a href="../FAQ.html" title="FAQ"><span class="none"></span>FAQ</a>  </li>
+          <li class="nav-header">Developing Oak</li>
+    <li><a href="../dev_getting_started.html" title="Getting Started"><span class="none"></span>Getting Started</a>  </li>
+    <li><a href="../participating.html" title="Participating"><span class="none"></span>Participating</a>  </li>
+    <li><a href="../developing-with-git.html" title="Developing with Git"><span class="none"></span>Developing with Git</a>  </li>
+    <li><a href="../diagnostic-builds.html" title="Cutting diagnostic builds"><span class="none"></span>Cutting diagnostic builds</a>  </li>
+    <li><a href="../attribution.html" title="Attribution"><span class="none"></span>Attribution</a>  </li>
+    <li><a href="../release-schedule.html" title="Release Schedule"><span class="none"></span>Release Schedule</a>  </li>
+          <li class="nav-header">Links</li>
+    <li><a href="http://jackrabbit.apache.org/oak" class="externalLink" title="Apache Jackrabbit Oak"><span class="none"></span>Apache Jackrabbit Oak</a>  </li>
+    <li><a href="http://jackrabbit.apache.org/" class="externalLink" title="Apache Jackrabbit"><span class="none"></span>Apache Jackrabbit</a>  </li>
+  </ul>
+          <hr />
+          <div id="poweredBy">
+          <script type="text/javascript">asyncJs( 'https://apis.google.com/js/plusone.js' )</script>
+        <div class="g-plusone" data-href="http://jackrabbit.apache.org/oak/docs/" data-size="tall" ></div>
+                  <div class="clear"></div>
+              <div class="clear"></div>
+              <div class="clear"></div>
+              <div class="clear"></div>
+  <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="../images/logos/maven-feather.png" /></a>
+              </div>
+          </div>
+        </div>
+        <div id="bodyColumn"  class="span10" >
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+  --><h1>Oak Run NodeStore Connection</h1>
+<p><tt>@since Oak 1.7.1</tt></p>
+<p>This page describes the options that some oak-run commands support for connecting to a NodeStore repository. By default, most of these commands connect in read-only mode unless documented otherwise.</p>
+<p>These options are supported by the following commands (see <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-6210">OAK-6210</a>):</p>
+
+<ul>
+  
+<li>console</li>
+  
+<li>index</li>
+  
+<li>tika</li>
+</ul>
+<p>Depending on your setup, you need to specify the NodeStore and BlobStore in use for the commands to work. Some commands may not require the BlobStore details. Check the help output of the specific oak-run command to see whether access to the BlobStore is required.</p>
+<div class="section">
+<h2><a name="NodeStore"></a>NodeStore</h2>
+<div class="section">
+<h3><a name="SegmentNodeStore"></a>SegmentNodeStore</h3>
+<p>To connect to a SegmentNodeStore, specify the path to the folder that the SegmentNodeStore uses to store the repository content:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run &lt;command&gt; /path/to/segmentstore
+</pre></div></div></div>
+<div class="section">
+<h3><a name="DocumentNodeStore_-_Mongo"></a>DocumentNodeStore - Mongo</h3>
+<p>To connect to Mongo, specify the MongoURI:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run &lt;command&gt; mongodb://server:port
+</pre></div></div>
+<p>It supports other options such as cache size and cache distribution. Refer to the help output via <tt>-h</tt> to see the supported options.</p></div>
+<div class="section">
+<h3><a name="DocumentNodeStore_-_RDB"></a>DocumentNodeStore - RDB</h3>
+<p>&#xab;TBD&#xbb;</p></div></div>
+<div class="section">
+<h2><a name="BlobStore"></a>BlobStore</h2>
+<div class="section">
+<h3><a name="FileDataStore"></a>FileDataStore</h3>
+<p>Specify the path to the directory used by <tt>FileDataStore</tt> via the <tt>--fds-path</tt> option:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run &lt;command&gt; /path/to/segmentstore --fds-path=/path/to/fds
+</pre></div></div></div>
+<div class="section">
+<h3><a name="S3DataStore"></a>S3DataStore</h3>
+<p>Specify the path to the config file that contains the connection details for the S3 bucket to be used via the <tt>--s3ds</tt> option:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run &lt;command&gt; /path/to/segmentstore --s3ds=/path/to/S3DataStore.config
+</pre></div></div>
+<p>The file should be a valid config file, like the one used to configure S3DataStore in an OSGi setup for pid <tt>org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config</tt>.</p>
+<p>Change the <tt>path</tt> property to a location that is valid on the system where the command is run. If you run the command on the same setup where the Oak application is running, ensure that <tt>path</tt> is set to a different location.</p></div></div>
+        </div>
+      </div>
+    </div>
+    <hr/>
+    <footer>
+      <div class="container-fluid">
+        <div class="row-fluid">
+            <p>Copyright &copy;2012&#x2013;2017
+<a href="https://www.apache.org/">The Apache Software Foundation</a>.
+All rights reserved.</p>
+        </div>
+                          <div id="ohloh" class="pull-right">
+      <script type="text/javascript" src="https://www.ohloh.net/p/jackrabbit-oak/widgets/project_thin_badge.js"></script>
+    </div>
+        </div>
+    </footer>
+    </body>
+</html>
\ No newline at end of file

Propchange: jackrabbit/site/live/oak/docs/features/oak-run-nodestore-connection-options.html
------------------------------------------------------------------------------
    svn:eol-style = native

Added: jackrabbit/site/live/oak/docs/query/oak-run-indexing.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/oak-run-indexing.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/query/oak-run-indexing.html (added)
+++ jackrabbit/site/live/oak/docs/query/oak-run-indexing.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,395 @@
+<!DOCTYPE html>
+<!--
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-17 
+ | Rendered using Apache Maven Fluido Skin 1.6
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta name="Date-Revision-yyyymmdd" content="20170717" />
+    <meta http-equiv="Content-Language" content="en" />
+    <title>Jackrabbit Oak &#x2013; Oak Run Indexing</title>
+    <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
+    <link rel="stylesheet" href="../css/site.css" />
+    <link rel="stylesheet" href="../css/print.css" media="print" />
+      <script type="text/javascript" src="../js/apache-maven-fluido-1.6.min.js"></script>
+      </head>
+    <body class="topBarEnabled">
+                  <a href="https://github.com/apache/jackrabbit-oak">
+      <img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
+        src="https://s3.amazonaws.com/github/ribbons/forkme_right_red_aa0000.png"
+        alt="Fork me on GitHub">
+    </a>
+      <div id="topbar" class="navbar navbar-fixed-top ">
+      <div class="navbar-inner">
+        <div class="container-fluid">
+        <a data-target=".nav-collapse" data-toggle="collapse" class="btn btn-navbar">
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+        </a>
+<a class="brand" href="../"  title="Oak logo"><img src="../oak_logo.png" alt="Oak logo" />
+</a>
+            <ul class="nav">
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../index.html" title="Jackrabbit Oak">Jackrabbit Oak</a></li>
+            <li><a href="../license.html" title="License">License</a></li>
+            <li><a href="../downloads.html" title="Downloads">Downloads</a></li>
+            <li><a href="../articles.html" title="Articles">Articles</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Concepts and Architecture <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../architecture/overview.html" title="Overview">Overview</a></li>
+            <li><a href="../architecture/nodestate.html" title="The Node State Model">The Node State Model</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Main APIs <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="http://www.day.com/specs/jcr/2.0/index.html" title="JCR API">JCR API</a></li>
+            <li><a href="../oak_api/overview.html" title="Oak API">Oak API</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Features and Plugins <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li class="dropdown-submenu">
+<a href="../nodestore/overview.html" title="Node Storage">Node Storage</a>
+              <ul class="dropdown-menu">
+                  <li><a href="../nodestore/documentmk.html" title="Document NodeStore">Document NodeStore</a></li>
+                  <li><a href="../nodestore/segment/overview.html" title="Segment NodeStore">Segment NodeStore</a></li>
+              </ul>
+            </li>
+            <li><a href="../plugins/blobstore.html" title="Blob Storage">Blob Storage</a></li>
+            <li class="dropdown-submenu">
+<a href="../query/query.html" title="Query">Query</a>
+              <ul class="dropdown-menu">
+                  <li><a href="../query/query-engine.html" title="Query Engine">Query Engine</a></li>
+                  <li><a href="../query/query-troubleshooting.html" title="Troubleshooting">Troubleshooting</a></li>
+                  <li><a href="../query/indexing.html" title="Indexing">Indexing</a></li>
+                  <li><a href="../query/lucene.html" title="Lucene Index">Lucene Index</a></li>
+                  <li><a href="../query/property-index.html" title="Property Index">Property Index</a></li>
+                  <li><a href="../query/solr.html" title="Solr Index">Solr Index</a></li>
+              </ul>
+            </li>
+            <li><a href="../security/overview.html" title="Security">Security</a></li>
+            <li><a href="../features/atomic-counter.html" title="Atomic Counter">Atomic Counter</a></li>
+            <li><a href="../features/observation.html" title="Observation">Observation</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Using Oak <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../use_getting_started.html" title="Getting Started">Getting Started</a></li>
+            <li><a href="../construct.html" title="Repository Construction">Repository Construction</a></li>
+            <li><a href="../osgi_config.html" title="Configuring Oak">Configuring Oak</a></li>
+            <li><a href="../command_line.html" title="Command Line Tools">Command Line Tools</a></li>
+            <li><a href="../migration.html" title="Migration">Migration</a></li>
+            <li><a href="../differences.html" title="Differences to Jackrabbit 2">Differences to Jackrabbit 2</a></li>
+            <li><a href="../known_issues.html" title="Known Issues">Known Issues</a></li>
+            <li><a href="../dos_and_donts.html" title="Dos and Don'ts">Dos and Don'ts</a></li>
+            <li><a href="../coldstandby/coldstandby.html" title="Cold Standby">Cold Standby</a></li>
+            <li><a href="../FAQ.html" title="FAQ">FAQ</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developing Oak <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="../dev_getting_started.html" title="Getting Started">Getting Started</a></li>
+            <li><a href="../participating.html" title="Participating">Participating</a></li>
+            <li><a href="../developing-with-git.html" title="Developing with Git">Developing with Git</a></li>
+            <li><a href="../diagnostic-builds.html" title="Cutting diagnostic builds">Cutting diagnostic builds</a></li>
+            <li><a href="../attribution.html" title="Attribution">Attribution</a></li>
+            <li><a href="../release-schedule.html" title="Release Schedule">Release Schedule</a></li>
+        </ul>
+      </li>
+        <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">Links <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+            <li><a href="http://jackrabbit.apache.org/oak" title="Apache Jackrabbit Oak">Apache Jackrabbit Oak</a></li>
+            <li><a href="http://jackrabbit.apache.org/" title="Apache Jackrabbit">Apache Jackrabbit</a></li>
+        </ul>
+      </li>
+              </ul>
+            </div>
+        </div>
+      </div>
+    </div>
+    <div class="container-fluid">
+      <div id="banner">
+        <div class="pull-left"><div id="bannerLeft"><h2>Oak Documentation</h2>
+</div>
+</div>
+        <div class="pull-right"></div>
+        <div class="clear"><hr/></div>
+      </div>
+
+      <div id="breadcrumbs">
+        <ul class="breadcrumb">
+        <li id="publishDate">Last Published: 2017-07-17<span class="divider">|</span>
+</li>
+          <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
+        </ul>
+      </div>
+      <div class="row-fluid">
+        <div id="leftColumn" class="span2">
+          <div class="well sidebar-nav">
+<ul class="nav nav-list">
+          <li class="nav-header">Overview</li>
+    <li><a href="../index.html" title="Jackrabbit Oak"><span class="none"></span>Jackrabbit Oak</a>  </li>
+    <li><a href="../license.html" title="License"><span class="none"></span>License</a>  </li>
+    <li><a href="../downloads.html" title="Downloads"><span class="none"></span>Downloads</a>  </li>
+    <li><a href="../articles.html" title="Articles"><span class="none"></span>Articles</a>  </li>
+          <li class="nav-header">Concepts and Architecture</li>
+    <li><a href="../architecture/overview.html" title="Overview"><span class="none"></span>Overview</a>  </li>
+    <li><a href="../architecture/nodestate.html" title="The Node State Model"><span class="none"></span>The Node State Model</a>  </li>
+          <li class="nav-header">Main APIs</li>
+    <li><a href="http://www.day.com/specs/jcr/2.0/index.html" class="externalLink" title="JCR API"><span class="none"></span>JCR API</a>  </li>
+    <li><a href="../oak_api/overview.html" title="Oak API"><span class="none"></span>Oak API</a>  </li>
+          <li class="nav-header">Features and Plugins</li>
+  </ul>
+          <hr />
+          <div id="poweredBy">
+  <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="../images/logos/maven-feather.png" /></a>
+              </div>
+          </div>
+        </div>
+        <div id="bodyColumn"  class="span10" >
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+  --><h1>Oak Run Indexing</h1>
+<p><tt>@since Oak 1.7.0</tt></p>
+<p><b>Work in progress. Not to be used on production setups</b></p>
+<p>Oak 1.7 adds indexing tooling to oak-run through the <tt>index</tt> command. The operations supported by this command are described below.</p>
+<p>The <tt>index</tt> command supports connecting to different NodeStores via various options which are documented <a href="../features/oak-run-nodestore-connection-options.html">here</a>. The examples below assume a setup consisting of SegmentNodeStore and FileDataStore. Use the connection options appropriate for your setup.</p>
+<p>By default the tool writes its output files to the directory <tt>indexing-result</tt>, which is referred to below as the output directory.</p>
+<p>Unless specified otherwise, all operations connect to the repository in read-only mode.</p>
+<div class="section">
+<h2><a name="Common_Options"></a>Common Options</h2>
+<p>All the commands support the following common options:</p>
+
+<ol style="list-style-type: decimal">
+  
+<li><tt>--index-paths</tt> - Comma-separated list of index paths for which the selected operations are performed. If not specified, the operation is performed against all indexes.</li>
+</ol>
+<p>Also refer to the help output via the <tt>-h</tt> option for other options.</p></div>
+<div class="section">
+<h2><a name="Generate_Index_Info"></a>Generate Index Info</h2>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-info 
+</pre></div></div>
+<p>Generates a report with various statistics about the indexes present in the given repository. By default the report is stored in <tt>&lt;output dir&gt;/index-info.txt</tt>.</p>
+<p>Supported for all index types.</p></div>
+<div class="section">
+<h2><a name="Dump_Index_Definitions"></a>Dump Index Definitions</h2>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-definitions
+</pre></div></div>
+<p>The <tt>--index-definitions</tt> operation dumps the index definitions in JSON format to the file <tt>&lt;output dir&gt;/index-definitions.json</tt>. The JSON file contains the index definitions keyed by index path.</p>
+<p>Supported for all index types.</p></div>
+<div class="section">
+<h2><a name="Dump_Index_Data"></a>Dump Index Data</h2>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-dump
+</pre></div></div>
+<p>The <tt>--index-dump</tt> operation dumps the index content to the output directory. The output directory contains one folder per index. Each folder has a property file <tt>index-details.txt</tt> which contains the <tt>indexPath</tt>.</p>
+<p>Supported only for Lucene indexes.</p></div>
+<div class="section">
+<h2><a name="Index_Consistency_Check"></a>Index Consistency Check</h2>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --index-consistency-check
+</pre></div></div>
+<p>The <tt>--index-consistency-check</tt> operation performs a consistency check against the indexes. It supports 2 levels:</p>
+
+<ul>
+  
+<li>Level 1 - Specified as <tt>--index-consistency-check=1</tt>. Performs a basic check to determine whether all blobs referenced by the index are valid.</li>
+  
+<li>Level 2 - Specified as <tt>--index-consistency-check=2</tt>. Performs a more thorough check to determine whether all index files are valid and no corruption has happened. This check is slower.</li>
+</ul>
+<p>The report is written to <tt>&lt;output dir&gt;/index-consistency-check-report.txt</tt>.</p>
+<p>Supported only for Lucene indexes.</p></div>
+<div class="section">
+<h2><a name="Reindex"></a>Reindex</h2>
+<p>The reindex operation supports 2 modes of indexing:</p>
+
+<ul>
+  
+<li>Out-of-band indexing - Here oak-run connects to the repository in read-only mode. This mode requires certain manual steps.</li>
+  
+<li>Online indexing - Here oak-run connects to the repository in <tt>--read-write</tt> mode.</li>
+</ul>
+<p>Supported only for Lucene indexes.</p>
+<p>If the indexes being reindexed have fulltext indexing enabled, refer to <a href="#tika-setup">Tika Setup</a> for the steps needed to adapt the command to include Tika support for text extraction.</p>
+<div class="section">
+<h3><a name="A_-_out-of-band_indexing"></a>A - out-of-band indexing</h3>
+<p>Out-of-band indexing has the following phases:</p>
+
+<ol style="list-style-type: decimal">
+  
+<li>Get a checkpoint issued</li>
+  
+<li>Perform indexing with a read-only connection to the NodeStore up to the checkpoint state</li>
+  
+<li>Import the generated indexes</li>
+  
+<li>Complete the incremental indexing from the checkpoint state to the current head</li>
+</ol>
+<div class="section">
+<h4><a name="Step_1_-_Text_PreExtraction"></a>Step 1 - Text PreExtraction</h4>
+<p>If the index being reindexed involves a fulltext index and the repository has binary content, it is recommended to perform <a href="pre-extract-text.html">text pre-extraction</a> first. This ensures that the costly text extraction is done before the actual indexing, so that indexing does not have to extract text in its critical path.</p></div>
+<div class="section">
+<h4><a name="Step_2_-_Create_Checkpoint"></a>Step 2 - Create Checkpoint</h4>
+<p>Go to the <tt>CheckpointMBean</tt> and create a checkpoint with a lifetime of 1 month. &#xab;TBD&#xbb;</p></div>
+<div class="section">
+<h4><a name="Step_3_-_Perform_Reindex"></a>Step 3 - Perform Reindex</h4>
+<p>In this step the actual indexing is performed via oak-run, which connects to the repository in read-only mode.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --fds-path=/path/to/datastore  /path/to/segmentstore/ --reindex --index-paths=/oak:index/indexName
+</pre></div></div>
+<p>The following options can be used here:</p>
+
+<ul>
+  
+<li><tt>--pre-extracted-text-dir</tt> - Path of the directory containing the pre-extracted text generated in step #1</li>
+  
+<li><tt>--index-paths</tt> - This command requires an explicit set of index paths which need to be indexed</li>
+  
+<li><tt>--checkpoint</tt> - The checkpoint up to which the index is updated when indexing in read-only mode. For testing purposes it can be set to &#x2018;head&#x2019; to indicate that the head state should be used.</li>
+</ul></div>
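+<p>As a rough illustration, the options listed for this step could be assembled into a command line by a small wrapper script. The sketch below is written in Python; the function name <tt>build_reindex_command</tt>, its parameters, the jar name and the example paths are hypothetical, and passing <tt>--checkpoint</tt> and <tt>--pre-extracted-text-dir</tt> in the <tt>--option=value</tt> form is an assumption. It only builds the argument list and does not run oak-run.</p>

```python
from typing import List, Optional


def build_reindex_command(
    segment_store: str,
    datastore: str,
    index_paths: List[str],
    checkpoint: Optional[str] = None,
    pre_extracted_text_dir: Optional[str] = None,
    oak_run_jar: str = "oak-run.jar",  # hypothetical jar name
) -> List[str]:
    """Assemble the out-of-band reindex command as an argument list."""
    if not index_paths:
        # The reindex operation needs an explicit set of index paths.
        raise ValueError("at least one index path is required")
    cmd = [
        "java", "-jar", oak_run_jar, "index",
        "--fds-path=" + datastore,
        segment_store,
        "--reindex",
        "--index-paths=" + ",".join(index_paths),
    ]
    # Optional flags; the --option=value form is an assumption.
    if checkpoint is not None:
        cmd.append("--checkpoint=" + checkpoint)
    if pre_extracted_text_dir is not None:
        cmd.append("--pre-extracted-text-dir=" + pre_extracted_text_dir)
    return cmd


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(build_reindex_command(
        "/path/to/segmentstore",
        "/path/to/datastore",
        ["/oak:index/indexName"],
        checkpoint="head",
    )))
```

+<p>Returning a list rather than a single string keeps each argument separate, so the result could be handed to a process launcher without additional shell quoting.</p>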
+<div class="section">
+<h4><a name="Step_4_-_Import_the_index"></a>Step 4 - Import the index</h4>
+<p>As a last step the index needs to be imported back into the repository. This can be done in one of the following ways.</p>
+<div class="section">
+<h5><a name="a4.1_-_Via_oak-run"></a>4.1 - Via oak-run</h5>
+<p>In this mode the index is imported using oak-run.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --index-import --read-write --index-import-dir=&lt;index dir&gt; /path/to/segmentstore
+</pre></div></div>
+<p>Here &#x201c;index dir&#x201d; is the directory which contains the index files created in step #3. Check the logs from the previous command for the directory path.</p>
+<p>This mode should only be used when the repository is from Oak version 1.7+, as oak-run connects to the repository in read-write mode.</p></div>
+<div class="section">
+<h5><a name="a4.2_-_Via_IndexerMBean"></a>4.2 - Via IndexerMBean</h5>
+<p>In this mode the index is imported using JMX. Look for the <tt>IndexerMBean</tt> and then import the index directory using the <tt>importIndex</tt> operation.</p></div>
+<div class="section">
+<h5><a name="a4.3_-_Via_script"></a>4.3 - Via script</h5>
+<p>TODO - Provide a way to import the data on older setup using some script</p></div></div></div>
+<div class="section">
+<h3><a name="B_-_Online_indexing"></a>B - Online indexing</h3>
+<p>Online indexing automates some of the manual steps which are required for out-of-band indexing.</p>
+<p>This mode should only be used when the repository is from Oak version 1.7+, as oak-run connects to the repository in read-write mode.</p>
+<div class="section">
+<h4><a name="Step_1_-_Text_PreExtraction"></a>Step 1 - Text PreExtraction</h4>
+<p>This is the same as in out-of-band indexing.</p></div>
+<div class="section">
+<h4><a name="Step_2_-_Perform_reindexing"></a>Step 2 - Perform reindexing</h4>
+<p>In this step oak-run is configured to connect to the repository in read-write mode and performs all other steps itself, i.e. checkpoint creation, indexing and import.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write /path/to/segmentstore
+</pre></div></div></div></div>
+<div class="section">
+<h3><a name="Tika_Setup"></a><a name="tika-setup"></a> Tika Setup</h3>
+<p>If the indexes being reindexed have fulltext indexing enabled, the Tika library needs to be included in the classpath. This is required even if pre-extraction is used, to ensure that any new binary added after pre-extraction was done can still be indexed.</p>
+<p>First download the <a class="externalLink" href="https://tika.apache.org/download.html">tika-app</a> jar from the Tika downloads page. You should be able to use version 1.15 with the Oak 1.7.4 jar.</p>
+<p>Then modify the index command as shown below. The rest of the arguments remain the same as documented before.</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">java -cp oak-run.jar:tika-app-1.15.jar org.apache.jackrabbit.oak.run.Main index
+</pre></div></div></div></div>
+        </div>
+      </div>
+    </div>
+    <hr/>
+    <footer>
+      <div class="container-fluid">
+        <div class="row-fluid">
+            <p>Copyright &copy;2012&#x2013;2017
+<a href="https://www.apache.org/">The Apache Software Foundation</a>.
+All rights reserved.</p>
+        </div>
+        </div>
+    </footer>
+    </body>
+</html>
\ No newline at end of file

Propchange: jackrabbit/site/live/oak/docs/query/oak-run-indexing.html
------------------------------------------------------------------------------
    svn:eol-style = native

Added: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/pre-extract-text.html?rev=1802100&view=auto
==============================================================================
--- jackrabbit/site/live/oak/docs/query/pre-extract-text.html (added)
+++ jackrabbit/site/live/oak/docs/query/pre-extract-text.html Mon Jul 17 07:23:28 2017
@@ -0,0 +1,338 @@
+<!DOCTYPE html>
+<!--
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-17 
+ | Rendered using Apache Maven Fluido Skin 1.6
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta name="Date-Revision-yyyymmdd" content="20170717" />
+    <meta http-equiv="Content-Language" content="en" />
+    <title>Jackrabbit Oak &#x2013; Pre-Extracting Text from Binaries</title>
+    <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
+    <link rel="stylesheet" href="../css/site.css" />
+    <link rel="stylesheet" href="../css/print.css" media="print" />
+      <script type="text/javascript" src="../js/apache-maven-fluido-1.6.min.js"></script>
+      </head>
+    <body class="topBarEnabled">
+      <div id="topbar" class="navbar navbar-fixed-top ">
+      <div class="navbar-inner">
+        <div class="container-fluid">
+        <a data-target=".nav-collapse" data-toggle="collapse" class="btn btn-navbar">
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+          <span class="icon-bar"></span>
+        </a>
+<a class="brand" href="../"  title="Oak logo"><img src="../oak_logo.png" alt="Oak logo" />
+</a>
+            <ul class="nav">
+              </ul>
+            </div>
+        </div>
+      </div>
+    </div>
+    <div class="container-fluid">
+      <div id="banner">
+        <div class="pull-left"><div id="bannerLeft"><h2>Oak Documentation</h2>
+</div>
+</div>
+        <div class="pull-right"></div>
+        <div class="clear"><hr/></div>
+      </div>
+
+      <div id="breadcrumbs">
+        <ul class="breadcrumb">
+        <li id="publishDate">Last Published: 2017-07-17<span class="divider">|</span>
+</li>
+          <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
+        </ul>
+      </div>
+      <div class="row-fluid">
+        <div id="leftColumn" class="span2">
+          <div class="well sidebar-nav">
+<ul class="nav nav-list">
+  </ul>
+          <hr />
+          <div id="poweredBy">
+  <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"><img class="builtBy" alt="Built by Maven" src="../images/logos/maven-feather.png" /></a>
+              </div>
+          </div>
+        </div>
+        <div id="bodyColumn"  class="span10" >
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+  --><h1>Pre-Extracting Text from Binaries</h1>
+<p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
+<p>Lucene indexing is performed in single-threaded mode. Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. For incremental indexing this mostly works fine, but when performing a reindex or creating the index for the first time after a migration it increases the indexing time considerably. To speed up such cases Oak supports pre-extracting text from binaries, so that text extraction is avoided at indexing time. This feature consists of 2 broad steps:</p>
+
+<ol style="list-style-type: decimal">
+  
+<li>Extract the text from binaries and store it using the oak-run tooling.</li>
+  
+<li>Configure the Oak runtime to use the extracted text at indexing time via <tt>PreExtractedTextProvider</tt>.</li>
+</ol>
+<p>For more details on this feature refer to <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-2892">OAK-2892</a></p>
+<div class="section">
+<h2><a name="A_-_Oak_Run_Pre-Extraction_Command"></a>A - Oak Run Pre-Extraction Command</h2>
+<p>The oak-run tool provides a <tt>tika</tt> command which traverses the repository and extracts text from the binary properties.</p>
+<div class="section">
+<h3><a name="Step_1_-_oak-run_Setup"></a>Step 1 - oak-run Setup</h3>
+<p>Download the following jars:</p>
+
+<ul>
+  
+<li>oak-run 1.7.4</li>
+</ul>
+<p>Refer to <a href="../features/oak-run-nodestore-connection-options.html">oak-run setup</a> for details about connecting to different types of NodeStore. The examples below assume a setup consisting of SegmentNodeStore and FileDataStore. Use the connection options appropriate for your setup.</p>
+<p>You can use the current oak-run version to perform text extraction for older Oak setups, i.e. it is fine to use oak-run from the 1.7.x branch to connect to Oak repositories from version 1.0.x or later. The oak-run tooling connects to the repository in read-only mode and hence is safe to use with older versions.</p>
+<p>The generated directory of extracted text can then be used with the older setup.</p></div>
+<div class="section">
+<h3><a name="Step_2_-_Generate_the_csv_file"></a>Step 2 - Generate the csv file</h3>
+<p>As the first step you need to generate a csv file which contains details about the binary properties. This file is generated by using the <tt>tika</tt> command from oak-run. In this step oak-run connects to the repository in read only mode.</p>
+<p>To generate the csv file use the <tt>--generate</tt> action:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    java -jar oak-run.jar tika \
+    --fds-path /path/to/datastore \
+    /path/to/segmentstore --data-file oak-binary-stats.csv --generate
+</pre></div></div>
+<p>If connecting to S3 this command can take a long time, because checking the binary id currently triggers a download of the actual binary content, which is not required here. To speed this up we can use the Fake DataStore support of oak-run:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    java -jar oak-run.jar tika \
+    --fake-ds-path=temp \
+    /path/to/segmentstore --data-file oak-binary-stats.csv --generate
+</pre></div></div>
+<p>This generates a csv file with content like the following:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">43844ed22d640a114134e5a25550244e8836c00c#28705,28705,&quot;application/octet-stream&quot;,,&quot;/content/activities/jcr:content/folderThumbnail/jcr:content&quot;
+43844ed22d640a114134e5a25550244e8836c00c#28705,28705,&quot;application/octet-stream&quot;,,&quot;/content/snowboarding/jcr:content/folderThumbnail/jcr:content&quot;
+...
+</pre></div></div>
+<p>By default it scans the whole repository. If you need to restrict it to binaries under a certain path, specify that path via the <tt>--path</tt> option.</p></div>
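+<p>Before running the extraction it can help to check what the generated csv contains. The following is a minimal sketch, assuming (based on the sample above) that the first column is the blob id and the third column is the quoted mime type. The function names are hypothetical helpers and not part of oak-run.</p>

```shell
#!/bin/sh
# Summarise a csv produced by the tika generate step.
# Assumption: column 1 is the blob id and column 3 is the quoted mime type,
# as suggested by the sample rows; all other columns are ignored.

# Print the number of distinct blob ids in the csv file given as $1.
count_unique_blobs() {
    cut -d, -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Print "<count> <mimeType>" per mime type, most frequent first.
count_by_mime_type() {
    cut -d, -f3 "$1" | tr -d '"' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Example usage (file name taken from the generate command above):
#   count_unique_blobs oak-binary-stats.csv
#   count_by_mime_type oak-binary-stats.csv
```

+<p>Since the same binary can be referenced from several paths, the number of distinct blob ids may be lower than the number of rows.</p>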
+<div class="section">
+<h3><a name="Step_3_-_Perform_the_text_extraction"></a>Step 3 - Perform the text extraction</h3>
+<p>Once the csv file is generated we need to perform the text extraction. To do that, download the <a class="externalLink" href="https://tika.apache.org/download.html">tika-app</a> jar from the Tika downloads page. You should be able to use version 1.15 with the oak-run 1.7.4 jar.</p>
+<p>To perform the text extraction use the <tt>--extract</tt> action:</p>
+
+<div class="source">
+<div class="source"><pre class="prettyprint">    java -cp oak-run.jar:tika-app-1.15.jar \
+    org.apache.jackrabbit.oak.run.Main tika \
+    --data-file oak-binary-stats.csv \
+    --store-path ./store  \
+    --fds-path /path/to/datastore  extract
+</pre></div></div>
+<p>This command does not require access to the NodeStore and only requires access to the BlobStore. So configure the BlobStore which is in use, like FileDataStore or S3DataStore. The above command performs text extraction using multiple threads and stores the extracted text in the directory specified by <tt>--store-path</tt>.</p>
+<p>Currently the extracted text is stored as one file per blob, in the same format as used by <tt>FileDataStore</tt>. In addition, it creates 2 files:</p>
+
+<ul>
+  
+<li>blobs_error.txt - File containing blobIds for which text extraction ended in error</li>
+  
+<li>blobs_empty.txt - File containing blobIds for which no text was extracted</li>
+</ul>
+<p>This phase is incremental, i.e. if it is run multiple times with the same <tt>--store-path</tt> then it avoids extracting text from previously processed binaries.</p>
+<p>Further, the <tt>extract</tt> phase only needs access to the <tt>BlobStore</tt> and does not require access to the NodeStore. So it can be run from a different machine (possibly a more powerful one, to allow use of multiple cores) to speed up text extraction. One can also split the csv into multiple chunks, process them on different machines and then merge the stores later. Just ensure that at merge time the blobs*.txt files are also merged.</p>
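+<p>The split and merge approach can be sketched as follows. This is a minimal sketch under stated assumptions: the chunk size, the chunk file prefix and the directory layout are illustrative choices, and both function names are hypothetical helpers rather than oak-run features. Only the two list files are merged here; the extracted text files themselves still need to be copied into the merged store separately.</p>

```shell
#!/bin/sh
# Split the csv into chunks and merge the per-store blob id lists.
# Assumption: chunk size, prefix and directory names are illustrative only.

# Split csv $1 into files of at most $2 lines, named <$3>aa, <$3>ab, ...
split_csv() {
    split -l "$2" "$1" "$3"
}

# Merge blobs_error.txt and blobs_empty.txt from the store dirs $2.. into
# dir $1, dropping duplicate blob ids.
merge_blob_lists() {
    target=$1
    shift
    mkdir -p "$target"
    for name in blobs_error.txt blobs_empty.txt; do
        for store in "$@"; do
            if [ -f "$store/$name" ]; then
                cat "$store/$name"
            fi
        done | sort -u > "$target/$name.merged"
        mv "$target/$name.merged" "$target/$name"
    done
}

# Example usage (all paths are hypothetical):
#   split_csv oak-binary-stats.csv 100000 chunk-
#   merge_blob_lists ./merged-store ./store-1 ./store-2
```

+<p>Each chunk file can then be passed as <tt>--data-file</tt> to a separate extraction run with its own <tt>--store-path</tt>.</p>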
+<p>Note that we need to launch the command with <tt>-cp</tt> instead of <tt>-jar</tt> as we need to include classes from outside the oak-run jar, like tika-app. Also ensure that oak-run comes before tika-app in the classpath. This is required due to some old classes being packaged in tika-app.</p></div></div>
+<div class="section">
+<h2><a name="B_-_PreExtractedTextProvider"></a>B - PreExtractedTextProvider</h2>
+<p>In this step we configure Oak to make use of the pre-extracted text for indexing. Depending on how indexing is being performed, you configure the <tt>PreExtractedTextProvider</tt> either in OSGi or in the oak-run <tt>index</tt> command.</p>
+<div class="section">
+<h3><a name="Oak_application"></a>Oak application</h3>
+<p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
+<p>For this, look for the OSGi config for <tt>Apache Jackrabbit Oak DataStore PreExtractedTextProvider</tt>.</p>
+
+<p><img src="pre-extracted-text-osgi.png" alt="OSGi Configuration" /></p>
+<p>Once <tt>PreExtractedTextProvider</tt> is configured, the Lucene indexer makes use of it upon reindexing to check whether text needs to be extracted or not. Check <tt>TextExtractionStatsMBean</tt> for various statistics around text extraction and also to validate that <tt>PreExtractedTextProvider</tt> is being used.</p></div>
+<div class="section">
+<h3><a name="Oak_Run_Indexing"></a>Oak Run Indexing</h3>
+<p>Configure the directory storing the pre-extracted text via the <tt>--pre-extracted-text-dir</tt> option of the <tt>index</tt> command. See <a href="oak-run-indexing.html">oak run indexing</a>.</p></div></div>
+        </div>
+      </div>
+    </div>
+    <hr/>
+    <footer>
+      <div class="container-fluid">
+        <div class="row-fluid">
+            <p>Copyright &copy;2012&#x2013;2017
+<a href="https://www.apache.org/">The Apache Software Foundation</a>.
+All rights reserved.</p>
+        </div>
+                          <div id="ohloh" class="pull-right">
+      <script type="text/javascript" src="https://www.ohloh.net/p/jackrabbit-oak/widgets/project_thin_badge.js"></script>
+    </div>
+        </div>
+    </footer>
+    </body>
+</html>
\ No newline at end of file

Propchange: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
------------------------------------------------------------------------------
    svn:eol-style = native


