lucene-commits mailing list archives

From jbern...@apache.org
Subject [01/13] lucene-solr:branch_7x: SOLR-11947: Squashed commit of the following ref guide changes:
Date Sun, 12 Aug 2018 17:02:34 GMT
Repository: lucene-solr
Updated Branches:
  refs/heads/branch_7x 6759ba729 -> 9a29fd59a


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/regression.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/regression.adoc b/solr/solr-ref-guide/src/regression.adoc
new file mode 100644
index 0000000..b57c62b
--- /dev/null
+++ b/solr/solr-ref-guide/src/regression.adoc
@@ -0,0 +1,439 @@
+= Linear Regression
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+This section of the math expressions user guide covers simple and multivariate linear regression.
+
+
+== Simple Linear Regression
+
+The `regress` function is used to build a linear regression model
+between two random variables. Sample observations are provided with two
+numeric arrays. The first numeric array is the *independent variable* and
+the second array is the *dependent variable*.
+
+In the example below the `random` function selects 5000 random samples each containing
+the fields *filesize_d* and *response_d*. The two fields are vectorized
+and stored in variables *b* and *c*. Then the `regress` function performs a regression
+analysis on the two numeric arrays.
+
+The `regress` function returns a single tuple with the results of the regression
+analysis.
+
+Note that in this regression analysis the value of *RSquared* is *.75*. This means that changes in
+*filesize_d* explain 75% of the variability of the *response_d* variable.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, response_d),
+    d=regress(b, c))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "d": {
+          "significance": 0,
+          "totalSumSquares": 10564812.895147054,
+          "R": 0.8674822407146515,
+          "RSquared": 0.7525254379553127,
+          "meanSquareError": 523.1137343558588,
+          "intercept": -49.528134913099095,
+          "slopeConfidenceInterval": 0.0003171801710329995,
+          "regressionSumSquares": 7950290.450836472,
+          "slope": 0.019945557923159506,
+          "interceptStdErr": 6.489732340389941,
+          "N": 5000
+        }
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 98
+      }
+    ]
+  }
+}
+----
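+
+The *RSquared* value can be checked directly against the sums of squares returned in the same response:
+
+[source,text]
+----
+RSquared = regressionSumSquares / totalSumSquares
+         = 7950290.45 / 10564812.90
+         ≈ 0.7525
+----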
+
+=== Prediction
+
+The `predict` function uses the regression model to make predictions.
+Using the example above the regression model can be used to predict the value
+of *response_d* given a value for *filesize_d*.
+
+In the example below the `predict` function uses the regression analysis to predict
+the value of *response_d* for the *filesize_d* value of 40000.
+
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, response_d),
+    d=regress(b, c),
+    e=predict(d, 40000))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "e": 748.079241022975
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 95
+      }
+    ]
+  }
+}
+----
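+
+The prediction follows the simple linear model `intercept + (slope * x)`. As an illustration, using the coefficients from the earlier `regress` output:
+
+[source,text]
+----
+predicted = intercept + (slope * x)
+          = -49.528 + (0.0199456 * 40000)
+          ≈ 748.29
+----
+
+The result differs slightly from the response above because the `random` function draws a new sample on each run, producing slightly different model coefficients.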
+
+The `predict` function can also make predictions for an array of values. In this
+case it returns an array of predictions.
+
+In the example below the `predict` function uses the regression analysis to
+predict values for each of the 5000 samples of `filesize_d` used to generate the model.
+In this case 5000 predictions are returned.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, response_d),
+    d=regress(b, c),
+    e=predict(d, b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "e": [
+          742.2525322514165,
+          709.6972488729955,
+          687.8382568904871,
+          820.2511324266264,
+          720.4006432289061,
+          761.1578181053039,
+          759.1304101159126,
+          699.5597256337142,
+          742.4738911248204,
+          769.0342605881644,
+          746.6740473150268,
+          ...
+          ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 113
+      }
+    ]
+  }
+}
+----
+
+=== Residuals
+
+The difference between the observed value and the predicted value is known as the
+residual. There isn't a specific function to calculate residuals, but vector
+math can be used to perform the calculation.
+
+In the example below the predictions are stored in variable *e*. The `ebeSubtract`
+function is then used to subtract the predictions
+from the actual *response_d* values stored in variable *c*. Variable *f* contains
+the array of residuals.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, response_d),
+    d=regress(b, c),
+    e=predict(d, b),
+    f=ebeSubtract(c, e))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "f": [
+          31.30678554491226,
+          -30.292830927953446,
+          -30.49508862647258,
+          -30.499884780783532,
+          -9.696458959319784,
+          -30.521563961535094,
+          -30.28380938033081,
+          -9.890289849359306,
+          30.819723560583157,
+          -30.213178859683012,
+          -30.609943619066826,
+          10.527700442607625,
+          10.68046928406568,
+          ...
+          ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 113
+      }
+    ]
+  }
+}
+----
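+
+The residual vector can be analyzed like any other numeric array. For example, a sketch that passes the residuals to the `describe` function (covered in the Statistics section) to summarize their distribution:
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, response_d),
+    d=regress(b, c),
+    e=predict(d, b),
+    f=ebeSubtract(c, e),
+    g=describe(f))
+----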
+
+== Multivariate Linear Regression
+
+The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
+regression models the linear relationship between two or more *independent* variables and a *dependent* variable.
+
+The example below extends the simple linear regression example by introducing a new independent variable
+called *service_d*. The *service_d* variable is the service level of the request, which ranges from 1 to 4
+in the data set. The higher the service level, the higher the bandwidth available for the request.
+
+Notice that the two independent variables *filesize_d* and *service_d* are vectorized and stored
+in the variables *b* and *c*. The variables *b* and *c* are then added as rows to a `matrix`. The matrix is
+then transposed so that each row in the matrix represents one observation with *filesize_d* and *service_d*.
+The `olsRegress` function then performs the multivariate regression analysis using the observation matrix as the
+independent variables and the *response_d* values, stored in variable *d*, as the dependent variable.
+
+Notice that the RSquared of the regression analysis is 1. This means that the linear relationship between
+*filesize_d* and *service_d* describes 100% of the variability of the *response_d* variable.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, service_d),
+    d=col(a, response_d),
+    e=transpose(matrix(b, c)),
+    f=olsRegress(e, d))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "f": {
+          "regressionParametersStandardErrors": [
+            2.0660690430026933e-13,
+            5.1212982077663434e-18,
+            9.10920932555875e-15
+          ],
+          "RSquared": 1,
+          "regressionParameters": [
+            6.553210695971329e-12,
+            0.019999999999999858,
+            -20.49999999999968
+          ],
+          "regressandVariance": 2124.130825172683,
+          "regressionParametersVariance": [
+            [
+              0.013660174897582315,
+              -3.361258014840509e-7,
+              -0.00006893737578369605
+            ],
+            [
+              -3.361258014840509e-7,
+              8.393183709503206e-12,
+              6.430253229589981e-11
+            ],
+            [
+              -0.00006893737578369605,
+              6.430253229589981e-11,
+              0.000026553878455570856
+            ]
+          ],
+          "adjustedRSquared": 1,
+          "residualSumSquares": 9.373703759269822e-20
+        }
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 690
+      }
+    ]
+  }
+}
+----
+
+=== Prediction
+
+The `predict` function can also be used to make predictions for multivariate linear regression. Below is an example
+of a single prediction using the multivariate linear regression model and a single observation. The observation
+is an array that matches the structure of the observation matrix used to build the model. In this case
+the first value represents a *filesize_d* of 40000 and the second value represents a *service_d* of 4.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, service_d),
+    d=col(a, response_d),
+    e=transpose(matrix(b, c)),
+    f=olsRegress(e, d),
+    g=predict(f, array(40000, 4)))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "g": 718.0000000000005
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 117
+      }
+    ]
+  }
+}
+----
+
+The `predict` function can also make predictions for more than one multivariate observation. In this scenario
+an observation matrix is used. In the example below the observation matrix used to build the multivariate regression model
+is passed to the `predict` function, which returns an array of predictions.
+
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, service_d),
+    d=col(a, response_d),
+    e=transpose(matrix(b, c)),
+    f=olsRegress(e, d),
+    g=predict(f, e))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "g": [
+          685.498283591961,
+          801.2175699959365,
+          776.7638245911025,
+          610.3559852681935,
+          751.0925865965207,
+          787.2914663381897,
+          744.3632053810668,
+          688.3729301599697,
+          765.367783417171,
+          724.9309687628346,
+          834.4350712384264,
+          ...
+          ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 113
+      }
+    ]
+  }
+}
+----
+
+=== Residuals
+
+Once the predictions are generated the residuals can be calculated using the same approach used with
+simple linear regression.
+
+Below is an example of the residuals calculation following a multivariate linear regression. In the example
+the predictions stored in variable *g* are subtracted from the observed values stored in variable *d*.
+
+[source,text]
+----
+let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
+    b=col(a, filesize_d),
+    c=col(a, service_d),
+    d=col(a, response_d),
+    e=transpose(matrix(b, c)),
+    f=olsRegress(e, d),
+    g=predict(f, e),
+    h=ebeSubtract(d, g))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "h": [
+         1.1368683772161603e-13,
+         1.1368683772161603e-13,
+         0,
+         1.1368683772161603e-13,
+         0,
+         1.1368683772161603e-13,
+         0,
+         2.2737367544323206e-13,
+         1.1368683772161603e-13,
+         2.2737367544323206e-13,
+         1.1368683772161603e-13,
+          ...
+          ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 113
+      }
+    ]
+  }
+}
+----
+
+
+
+

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/scalar-math.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/scalar-math.adoc b/solr/solr-ref-guide/src/scalar-math.adoc
new file mode 100644
index 0000000..07b1eb5
--- /dev/null
+++ b/solr/solr-ref-guide/src/scalar-math.adoc
@@ -0,0 +1,137 @@
+= Scalar Math
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+The most basic math expressions are scalar expressions. Scalar expressions
+perform mathematical operations on numbers.
+
+For example the expression below adds two numbers together:
+
+[source,text]
+----
+add(1, 1)
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 2
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 2
+      }
+    ]
+  }
+}
+----
+
+Math expressions can be nested. For example in the expression
+below the output of the `add` function is the second parameter
+of the `pow` function:
+
+[source,text]
+----
+pow(10, add(1,1))
+----
+
+This expression returns the following response:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 100
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Streaming Scalar Math
+
+Scalar math expressions can also be applied to each tuple in a stream
+through use of the `select` stream decorator. The `select` function wraps a
+stream of tuples and selects fields to include in each tuple.
+The `select` function can also use math expressions to compute
+new values and add them to the outgoing tuples.
+
+In the example below the `select` expression is wrapping a search
+expression. The `select` function is selecting the *price_f* field
+and computing a new field called *newPrice* using the `mult` math
+expression.
+
+The first parameter of the `mult` expression is the *price_f* field.
+The second parameter is the scalar value 10. This multiplies the value
+of the *price_f* field in each tuple by 10.
+
+[source,text]
+----
+select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3"),
+       price_f,
+       mult(price_f, 10) as newPrice)
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "price_f": 0.99999994,
+        "newPrice": 9.9999994
+      },
+      {
+        "price_f": 0.99999994,
+        "newPrice": 9.9999994
+      },
+      {
+        "price_f": 0.9999992,
+        "newPrice": 9.999992
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 3
+      }
+    ]
+  }
+}
+----
+
+== More Scalar Math Functions
+
+The following scalar math functions are available in the math expressions library:
+
+`abs`, `add`, `div`, `mult`, `sub`, `log`,
+`pow`, `mod`, `ceil`, `floor`, `sin`, `asin`,
+`sinh`, `cos`, `acos`, `cosh`, `tan`, `atan`,
+`tanh`, `round`, `precision`, `sqrt`, `cbrt`
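+
+These functions nest like any other math expression. As a sketch, the expression below combines `sqrt`, `mult`, and `round`:
+
+[source,text]
+----
+round(mult(sqrt(16), 2.5))
+----
+
+Here `sqrt(16)` evaluates to 4, which is multiplied by 2.5 and then rounded, returning 10.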
+

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/statistics.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/statistics.adoc b/solr/solr-ref-guide/src/statistics.adoc
new file mode 100644
index 0000000..74da76b
--- /dev/null
+++ b/solr/solr-ref-guide/src/statistics.adoc
@@ -0,0 +1,575 @@
+= Statistics
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+This section of the user guide covers the core statistical functions
+available in math expressions.
+
+== Descriptive Statistics
+
+The `describe` function can be used to return descriptive statistics about a
+numeric array. The `describe` function returns a single *tuple* with name/value
+pairs containing descriptive statistics.
+
+Below is a simple example that selects a random sample of documents,
+vectorizes the *price_f* field in the result set and uses the `describe` function to
+return descriptive statistics about the vector:
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
+    b=col(a, price_f),
+    c=describe(b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": {
+          "sumsq": 4999.041975263254,
+          "max": 0.99995726,
+          "var": 0.08344429493940454,
+          "geometricMean": 0.36696588922559575,
+          "sum": 7497.460565552007,
+          "kurtosis": -1.2000739963006035,
+          "N": 15000,
+          "min": 0.00012338161,
+          "mean": 0.49983070437013266,
+          "popVar": 0.08343873198640858,
+          "skewness": -0.001735537500095477,
+          "stdev": 0.28886726179926403
+        }
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 305
+      }
+    ]
+  }
+}
+----
+
+== Histograms and Frequency Tables
+
+Histograms and frequency tables are tools for understanding the distribution
+of a random variable.
+
+The `hist` function creates a histogram designed for usage with continuous data. The
+`freqTable` function creates a frequency table for use with discrete data.
+
+=== Histograms
+
+Below is an example that selects a random sample, creates a vector from the
+result set and uses the `hist` function to return a histogram with 5 bins.
+The `hist` function returns a list of tuples with summary statistics for each bin.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
+    b=col(a, price_f),
+    c=hist(b, 5))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          {
+            "prob": 0.2057939717603699,
+            "min": 0.000010371208,
+            "max": 0.19996578,
+            "mean": 0.10010319358402578,
+            "var": 0.003366805016271609,
+            "cumProb": 0.10293732468049072,
+            "sum": 309.0185585938884,
+            "stdev": 0.058024176136086666,
+            "N": 3087
+          },
+          {
+            "prob": 0.19381868629885585,
+            "min": 0.20007741,
+            "max": 0.3999073,
+            "mean": 0.2993590803885827,
+            "var": 0.003401644034068929,
+            "cumProb": 0.3025295802728267,
+            "sum": 870.5362057700005,
+            "stdev": 0.0583236147205309,
+            "N": 2908
+          },
+          {
+            "prob": 0.20565789836690007,
+            "min": 0.39995712,
+            "max": 0.5999038,
+            "mean": 0.4993620963792545,
+            "var": 0.0033158364923609046,
+            "cumProb": 0.5023006239697967,
+            "sum": 1540.5320673300018,
+            "stdev": 0.05758330046429177,
+            "N": 3085
+          },
+          {
+            "prob": 0.19437108496008693,
+            "min": 0.6000449,
+            "max": 0.79973197,
+            "mean": 0.7001752711861512,
+            "var": 0.0033895105082360185,
+            "cumProb": 0.7026537198687285,
+            "sum": 2042.4112660500066,
+            "stdev": 0.058219502816805456,
+            "N": 2917
+          },
+          {
+            "prob": 0.20019582213899467,
+            "min": 0.7999126,
+            "max": 0.99987316,
+            "mean": 0.8985428275824184,
+            "var": 0.003312360017780078,
+            "cumProb": 0.899450457219298,
+            "sum": 2698.3241112299997,
+            "stdev": 0.05755310606544253,
+            "N": 3003
+          }
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 322
+      }
+    ]
+  }
+}
+----
+
+The `col` function can be used to *vectorize* a column of data from the list of tuples
+returned by the `hist` function.
+
+In the example below, the *N* field,
+which is the number of observations in each bin, is returned as a vector.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
+     b=col(a, price_f),
+     c=hist(b, 11),
+     d=col(c, N))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "d": [
+          1387,
+          1396,
+          1391,
+          1357,
+          1384,
+          1360,
+          1367,
+          1375,
+          1307,
+          1310,
+          1366
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 307
+      }
+    ]
+  }
+}
+----
+
+=== Frequency Tables
+
+The `freqTable` function returns a frequency distribution for a discrete data set.
+The `freqTable` function doesn't create bins like the histogram. Instead it counts
+the occurrence of each discrete data value and returns a list of tuples with the
+frequency statistics for each value. Fields from a frequency table can be vectorized
+using the `col` function in the same manner as a histogram.
+
+Below is a simple example of a frequency table built from a random sample of
+a discrete variable.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
+     b=col(a, day_i),
+     c=freqTable(b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          {
+            "pct": 0.0318,
+            "count": 477,
+            "cumFreq": 477,
+            "cumPct": 0.0318,
+            "value": 0
+          },
+          {
+            "pct": 0.033133333333333334,
+            "count": 497,
+            "cumFreq": 974,
+            "cumPct": 0.06493333333333333,
+            "value": 1
+          },
+          {
+            "pct": 0.03426666666666667,
+            "count": 514,
+            "cumFreq": 1488,
+            "cumPct": 0.0992,
+            "value": 2
+          },
+          {
+            "pct": 0.0346,
+            "count": 519,
+            "cumFreq": 2007,
+            "cumPct": 0.1338,
+            "value": 3
+          },
+          {
+            "pct": 0.03133333333333333,
+            "count": 470,
+            "cumFreq": 2477,
+            "cumPct": 0.16513333333333333,
+            "value": 4
+          },
+          {
+            "pct": 0.03333333333333333,
+            "count": 500,
+            "cumFreq": 2977,
+            "cumPct": 0.19846666666666668,
+            "value": 5
+          }
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 281
+      }
+    ]
+  }
+}
+----
+
+== Percentiles
+
+The `percentile` function returns the estimated value of a specific percentile in
+a sample set. The example below returns an estimate of the 95th percentile
+of the *price_f* field.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
+     b=col(a, price_f),
+     c=percentile(b, 95))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+ {
+   "result-set": {
+     "docs": [
+       {
+         "c": 312.94
+       },
+       {
+         "EOF": true,
+         "RESPONSE_TIME": 286
+       }
+     ]
+   }
+ }
+----
+
+== Covariance and Correlation
+
+Covariance and Correlation measure how random variables move
+together.
+
+=== Covariance and Covariance Matrices
+
+The `cov` function calculates the covariance of two sample sets of data.
+
+In the example below covariance is calculated for two numeric
+arrays created by the `array` function. It's important to note that
+vectorized data from SolrCloud collections can be used with any function that
+operates on arrays.
+
+[source,text]
+----
+let(a=array(1, 2, 3, 4, 5),
+    b=array(100, 200, 300, 400, 500),
+    c=cov(a, b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+ {
+   "result-set": {
+     "docs": [
+       {
+         "c": 0.9484775349999998
+       },
+       {
+         "EOF": true,
+         "RESPONSE_TIME": 286
+       }
+     ]
+   }
+ }
+----
+
+If a matrix is passed to the `cov` function it will automatically compute a covariance
+matrix for the columns of the matrix.
+
+Notice that in the example below three numeric arrays are added as rows
+in a matrix. The matrix is then transposed to turn the rows into
+columns, and the covariance matrix is computed for the columns of the
+matrix.
+
+[source,text]
+----
+let(a=array(1, 2, 3, 4, 5),
+     b=array(100, 200, 300, 400, 500),
+     c=array(30, 40, 80, 90, 110),
+     d=transpose(matrix(a, b, c)),
+     e=cov(d))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+ {
+   "result-set": {
+     "docs": [
+       {
+         "e": [
+           [
+             2.5,
+             250,
+             52.5
+           ],
+           [
+             250,
+             25000,
+             5250
+           ],
+           [
+             52.5,
+             5250,
+             1150
+           ]
+         ]
+       },
+       {
+         "EOF": true,
+         "RESPONSE_TIME": 2
+       }
+     ]
+   }
+ }
+----
+
+=== Correlation and Correlation Matrices
+
+Correlation is a measure of the covariance of two variables that has been scaled between
+-1 and 1.
+
+Three correlation types are supported:
+
+* *pearsons* (default)
+* *kendalls*
+* *spearmans*
+
+The type of correlation is specified by adding the *type* named parameter in the
+function call. The example below demonstrates the use of the *type*
+named parameter.
+
+[source,text]
+----
+let(a=array(1, 2, 3, 4, 5),
+    b=array(100, 200, 300, 400, 5000),
+    c=corr(a, b, type=spearmans))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+ {
+   "result-set": {
+     "docs": [
+       {
+         "c": 0.7432941462471664
+       },
+       {
+         "EOF": true,
+         "RESPONSE_TIME": 0
+       }
+     ]
+   }
+ }
+----
+
+Like the `cov` function, the `corr` function automatically builds a correlation matrix
+if a matrix is passed as a parameter. The correlation matrix is built by correlating the columns
+of the matrix passed in.
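+
+For example, the covariance matrix example above can be adapted to produce a correlation matrix. The diagonal of the result contains 1s, since each column correlates perfectly with itself:
+
+[source,text]
+----
+let(a=array(1, 2, 3, 4, 5),
+    b=array(100, 200, 300, 400, 500),
+    c=array(30, 40, 80, 90, 110),
+    d=transpose(matrix(a, b, c)),
+    e=corr(d))
+----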
+
+== Statistical Inference Tests
+
+Statistical inference tests evaluate a hypothesis on *random samples* and return p-values, which
+can be used to infer how reliably the result generalizes to the entire population.
+
+The following statistical inference tests are available:
+
+* `anova`: One-way ANOVA tests if there is a statistically significant difference in the
+means of two or more random samples.
+
+* `ttest`: The T-test tests if there is a statistically significant difference in the means of two
+random samples.
+
+* `pairedTtest`: The paired t-test tests if there is a statistically significant difference
+in the means of two random samples with paired data.
+
+* `gTestDataSet`: The G-test tests if two samples of binned discrete data were drawn
+from the same population.
+
+* `chiSquareDataset`: The Chi-Squared test tests if two samples of binned discrete data were
+drawn from the same population.
+
+* `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two
+samples of continuous data were drawn from the same population. The Mann-Whitney
+test is often used instead of the T-test when the underlying assumptions of the
+T-test are not met.
+
+* `ks`: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from
+the same distribution.
+
+Below is a simple example of a T-test performed on two random samples.
+The returned p-value of .93 means we fail to reject the null hypothesis
+that the two sample means are equal; there is no statistically significant difference in the means.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
+    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
+    c=col(a, price_f),
+    d=col(b, price_f),
+    e=ttest(c, d))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "e": {
+          "p-value": 0.9350135639249795,
+          "t-statistic": 0.081545541074817
+        }
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 48
+      }
+    ]
+  }
+}
+----
+
+== Transformations
+
+In statistical analysis it's often useful to transform data sets before performing
+statistical calculations. The statistical function library includes the following
+commonly used transformations:
+
+* `rank`: Returns a numeric array with the rank-transformed value of each element of the original
+array.
+
+* `log`: Returns a numeric array with the natural log of each element of the original array.
+
+* `sqrt`: Returns a numeric array with the square root of each element of the original array.
+
+* `cbrt`: Returns a numeric array with the cube root of each element of the original array.
+
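+A quick sketch of the `rank` transformation on a small array. Each element is replaced by its position in sorted order, so *b* would hold the ranks 1, 4, 3, 2:
+
+[source,text]
+----
+let(a=array(100, 500, 300, 200),
+    b=rank(a))
+----
+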
+Below is an example of a t-test performed on log-transformed data sets:
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
+    b=random(collection1, q="*:*", rows="1500", fl="price_f"),
+    c=log(col(a, price_f)),
+    d=log(col(b, price_f)),
+    e=ttest(c, d))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "e": {
+          "p-value": 0.9655110070265056,
+          "t-statistic": -0.04324265449471238
+        }
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 58
+      }
+    ]
+  }
+}
+----
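+
+The other transformations follow the same pattern as `log`. Below is a sketch of the
+`rank` transformation applied to a small illustrative array:
+
+[source,text]
+----
+let(a=array(10, 30, 20),
+    b=rank(a))
+----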

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/streaming-expressions.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/streaming-expressions.adoc b/solr/solr-ref-guide/src/streaming-expressions.adoc
index 1c34c73..3273df9 100644
--- a/solr/solr-ref-guide/src/streaming-expressions.adoc
+++ b/solr/solr-ref-guide/src/streaming-expressions.adoc
@@ -1,5 +1,5 @@
 = Streaming Expressions
-:page-children: stream-source-reference, stream-decorator-reference, stream-evaluator-reference, statistical-programming, graph-traversal
+:page-children: stream-source-reference, stream-decorator-reference, stream-evaluator-reference, statistical-programming, math-expressions, graph-traversal
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/term-vectors.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/term-vectors.adoc b/solr/solr-ref-guide/src/term-vectors.adoc
new file mode 100644
index 0000000..cbd21a0
--- /dev/null
+++ b/solr/solr-ref-guide/src/term-vectors.adoc
@@ -0,0 +1,237 @@
+= Text Analysis and Term Vectors
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+TF-IDF term vectors are often used to represent text documents when performing text mining
+and machine learning operations. This section of the user guide describes how to
+use math expressions to perform text analysis and create TF-IDF term vectors.
+
+== Text Analysis
+
+The `analyze` function applies a Solr analyzer to a text field and returns the tokens
+emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's
+schema can be used with the `analyze` function.
+
+In the example below, the text "hello world" is analyzed using the analyzer chain attached to the *subject* field in
+the schema. The *subject* field is defined as the field type *text_general* and the text is analyzed using the
+analysis chain configured for the *text_general* field type.
+
+[source,text]
+----
+analyze("hello world", subject)
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          "hello",
+          "world"
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+=== Annotating Documents
+
+The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
+generated by the analysis.
+
+The example below performs a `search` in collection1. Each tuple returned by the `search`
+contains an *id* and *subject*. For each tuple, the
+`select` function selects the *id* field and calls the `analyze` function on the *subject* field.
+The analyzer chain specified by the *subject_bigram* field is configured to perform a bigram analysis.
+The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.
+
+Notice in the output that an array of bigram terms has been added to each tuple.
+
+[source,text]
+----
+select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
+       id,
+       analyze(subject, subject_bigram) as terms)
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "terms": [
+          "text analysis",
+          "analysis example"
+        ],
+        "id": "1"
+      },
+      {
+        "terms": [
+          "example number",
+          "number two"
+        ],
+        "id": "2"
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 4
+      }
+    ]
+  }
+}
+----
+
+== Term Vectors
+
+The `termVectors` function can be used to build *TF-IDF*
+term vectors from the terms generated by the `analyze` function.
+
+The `termVectors` function operates over a list of tuples that contain a field
+called *id* and a field called *terms*. Notice
+that this is the exact output structure of the *document annotation* example above.
+
+The `termVectors` function builds a *matrix* from the list of tuples. There is a *row* in the
+matrix for each tuple in the list and a *column* for each term in the *terms*
+field.
+
+The example below builds on the *document annotation* example.
+The list of tuples is stored in variable *a*. The `termVectors` function
+operates over variable *a* and builds a matrix with *2 rows* and *4 columns*.
+
+The `termVectors` function also sets the *row* and *column* labels of the term vectors matrix.
+The row labels are the document ids and the
+column labels are the terms.
+
+In the example below, the `getRowLabels` and `getColumnLabels` functions return
+the row and column labels, which are then stored in variables *c* and *d*.
+The *echo* parameter echoes variables *c* and *d*, so the output includes
+the row and column labels.
+
+[source,text]
+----
+let(echo="c, d",
+    a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
+             id,
+             analyze(subject, subject_bigram) as terms),
+    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1),
+    c=getRowLabels(b),
+    d=getColumnLabels(b))
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          "1",
+          "2"
+        ],
+        "d": [
+          "analysis example",
+          "example number",
+          "number two",
+          "text analysis"
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 5
+      }
+    ]
+  }
+}
+----
+
+=== TF-IDF Values
+
+The values within the term vectors matrix are the TF-IDF values for each term in each document. The
+example below shows the values of the matrix.
+
+[source,text]
+----
+let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
+             id,
+             analyze(subject, subject_bigram) as terms),
+    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "b": [
+          [
+            1.4054651081081644,
+            0,
+            0,
+            1.4054651081081644
+          ],
+          [
+            0,
+            1.4054651081081644,
+            1.4054651081081644,
+            0
+          ]
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 5
+      }
+    ]
+  }
+}
+----
+
+=== Limiting the Noise
+
+One of the key challenges when working with term vectors is that text often has a significant amount of noise
+which can obscure the important terms in the data. The `termVectors` function has several parameters
+designed to filter out the less meaningful terms. This is also important because eliminating
+the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory.
+
+There are four parameters designed to filter noisy terms from the term vector matrix:
+
+* *minTermLength*: The minimum term length required to include the term in the matrix.
+* *minDocFreq*: The minimum *percentage* (0 to 1) of documents the term must appear in to be included in the matrix.
+* *maxDocFreq*: The maximum *percentage* (0 to 1) of documents the term can appear in to be included in the matrix.
+* *exclude*: A comma-delimited list of strings used to exclude terms. If a term contains any of the exclude strings, that
+term will be excluded from the term vector.
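+
+Below is a sketch showing how these parameters might be combined, building on the
+*document annotation* example above. The percentage thresholds and the exclude string
+are illustrative values, not recommendations:
+
+[source,text]
+----
+let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
+             id,
+             analyze(subject, subject_bigram) as terms),
+    b=termVectors(a, minTermLength=4, minDocFreq=0.05, maxDocFreq=0.95, exclude="number"))
+----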

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/time-series.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/time-series.adoc b/solr/solr-ref-guide/src/time-series.adoc
new file mode 100644
index 0000000..e765270
--- /dev/null
+++ b/solr/solr-ref-guide/src/time-series.adoc
@@ -0,0 +1,431 @@
+= Time Series
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This section of the user guide provides an overview of time series *aggregation*,
+*smoothing* and *differencing*.
+
+== Time Series Aggregation
+
+The `timeseries` function performs fast, distributed time
+series aggregation leveraging Solr's builtin faceting and date math capabilities.
+
+The example below performs a monthly time series aggregation:
+
+[source,text]
+----
+timeseries(collection1,
+           q=*:*,
+           field="recdate_dt",
+           start="2012-01-20T17:33:18Z",
+           end="2012-12-20T17:33:18Z",
+           gap="+1MONTH",
+           format="YYYY-MM",
+           count(*))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "recdate_dt": "2012-01",
+        "count(*)": 8703
+      },
+      {
+        "recdate_dt": "2012-02",
+        "count(*)": 8648
+      },
+      {
+        "recdate_dt": "2012-03",
+        "count(*)": 8621
+      },
+      {
+        "recdate_dt": "2012-04",
+        "count(*)": 8533
+      },
+      {
+        "recdate_dt": "2012-05",
+        "count(*)": 8792
+      },
+      {
+        "recdate_dt": "2012-06",
+        "count(*)": 8598
+      },
+      {
+        "recdate_dt": "2012-07",
+        "count(*)": 8679
+      },
+      {
+        "recdate_dt": "2012-08",
+        "count(*)": 8469
+      },
+      {
+        "recdate_dt": "2012-09",
+        "count(*)": 8637
+      },
+      {
+        "recdate_dt": "2012-10",
+        "count(*)": 8536
+      },
+      {
+        "recdate_dt": "2012-11",
+        "count(*)": 8785
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 16
+      }
+    ]
+  }
+}
+----
+
+== Vectorizing the Time Series
+
+Before a time series result can be operated on by math expressions
+the data needs to be vectorized. Specifically,
+in the example above, the aggregation field count(*) needs to be moved into an array.
+As described in the Streams and Vectorization section of the user guide, the `col` function can be used
+to copy a numeric column from a list of tuples into an array.
+
+The expression below demonstrates the vectorization of the count(*) field.
+
+[source,text]
+----
+let(a=timeseries(collection1,
+                 q=*:*,
+                 field="test_dt",
+                 start="2012-01-20T17:33:18Z",
+                 end="2012-12-20T17:33:18Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 count(*)),
+    b=col(a, count(*)))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "b": [
+          8703,
+          8648,
+          8621,
+          8533,
+          8792,
+          8598,
+          8679,
+          8469,
+          8637,
+          8536,
+          8785
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 5
+      }
+    ]
+  }
+}
+----
+
+== Smoothing
+
+Time series smoothing is often used to remove the noise from a time series and help
+spot the underlying trends.
+The math expressions library has three *sliding window* approaches
+for time series smoothing. The *sliding window* approaches use a summary value
+from a sliding window of the data to calculate a new set of smoothed data points.
+
+The three *sliding window* functions are lagging indicators, which means
+they don't start to move in the direction of the trend until the trend affects
+the summary value of the sliding window. Because of this lagging quality these smoothing
+functions are often used to confirm the direction of the trend.
+
+=== Moving Average
+
+The `movingAvg` function computes a simple moving average over a sliding window of data.
+The example below generates a time series, vectorizes the count(*) field and computes the
+moving average with a window size of 3.
+
+The moving average function returns an array that is shorter
+than the original data set. This is because results are generated only when a full window of data
+is available for computing the average. With a window size of three the moving average will
+begin generating results at the 3rd value. The prior values are not included in the result.
+
+This is true for all the sliding window functions.
+
+[source,text]
+----
+let(a=timeseries(collection1,
+                 q=*:*,
+                 field="test_dt",
+                 start="2012-01-20T17:33:18Z",
+                 end="2012-12-20T17:33:18Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 count(*)),
+    b=col(a, count(*)),
+    c=movingAvg(b, 3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          8657.333333333334,
+          8600.666666666666,
+          8648.666666666666,
+          8641,
+          8689.666666666666,
+          8582,
+          8595,
+          8547.333333333334,
+          8652.666666666666
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 7
+      }
+    ]
+  }
+}
+----
+
+=== Exponential Moving Average
+
+The `expMovingAvg` function uses a different formula for computing the moving average that
+responds faster to changes in the underlying data. This means that it is
+less of a lagging indicator than the simple moving average.
+
+Below is an example that computes an exponential moving average:
+
+[source,text]
+----
+let(a=timeseries(collection1, q=*:*,
+                 field="test_dt",
+                 start="2012-01-20T17:33:18Z",
+                 end="2012-12-20T17:33:18Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 count(*)),
+    b=col(a, count(*)),
+    c=expMovingAvg(b, 3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          8657.333333333334,
+          8595.166666666668,
+          8693.583333333334,
+          8645.791666666668,
+          8662.395833333334,
+          8565.697916666668,
+          8601.348958333334,
+          8568.674479166668,
+          8676.837239583334
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 5
+      }
+    ]
+  }
+}
+----
+
+=== Moving Median
+
+The `movingMedian` function uses the median of the sliding window rather than the average.
+In many cases the moving median will be more *robust* to outliers than moving averages.
+
+Below is an example computing the moving median:
+
+[source,text]
+----
+let(a=timeseries(collection1,
+                 q=*:*,
+                 field="test_dt",
+                 start="2012-01-20T17:33:18Z",
+                 end="2012-12-20T17:33:18Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 count(*)),
+    b=col(a, count(*)),
+    c=movingMedian(b, 3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          8648,
+          8621,
+          8621,
+          8598,
+          8679,
+          8598,
+          8637,
+          8536,
+          8637
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 7
+      }
+    ]
+  }
+}
+----
+
+== Differencing
+
+Differencing is often used to remove the
+trend or seasonality from a time series. This is known as making a time series
+*stationary*.
+
+=== First Difference
+
+The technique of differencing is to use the difference between values rather than the
+original values. The *first difference* takes the difference between a value and the value
+that came directly before it. The first difference is often used to remove the trend
+from a time series.
+
+In the example below, the `diff` function computes the first difference of a time series.
+The result array is one value shorter than the original array.
+This is because the `diff` function only returns a result for values
+where the prior value has been subtracted.
+
+[source,text]
+----
+let(a=timeseries(collection1,
+                 q=*:*,
+                 field="test_dt",
+                 start="2012-01-20T17:33:18Z",
+                 end="2012-12-20T17:33:18Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 count(*)),
+    b=col(a, count(*)),
+    c=diff(b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          -55,
+          -27,
+          -88,
+          259,
+          -194,
+          81,
+          -210,
+          168,
+          -101,
+          249
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 11
+      }
+    ]
+  }
+}
+----
+
+=== Lagged Differences
+
+The `diff` function has an optional second parameter to specify a lag in the difference.
+If a lag is specified the difference is taken between a value and the value at a specified
+lag in the past. Lagged differences are often used to remove seasonality from a time series.
+
+The simple example below demonstrates how lagged differencing works.
+Notice that the array in the example follows a simple repeated pattern. This type of pattern
+is often seen with seasonality. In this example we can remove this pattern using
+the `diff` function with a lag of 4. This will subtract the value lagging four indexes
+behind the current index. Notice that the result set size is the original array size minus the lag.
+This is because the `diff` function only returns results for values where the lag of 4
+is possible to compute.
+
+[source,text]
+----
+let(a=array(1,2,5,2,1,2,5,2,1,2,5),
+    b=diff(a, 4))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "b": [
+          0,
+          0,
+          0,
+          0,
+          0,
+          0,
+          0
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/variables.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/variables.adoc b/solr/solr-ref-guide/src/variables.adoc
new file mode 100644
index 0000000..7e12e75
--- /dev/null
+++ b/solr/solr-ref-guide/src/variables.adoc
@@ -0,0 +1,147 @@
+= Variables
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+== The Let Expression
+
+The `let` expression sets variables and returns
+the value of the last variable by default. The output of any streaming expression
+or math expression can be set to a variable.
+
+Below is a simple example setting three variables *a*, *b*
+and *c*. Variables *a* and *b* are set to arrays. The variable *c* is set
+to the output of the `ebeAdd` function which performs element-by-element
+addition of the two arrays.
+
+Notice that the last variable, *c*, is returned.
+
+[source,text]
+----
+let(a=array(1, 2, 3),
+    b=array(10, 20, 30),
+    c=ebeAdd(a, b))
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": [
+          11,
+          22,
+          33
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 4
+      }
+    ]
+  }
+}
+----
+
+== Echoing Variables
+
+All variables can be output by setting the *echo* parameter to *true*.
+
+[source,text]
+----
+let(echo=true,
+    a=array(1, 2, 3),
+    b=array(10, 20, 30),
+    c=ebeAdd(a, b))
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "a": [
+          1,
+          2,
+          3
+        ],
+        "b": [
+          10,
+          20,
+          30
+        ],
+        "c": [
+          11,
+          22,
+          33
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+A specific set of variables can be echoed by providing a comma-delimited
+list of variables to the echo parameter.
+
+[source,text]
+----
+let(echo="a,b",
+    a=array(1, 2, 3),
+    b=array(10, 20, 30),
+    c=ebeAdd(a, b))
+----
+
+When this expression is sent to the /stream handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "a": [
+          1,
+          2,
+          3
+        ],
+        "b": [
+          10,
+          20,
+          30
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/vector-math.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/vector-math.adoc b/solr/solr-ref-guide/src/vector-math.adoc
new file mode 100644
index 0000000..22d610f
--- /dev/null
+++ b/solr/solr-ref-guide/src/vector-math.adoc
@@ -0,0 +1,343 @@
+= Vector Math
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This section of the user guide covers vector math and
+vector manipulation functions.
+
+== Arrays
+
+Arrays can be created with the `array` function.
+
+For example, the expression below creates a numeric array with
+three elements:
+
+[source,text]
+----
+array(1, 2, 3)
+----
+
+When this expression is sent to the /stream handler it responds with
+a JSON array:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          1,
+          2,
+          3
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Array Operations
+
+Arrays can be passed as parameters to functions that operate on arrays.
+
+For example, an array can be reversed with the `rev` function:
+
+[source,text]
+----
+rev(array(1, 2, 3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          3,
+          2,
+          1
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+Another example is the `length` function,
+which returns the length of an array:
+
+[source,text]
+----
+length(array(1, 2, 3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 3
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+A slice of an array can be taken with the `copyOfRange` function, which
+copies elements of an array from a start index (inclusive) to an end index (exclusive).
+
+[source,text]
+----
+copyOfRange(array(1,2,3,4,5,6), 1, 4)
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          2,
+          3,
+          4
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Vector Summarizations and Norms
+
+There is a set of functions that perform
+summarizations and return norms of arrays. These functions
+operate over an array and return a single
+value. The following vector summarization and norm functions are available:
+`mult`, `add`, `sumSq`, `mean`, `l1norm`, `l2norm`, `linfnorm`.
+
+The example below uses the `mult` function,
+which multiplies all the values of an array together.
+
+[source,text]
+----
+mult(array(2,4,8))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 64
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+The vector norm functions provide different formulas for calculating vector magnitude.
+
+The example below calculates the *l2norm* of an array.
+
+[source,text]
+----
+l2norm(array(2,4,8))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 9.16515138991168
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
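+
+The other norm functions are called in the same way. For example, the `l1norm` function
+returns the sum of the absolute values of the array, which for this input is 14:
+
+[source,text]
+----
+l1norm(array(2,4,8))
+----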
+
+== Scalar Vector Math
+
+Scalar vector math functions add, subtract, multiply or divide a scalar value with every value in a vector.
+The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply`
+and `scalarDivide`.
+
+
+Below is an example of the `scalarMultiply` function, which multiplies the scalar value 3 with
+every value of an array.
+
+[source,text]
+----
+scalarMultiply(3, array(1,2,3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          3,
+          6,
+          9
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
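+
+The other scalar functions follow the same form. For example, the sketch below uses
+`scalarAdd` to add the scalar value 3 to every value of the array, producing 4, 5 and 6:
+
+[source,text]
+----
+scalarAdd(3, array(1,2,3))
+----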
+
+== Element-By-Element Vector Math
+
+Two vectors can be added, subtracted, multiplied and divided using element-by-element
+vector math functions. The element-by-element vector math functions are:
+`ebeAdd`, `ebeSubtract`, `ebeMultiply`, `ebeDivide`.
+
+The expression below performs the element-by-element subtraction of two arrays.
+
+[source,text]
+----
+ebeSubtract(array(10, 15, 20), array(1,2,3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          9,
+          13,
+          17
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 5
+      }
+    ]
+  }
+}
+----
+
+== Dot Product and Cosine Similarity
+
+The `dotProduct` and `cosineSimilarity` functions are often used as similarity measures between two
+sparse vectors. The `dotProduct` is a measure of both angle and magnitude while `cosineSimilarity`
+is a measure only of angle.
+
+Below is an example of the `dotProduct` function:
+
+[source,text]
+----
+dotProduct(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 7
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 15
+      }
+    ]
+  }
+}
+----
+
+Below is an example of the `cosineSimilarity` function:
+
+[source,text]
+----
+cosineSimilarity(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 0.5
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 7
+      }
+    ]
+  }
+}
+----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solr-ref-guide/src/vectorization.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
new file mode 100644
index 0000000..b01dcc8
--- /dev/null
+++ b/solr/solr-ref-guide/src/vectorization.adoc
@@ -0,0 +1,243 @@
+= Streams and Vectorization
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This section of the user guide explores techniques
+for retrieving streams of data from Solr and vectorizing the
+*numeric* fields.
+
+The next chapter of the user guide covers
+Text Analysis and Term Vectors, which describes how to
+vectorize *text* fields.
+
+== Streams
+
+Streaming Expressions provide a wide range of stream sources that can be used to
+retrieve data from Solr Cloud collections. Math expressions can be used
+to vectorize and analyze the result sets.
+
+Below are some of the key stream sources:
+
+* *random*: Random sampling is widely used in statistics, probability and machine learning.
+The `random` function returns a random sample of search results that match a
+query. The random samples can be vectorized and operated on by math expressions and the results
+can be used to describe and make inferences about the entire population.
+
+* *timeseries*: The `timeseries`
+expression provides fast distributed time series aggregations, which can be
+vectorized and analyzed with math expressions.
+
+* *knnSearch*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
+function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
+a distributed index. Once the nearest neighbors are retrieved they can be vectorized
+and operated on by machine learning and text mining algorithms.
+
+* *sql*: SQL is the primary query language used by data scientists. The `sql` function supports
+data retrieval using a subset of SQL which includes both full text search and
+fast distributed aggregations. The result sets can then be vectorized and operated
+on by math expressions.
+
+* *jdbc*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
+streams originating from Solr. Result sets from outside data sources can be vectorized and operated
+on by math expressions in the same manner as result sets originating from Solr.
+
+* *topic*: Messaging is an important foundational technology for large scale computing. The `topic`
+function provides publish/subscribe messaging capabilities by treating
+Solr Cloud as a distributed message queue. Topics are extremely powerful
+because they allow subscription by query. Topics can be used to support a broad set of
+use cases including bulk text mining operations and AI alerting.
+
+* *nodes*: Graph queries are frequently used by recommendation engines and are an important
+machine learning tool. The `nodes` function provides fast, distributed, breadth-first
+graph traversal over documents in a Solr Cloud collection. The node sets collected
+by the `nodes` function can be operated on by statistical and machine learning expressions to
+gain more insight into the graph.
+
+* *search*: Ranked search results are a powerful tool for finding the most relevant
+documents from a large document corpus. The `search` expression
+returns the top N ranked search results that match any
+Solr query, including geo-spatial queries. The smaller set of relevant
+documents can then be explored with statistical, machine learning and
+text mining expressions to gather insights about the data set.
+
+== Assigning Streams to Variables
+
+The output of any streaming expression can be assigned to a variable.
+Below is a very simple example using the `random` function to fetch
+three random samples from collection1. The random samples are returned
+as *tuples*, which contain name/value pairs.
+
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "a": [
+          {
+            "price_f": 0.7927976
+          },
+          {
+            "price_f": 0.060795486
+          },
+          {
+            "price_f": 0.55128294
+          }
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 11
+      }
+    ]
+  }
+}
+----
+
+== Creating a Vector with the `col` Function
+
+The `col` function iterates over a list of tuples and copies the values
+from a specific column into an *array*.
+
+The output of the `col` function is a numeric array that can be assigned to a
+variable and operated on by math expressions.
+
+Below is an example of the `col` function:
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
+    b=col(a, price_f))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "b": [
+          0.42105234,
+          0.85237443,
+          0.7566981
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 9
+      }
+    ]
+  }
+}
+----
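The copy performed by `col` can be sketched in plain Python, independent of Solr; the tuple list below stands in for the output of the `random` stream:

```python
# Tuples as returned by a stream source such as random();
# the values are the ones from the example response above.
tuples = [
    {"price_f": 0.42105234},
    {"price_f": 0.85237443},
    {"price_f": 0.7566981},
]

# col(a, price_f): copy one field from each tuple into an array.
prices = [t["price_f"] for t in tuples]
print(prices)  # [0.42105234, 0.85237443, 0.7566981]
```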
+
+== Applying Math Expressions to the Vector
+
+Once a vector has been created, any math expression that operates on vectors
+can be applied. In the example below the `mean` function is applied to
+the vector assigned to variable *b*.
+
+[source,text]
+----
+let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
+    b=col(a, price_f),
+    c=mean(b))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": 0.5016035594638814
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 306
+      }
+    ]
+  }
+}
+----
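The `mean` step itself is an ordinary arithmetic average. Below is a Solr-independent sketch; the four sample values are illustrative stand-ins for the vectorized *price_f* field:

```python
# Stand-in for the vector produced by col(a, price_f).
prices = [0.79, 0.06, 0.55, 0.41]

# mean(b): the arithmetic average of the vector.
mean_price = sum(prices) / len(prices)
print(mean_price)  # ~0.4525
```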
+
+== Creating Matrices
+
+Matrices can be created by vectorizing multiple numeric fields
+and adding the vectors as rows of a matrix. The matrices can then be operated on by
+any math expression that operates on matrices.
+
+Note that this section deals with the creation of matrices
+from numeric data. The next chapter of the user guide covers
+Text Analysis and Term Vectors, which describes how to build TF-IDF
+term vector matrices from text fields.
+
+Below is a simple example where four random samples are taken
+from different sub-populations in the data. The *price_f* field of
+each random sample is vectorized and the vectors are added as rows
+to a matrix. Then the `sumRows` function is applied to the matrix
+to return a vector containing the sum of each row.
+
+[source,text]
+----
+let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
+    b=random(collection1, q="market:B", rows="5000", fl="price_f"),
+    c=random(collection1, q="market:C", rows="5000", fl="price_f"),
+    d=random(collection1, q="market:D", rows="5000", fl="price_f"),
+    e=col(a, price_f),
+    f=col(b, price_f),
+    g=col(c, price_f),
+    h=col(d, price_f),
+    i=matrix(e, f, g, h),
+    j=sumRows(i))
+----
+
+When this expression is sent to the /stream handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "j": [
+          154390.1293375,
+          167434.89453,
+          159293.258493,
+          149773.42769
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 9
+      }
+    ]
+  }
+}
+----
\ No newline at end of file
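The matrix construction and row summation in the example above can be sketched without Solr. The four short vectors below stand in for the 5,000-element samples; `m` and `row_sums` mirror the `matrix` and `sumRows` steps:

```python
# Each list stands in for col(<sample>, price_f).
e = [0.5, 0.7, 0.2]
f = [0.9, 0.1, 0.4]
g = [0.3, 0.8, 0.6]
h = [0.2, 0.2, 0.5]

# matrix(e, f, g, h): each vector becomes a row of the matrix.
m = [e, f, g, h]

# sumRows(m): sum each row, returning one value per row.
row_sums = [sum(row) for row in m]
print(row_sums)  # ~[1.4, 1.4, 1.7, 0.9]
```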

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/20dfd125/solr/solrj/src/java/org/apache/solr/client/solrj/io/eval/FieldValueEvaluator.java
----------------------------------------------------------------------
diff --git a/solr/solrj/src/java/org/apache/solr/client/solrj/io/eval/FieldValueEvaluator.java b/solr/solrj/src/java/org/apache/solr/client/solrj/io/eval/FieldValueEvaluator.java
index fac4274..a12a74e 100644
--- a/solr/solrj/src/java/org/apache/solr/client/solrj/io/eval/FieldValueEvaluator.java
+++ b/solr/solrj/src/java/org/apache/solr/client/solrj/io/eval/FieldValueEvaluator.java
@@ -31,10 +31,12 @@ public class FieldValueEvaluator extends SourceEvaluator {
   private static final long serialVersionUID = 1L;
   
   private String fieldName;
+  private boolean literal;
   
   public FieldValueEvaluator(String fieldName) {
-    if(fieldName.startsWith("'") && fieldName.endsWith("'") && fieldName.length() > 1){
+    if(fieldName.startsWith("\"") && fieldName.endsWith("\"") && fieldName.length() > 1){
       fieldName = fieldName.substring(1, fieldName.length() - 1);
+      literal = true;
     }
     
     this.fieldName = fieldName;
@@ -42,6 +44,10 @@ public class FieldValueEvaluator extends SourceEvaluator {
   
   @Override
   public Object evaluate(Tuple tuple) throws IOException {
+    if(literal) {
+      return fieldName;
+    }
+
     Object value = tuple.get(fieldName);
     
     // This is somewhat radical.
@@ -84,10 +90,6 @@ public class FieldValueEvaluator extends SourceEvaluator {
       }
     }
 
-    if(value == null) {
-      return fieldName;
-    }
-
     return value;
   }
   

