HAWQ-49. Remove madlib related codes and tests.
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq/commit/f58baae0
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq/tree/f58baae0
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq/diff/f58baae0
Branch: refs/heads/master
Commit: f58baae0194fb94fc83b1dc7b8e3c7f5994ad115
Parents: 5199982
Author: Ming LI <mli@pivotal.io>
Authored: Mon Feb 15 15:39:07 2016 +0800
Committer: Ming LI <mli@pivotal.io>
Committed: Tue Feb 16 09:31:57 2016 +0800

GNUmakefile.in                                  |    4 -
contrib/gp_sparse_vector/Makefile               |   38 -
contrib/gp_sparse_vector/README                 |  132 -
contrib/gp_sparse_vector/README.bitmap          |  259 -
contrib/gp_sparse_vector/README.devel           |   63 -
contrib/gp_sparse_vector/README.update          |   72 -
contrib/gp_sparse_vector/README_INSTALL         |    9 -
contrib/gp_sparse_vector/README_term_proximity  |  118 -
contrib/gp_sparse_vector/SparseData.c           | 1572 -
contrib/gp_sparse_vector/SparseData.h           |  234 -
contrib/gp_sparse_vector/bugs                   |  122 -
contrib/gp_sparse_vector/float_specials.h       |  171 -
contrib/gp_sparse_vector/operators.c            |  939 -
contrib/gp_sparse_vector/sfv_test_output        |  141 -
contrib/gp_sparse_vector/sparse_vector.c        |  411 -
contrib/gp_sparse_vector/sparse_vector.h        |  175 -
contrib/gp_sparse_vector/test_output            |  296 -
pom.xml                                         |    2 -
src/backend/catalog/Makefile                    |    2 +-
src/backend/catalog/madlib.sql                  |  492 -
src/test/regress/expected/madlib_svec_test.out  |  236 -
src/test/regress/known_good_schedule            |    2 -
src/test/regress/sql/madlib_svec_test.sql       |   56 -
src/tools/msvc/Mkvcbuild.pm                     |    2 +-
24 files changed, 2 insertions(+), 5546 deletions(-)

http://git-wip-us.apache.org/repos/asf/incubator-hawq/blob/f58baae0/GNUmakefile.in

diff --git a/GNUmakefile.in b/GNUmakefile.in
index 40e887b..f20ee2c 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -14,7 +14,6 @@ all:
 	$(MAKE) -C config all
 	$(MAKE) -C contrib/formatter_fixedwidth all
 	$(MAKE) -C contrib/extprotocol all
-	$(MAKE) -C contrib/gp_sparse_vector all
 	$(MAKE) -C contrib/orafce all
 	$(MAKE) -C contrib/gp_cancel_query all
 	$(MAKE) -C contrib/hawq-hadoop all
@@ -27,7 +26,6 @@ install:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@
@@ -40,7 +38,6 @@ installdirs uninstall:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@
@@ -60,7 +57,6 @@ clean:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@
http://git-wip-us.apache.org/repos/asf/incubator-hawq/blob/f58baae0/contrib/gp_sparse_vector/Makefile

diff --git a/contrib/gp_sparse_vector/Makefile b/contrib/gp_sparse_vector/Makefile
deleted file mode 100644
index 50468be..0000000
--- a/contrib/gp_sparse_vector/Makefile
+++ /dev/null
@@ -1,38 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Intel compiler
ifeq "$(CC)" "icc"
  USE_ICC = 1
endif

override CFLAGS += -std=gnu99

MODULE_big = gp_svec
OBJS = sparse_vector.o operators.o SparseData.o
ifdef USE_PGXS
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
else
  subdir = contrib/gp_sparse_vector
  top_builddir = ../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk
endif
ifdef USE_ICC
  override CFLAGS = -O3 -Werror -std=c99 -vec-report2 -vec-threshold0
endif
http://git-wip-us.apache.org/repos/asf/incubator-hawq/blob/f58baae0/contrib/gp_sparse_vector/README

diff --git a/contrib/gp_sparse_vector/README b/contrib/gp_sparse_vector/README
deleted file mode 100644
index a05985e..0000000
--- a/contrib/gp_sparse_vector/README
+++ /dev/null
@@ -1,132 +0,0 @@

 Sparse Vector Datatype

====================
Background
====================

When you use arrays of floating point numbers for various calculations, you will often have
long runs of zeros. This is common in many kinds of applications such as scientific, retail
optimization and text processing. Each floating point number takes 8 bytes of storage in
memory and/or disk, so storing all of those zeros is often impractical. There are also many
computations that benefit from skipping over the zeros.

Say for example that we have the following array of doubles in Postgres stored as a "float8[]":
 '{0, 33,...40,000 zeros..., 12, 22 }'::float8[]

This array would occupy slightly more than 320KB of memory/disk, most of it zeros. Also,
if we perform various operations on this, we'll be doing work on 40,001 fields that aren't
important. This kind of array arises often in text processing, where the dictionary may have
40-100K terms and each vector will store the number of words in a particular document.

We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap for null values,
but since it is not optimized for float8[] or for long runs of zeros instead of nulls, and
the bitmap is not RLE compressed, it is a poor choice for this use-case. If we stored the
zeros as NULL in the Postgres/GP array, then the bitmap for nulls would have 5KB of bits in
it to mark the nulls, which is not nearly as efficient as we'd like.

A simple RLE scheme that is biased toward efficient storage of zero-value bitmaps results in
only 6 bytes of bitmap storage.

We also want to implement vector operators for our type that take advantage of an efficient
sparse storage format to make computations faster.
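As an illustrative sketch of the idea (Python here for brevity; this mirrors the concept behind svec, not the module's actual on-disk layout), run-length encoding collapses the long zero runs described above:

```python
def rle_compress(values):
    """Return (run_length, value) pairs for consecutive equal values."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [(n, v) for n, v in runs]

def rle_expand(runs):
    """Inverse of rle_compress: rebuild the dense vector."""
    out = []
    for n, v in runs:
        out.extend([v] * n)
    return out

# The 40,004-element vector from the text compresses to just 5 runs.
vec = [0.0, 33.0] + [0.0] * 40000 + [12.0, 22.0]
runs = rle_compress(vec)
assert len(runs) == 5
assert rle_expand(runs) == vec   # lossless round trip
```

Instead of 40,004 stored doubles, the compressed form carries one `(40000, 0.0)` run plus four single-element runs.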

====================
What is it?
====================

The "Sparse Vector Datatype" is named "svec" and implements a vector with compressed storage
of zeros. You can take the above vector and cast it into an svec like this:
 ('{0, 33,...40,000 zeros..., 12, 22 }'::float8[])::svec

This is the same vector we input (I only used 5 zeros).

We can use operations with the svec type like <, >, *, **, /, =, +, SUM, COUNT_VEC, etc., and
they have the meanings associated with typical vector operations. The plus (+) operator adds
each of the terms of two vectors of the same dimension together. For instance, if we have a =
{0,1,5} and b = {4,3,2}, then a+b = {4,4,7}. We can see this using the svec type like this:
 lukelonergan=# select ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
  float8
 ---------
  {4,4,7}

A vector dot product (*) between these two will result in a scalar result of type float8.
The dot product should be (0*4+1*3+5*2)=13, like this:
 lukelonergan=# select '{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec;
  ?column?
 ----------
        13

A scalar product (**) between these two will result in the individual terms being multiplied.
The scalar product should be {0*4,1*3,5*2}={0,3,10}, like this:
 lukelonergan=# select ('{0,1,5}'::float8[]::svec ** '{4,3,2}'::float8[]::svec)::float8[];
  float8
 ----------
  {0,3,10}

Special vector aggregate functions are also useful. SUM is self-explanatory. COUNT_VEC
(as opposed to COUNT) evaluates the count of nonzero terms found in a set of svecs and returns
an svec with the counts. For instance, if we have {0,1,5},{10,0,3},{0,0,3},{0,1,0}, then
COUNT_VEC() would return {1,2,3}:
 lukelonergan=# create table list (a svec);
 CREATE TABLE
 lukelonergan=# insert into list values ('{0,1,5}'::float8[]),('{10,0,3}'::float8[]),('{0,0,3}'::float8[]),('{0,1,0}'::float8[]);
 INSERT 0 4
 lukelonergan=# select COUNT_VEC(a)::float8[] from list;
  count_vec
 -----------
  {1,2,3}
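A sketch of what COUNT_VEC computes (Python, with plain lists standing in for svecs; illustrative only):

```python
def count_vec(vectors):
    """Per-position count of nonzero entries across a set of equal-length vectors."""
    return [sum(1 for v in vectors if v[i] != 0)
            for i in range(len(vectors[0]))]

# The four rows inserted into the "list" table above:
rows = [[0, 1, 5], [10, 0, 3], [0, 0, 3], [0, 1, 0]]
assert count_vec(rows) == [1, 2, 3]
```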

====================
Example
====================
For a text classification example, let's assume we have a dictionary composed of words in
a text array:
create table features (a text[][]) distributed randomly;
insert into features values ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}');
We have a set of documents, also defined as arrays of words:
 lukelonergan=# create table documents(a int,b text[]);
 CREATE TABLE
 lukelonergan=# insert into documents values (1,'{this,is,one,document,in,the,corpus}');
 lukelonergan=# insert into documents values (2,'{i,am,the,second,document,in,the,corpus}');
 lukelonergan=# insert into documents values (3,'{being,third,never,really,bothered,me,until,now}');
 lukelonergan=# insert into documents values (4,'{the,document,before,me,is,the,third,document}');

Now that we have a dictionary and some documents, we would like to do some document classification
using vector arithmetic on word counts and proportions of dictionary words in each document.

To start this process, we'll need to find the dictionary words in each document. We'll prepare
what is called a Sparse Feature Vector or SFV for each document. An SFV is a vector of dimension
N, where N is the number of dictionary words, and in each SFV is a count of each dictionary
word in the document.

Inside the sparse vector toolset, we have a function that will create an SFV from a document,
so we can just do this:
 lukelonergan=# select gp_extract_feature_histogram((select a from features limit 1),b)::float8[]
from documents;
          gp_extract_feature_histogram
 ------------------------------------------
  {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
  {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
  {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
  {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}

Note that the output of gp_extract_feature_histogram is an svec for each document containing
the count of each of the dictionary words in the ordinal positions of the dictionary. This
can more easily be understood by doing this:
 lukelonergan=# select gp_extract_feature_histogram((select a from features limit 1),b)::float8[],b
from documents;
        gp_extract_feature_histogram        |                        b
 -------------------------------------------+--------------------------------------------------
  {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}   | {i,am,the,second,document,in,the,corpus}
  {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}   | {the,document,before,me,is,the,third,document}
  {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}   | {this,is,one,document,in,the,corpus}
  {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}   | {being,third,never,really,bothered,me,until,now}

 lukelonergan=# select * from features;
                                                     a
 ---------------------------------------------------------------------------------------------------------
  {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}

Now when we look at the first document "i am the second document in the corpus", its SFV
is {1,3*0,1,1,1,1,6*0,1,2}. The word "am" is the first ordinate in the dictionary and there
is 1 instance of it in the SFV. "before" has no instances in the document, so its value is
"0" and so on.

gp_extract_feature_histogram is very speed optimized - in essence it is a single-routine version
of a hash join - and should be able to process large numbers of documents into their SFVs in
parallel at the highest possible speeds.
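The hash-join idea can be sketched in a few lines (Python; an illustrative stand-in for the C implementation, using the example dictionary and documents above):

```python
dictionary = ["am", "before", "being", "bothered", "corpus", "document", "i",
              "in", "is", "me", "never", "now", "one", "really", "second",
              "the", "third", "this", "until"]

def extract_feature_histogram(dictionary, doc_words):
    """Count each dictionary word in a document, in dictionary order."""
    pos = {w: i for i, w in enumerate(dictionary)}  # the hash side of the "join"
    sfv = [0] * len(dictionary)
    for w in doc_words:
        if w in pos:
            sfv[pos[w]] += 1
    return sfv

doc2 = ["i", "am", "the", "second", "document", "in", "the", "corpus"]
assert extract_feature_histogram(dictionary, doc2) == \
    [1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0]
```

This reproduces the SFV shown above for document 2, including the count of 2 for "the".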

The rest of the classification process is all vector math. As Brian Dolan describes the
process:
 The actual count is hardly ever used. Instead, it's turned into a weight. The most common
weight is called tf/idf for "Term Frequency / Inverse Document Frequency".
 The calculation for a given term in a given document is {#Times in this document} * log
{#Documents / #Documents the term appears in}
 For instance, the term "document" in document A would have weight 1 * log (4/3). In document
D it would have weight 2 * log (4/3).
 Terms that appear in every document would have tf/idf weight 0, since log (4/4) = log(1)
= 0. (Our example has no term like that.) That usually sends a lot of values to 0.

For this part of the processing, we'll need to have a sparse vector of the dictionary dimension
(19) with the values log(#documents / #documents each term appears in). There will be one
such vector for the whole list of documents (aka the "corpus"). The #documents is just a
count of all of the documents, in this case 4, but there is one divisor for each dictionary
word, and its value is the count of the documents in which that word appears. This
single vector for the whole corpus can then be scalar-product multiplied by each document
SFV to produce the Term Frequency/Inverse Document Frequency.
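The weight formula can be checked directly against the worked example (Python sketch using the four documents above; the helper names are illustrative):

```python
import math

docs = [
    ["this", "is", "one", "document", "in", "the", "corpus"],
    ["i", "am", "the", "second", "document", "in", "the", "corpus"],
    ["being", "third", "never", "really", "bothered", "me", "until", "now"],
    ["the", "document", "before", "me", "is", "the", "third", "document"],
]
n_docs = len(docs)

def idf(term):
    """log(#documents / #documents the term appears in)"""
    df = sum(term in d for d in docs)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    """{#times in this document} * idf(term)"""
    return doc.count(term) * idf(term)

# "document" appears in 3 of the 4 documents:
assert abs(tf_idf("document", docs[0]) - 1 * math.log(4 / 3)) < 1e-12
assert abs(tf_idf("document", docs[3]) - 2 * math.log(4 / 3)) < 1e-12
```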

This can be done as follows:
 lukelonergan=# create table corpus as (select a,gp_extract_feature_histogram((select a from
features limit 1),b) sfv from documents);
 lukelonergan=# SELECT docnum, (a*logidf) tf_idf FROM (SELECT log(count(a)/count_vec(a))
logidf FROM corpus) foo, corpus ORDER BY docnum;
  docnum | tf_idf
 --------+----------------------------------------------------------------------------------------------------
  1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0}
  2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0}
  3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989}
  4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0}

We can now get the "angular distance" between one document and the rest of the documents
using the ACOS of the dot product of the document vectors:
 lukelonergan=# CREATE TABLE weights AS (SELECT docnum, (a*logidf) tf_idf FROM (SELECT log(count(a)/count_vec(a))
logidf FROM corpus) foo, corpus ORDER BY docnum) DISTRIBUTED RANDOMLY;
 -- Calculate the angular distance from the first document to each other document
 lukelonergan=# SELECT docnum,180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(svec_l2norm(tf_idf)*svec_l2norm(testdoc))))/3.141592654)
angular_distance FROM weights,(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1)
foo ORDER BY 1;
  docnum | angular_distance
 --------+------------------
  1 |                0
  2 | 78.8235846096986
  3 | 89.9999999882484
  4 | 80.0232034288617

We can see that the angular distance between document 1 and itself is 0 degrees and between
document 1 and 3 is 90 degrees because they share no features at all.
http://git-wip-us.apache.org/repos/asf/incubator-hawq/blob/f58baae0/contrib/gp_sparse_vector/README.bitmap

diff --git a/contrib/gp_sparse_vector/README.bitmap b/contrib/gp_sparse_vector/README.bitmap
deleted file mode 100644
index 901a3f5..0000000
--- a/contrib/gp_sparse_vector/README.bitmap
+++ /dev/null
@@ -1,259 +0,0 @@
Revision: 02/03/10

I’m spending most of my part time on working out the actual mechanics (with examples) of
the matrix/vector implementation.

What will help is to get some input on likely use-cases that can drive the work – though
I’ve got a good number of examples from scientific and econometric modeling, I could always
benefit from something concrete and immediate.

Following is more on mechanics.

Another finding on indexes, RLE compression and matrices:
1) input matrices come in various flavors, many of them in non-sorted “not ready for sparse
compression” formats
2) in many circumstances there is no a priori choice of whether there are zeros or NVPs
3) Because of (1), in the input processing and sorting, it is necessary to create an uncompressed
index that is identical to the single RLE index I’ve described previously, sans the special
values.

I’ve described the input processing here in a program I wrote to process matrices from
the Matrix Mart at NIST:
 /* The Matrix Market format does not specify the order in which the
 * successive elements are stored. We need to compress the matrix into
 * one of CRS or CCS format (row or column compression), so we need to
 * sort the matrix into one of those forms.
 *
 * Here we'll use quicksort to sort the matrix elements into row
 * contiguous storage.
 *
 * If we desire row contiguous storage, we'll also need the column order
 * within the row preserved. We'll do this by computing the index into
 * the matrix using a "columns progressive" ordering, which is computed
 * as NJ * I + J. Then we'll sort based on that index. The index value
 * will range from 0 to NJ*NI-1. For any index value, you can compute I
 * and J using:
 * I = index/NJ
 * J = index - NJ * I
 *
 * If we sort the matrix by this index, we'll have things in row/column
 * order, suitable for CRS compression.
 *
 * Note that if we want CCS, we can construct the index using:
 * index = NI * J + I
 */
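The index round trip described in the comment can be checked directly (Python sketch; `NI`/`NJ` are hypothetical sizes):

```python
NI, NJ = 4, 5   # rows, columns (illustrative sizes)

def crs_index(i, j):
    """Columns-progressive index: NJ * I + J."""
    return NJ * i + j

def from_index(index):
    """Recover (I, J) from the index, as in the comment above."""
    i = index // NJ
    j = index - NJ * i
    return i, j

for i in range(NI):
    for j in range(NJ):
        assert from_index(crs_index(i, j)) == (i, j)

# Sorting elements by this index yields row-contiguous (CRS) order:
elems = [(2, 1), (0, 3), (1, 0)]
assert sorted(elems, key=lambda ij: crs_index(*ij)) == [(0, 3), (1, 0), (2, 1)]
```

For CCS (column-contiguous) order, the same round trip works with `NI * J + I`.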

In the Yale/MKL formats, we deconstruct the single index into a set of two values for each
run and that is stored as the compression index.

I’m thinking now that for the RLE index we can do something along these lines:
1) Treat all values symmetrically, that is: all runs of the various types we’re discussing
(-inf, +inf, 0, NVP, value) are described by the same kind of word
2) Each RLE word is composed of a leading set of descriptor bits, followed by a payload describing
how many of the value are found.

So if we pick the descriptor bits = 3, we can represent 8 “things”, of which 5 things
must be “what kind of value is this?”, leaving 3 things. A good use of the 3 things can
be word length, one of 8, 16 or 32.

So if we go down this path, we’re left with 5, 13 and 29 bits to represent a run and can
dynamically choose which one to use based on the data content. That’s 32, 8192 and 536,870,912
values in each run, which is perfect for capturing various kinds of sparsity.
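A quick arithmetic check of those run capacities (Python; purely illustrative):

```python
# With 3 descriptor bits, an 8/16/32-bit RLE word leaves 5/13/29 payload
# bits, so the maximum run length per word is 2**payload_bits.
capacities = {bits: 2 ** (bits - 3) for bits in (8, 16, 32)}
assert capacities == {8: 32, 16: 8192, 32: 536870912}
```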

If we expand to one more descriptor bit, then we can represent 8 more things, which could
be used to recognize additional types of values to compress out. That would leave us with
16, 4096 and 268,435,456 values in each run, which also works pretty well and leaves us room
for the future in case we find there are more values than {-inf, +inf, 0, NVP, value}.

Some additional results to share (see table below) – I did a computation of the various
storage and compute requirements for the MKL/Yale index and the RLE index. I used 5 matrix
examples from NIST’s “Matrix Market”, read them in and translated them from Matrix Market
format to MKL/RLE which required sorting them into CRS. From this, I calculated the relevant
stats on each in terms of storage and number of words in each of the RLE and MKL indices.

Bottom line is that the results corroborated my earlier calculations. The RLE index is a
lot smaller than MKL always, and is more computationally efficient for linear scans through
the compressed matrix. This means we should use only RLE for most use-cases and calculate
MKL only when we need to use the algorithms in the matrix solver libraries. We can optionally
store the MKL index if we calculate it – we can provide a marker to indicate the presence/absence
of it in the type.

I’ve also thought through the compatibility with all the sparse solver libs – seems like
we have easy compatibility with a CRS-based approach, so I think that should be the baseline.
We can provide another marker to indicate what is being stored – CRS, CCS, CDS, etc.

I believe I have enough information to produce a draft design of the matrix type that we
can review and get some other feedback about. Here is a sketch of how I think it should work:
Two separate types, one for vectors and one for matrices – but they share the same format
and support routines. The separate types allow definition of vector- and matrix-specific
operators.
The format stores the following metadata about the datum:
 nD, (n(i),i=1,nD): dimension and rank – for a vector, this is {1,N}; for a matrix it’s
{2,Nrows,Ncolumns}
 nZ: number of values (nonspecial values)
 valType: type of value, one of {bit, byte, int4, int8, real4, complex4, real8, complex8}
 storType: type of storage, one of {CRS, CCS, CDS, ... Up to maybe 32 total with some
reserved}
 specValList: a list of special values to be used in the compression, up to 8, e.g. {-inf,+inf,NVP,0}
 hasMKL: if true, the MKL index is stored
The format stores the following data:
 values: a dense array of values, size is (nZ*sizeof(valType))
 rleIndex: an RLE index containing location of all values (including specials)
 sizeofRleIndex: the size of the RLE index for allocation purposes
 if (hasMKL) mklIndex: the MKL/Yale index, size is ((nZ+NI)*sizeof(int))


In order to efficiently compress the storage of multiple data types, we are
using a bitmap that describes the positions of four special 64-bit floating
point values: zero, negative infinity, positive infinity, and "no value present"
or NVP. Of these, the first three are all standard floating point values that
can be the natural result of performing calculations. NVP is a special case
often used in statistics and other analytical processing but is not
representable as a standard floating point number.

Here we're using two patterns for sparse storage and computation:
1) recognition of compressible types and subsequent bitmap compression
2) optimized computational usage of previously compressed values

(1) recognizes that we're going to have many situations where special values
arise as a natural consequence of computation. In these cases, we will
automatically locate and compress special values into a special bitmap storage
format.

(2) leverages the preexisting compressed special value storage for optimized
computation. For example, if we have a vector of values V and a large fraction
of the values in V are zero, then computing "W = 1/V" will produce a
correspondingly large number of positive infinity values in W. With our
compressed bitmap storage of zeros, we can loop over the compressed zero
elements and change their storage type from "zero" to "infinity", thereby
saving the computation, memory and recompression overheads.

It's important to have a bitmap format that is both storage efficient and
optimal for computation. There are many existing formats for sparse vector and
matrix storage, most of them developed for scientific computation. The Intel
Math Kernel Library (MKL) specifies the most common sparse matrix storage
format, derived from the Yale Sparse Matrix format:

 "The Intel MKL direct sparse solvers use a row-major upper triangular storage
format: the matrix is compressed row-by-row and for symmetric matrices only
nonzero elements in the upper triangular half of the matrix are stored."

 "The Intel MKL sparse matrix storage format for direct sparse solvers is
specified by three arrays: values, columns, and rowIndex. The following table
describes the arrays in terms of the values, row, and column positions of the
nonzero elements in a sparse matrix.
 values   - A real or complex array that contains the nonzero elements
            of a sparse matrix. The nonzero elements are mapped into the
            values array using the row-major upper triangular storage
            mapping described above.
 columns  - Element I of the integer array columns is the number of the
            column that contains the I-th element in the values array.
 rowIndex - Element j of the integer array rowIndex gives the index of
            the element in the values array that is the first nonzero
            element in row j.

 The length of the values and columns arrays is equal to the number of nonzero
elements in the matrix."

 Intel Math Kernel Library Reference Manual, March 2009, Document Number:
 630813031US

The MKL format stores the nonzero values compactly in a sequential array of
either complex or double precision entries and maintains two index arrays that
identify the location of the nonzero values in a two-dimensional matrix. The
advantages of the MKL format are its efficiency for computation and the fact that
it is in widespread usage in numerical libraries.

The uncompressed size of a real matrix with NI rows and NJ columns would be:
 Raw size = (8 bytes) * ( NI*NJ )

The storage required by the MKL format for a real matrix with N nonzero values
and NI number of rows is:
 Comp Size = (8 bytes) * ( 2N + NI )

For a complex matrix it's:
 Comp Size = (8 bytes) * ( 3N + NI )

Another way of representing the storage is to assess the "overhead" associated
with the representation of the zero entries, which is the same for real and
complex:
 Overhead = (8 bytes) * ( N + NI )

Now we can calculate the "savings" associated with using the MKL compressed
format for a real matrix:
 Savings = (Raw Size) - (Comp Size)
         = (8 bytes) * ( NI*NJ - 2N - NI )

If we have a dense matrix, then N will be NI*NJ and the savings is equal to the
negative overhead, which means we'll have a worst case savings of:
 Dense (worst case) savings = -(8 bytes)*(NI*NJ + NI)
                            = -(8 bytes)*NI*(NJ+1)

For a dense matrix, this is a 100%+ storage penalty.

There is one important shortcoming of the MKL format: it can only represent one
type of special value - zero. This is a consequence of the implicit nature of
the approach: the location of the zero values is not stored, it is implied by
the lack of a nonzero value in that matrix position. There is no place to
provide an attribute of the special value so there can be only one type of
special value.

We need an alternative format that can represent different types of special
values. It will need to refer to the special values explicitly in an efficient
manner. We can take advantage of the nature of most all sparse storage
circumstances: the special values occur in relatively large runs compared to
the nonzero values. We can use a "run length encoding" or RLE approach to
compose a compressed special value array, in addition to storing a regular
packed array of the nonspecial values. The index location of the nonspecial
values can be inferred from the RLE array of special values during computation.

Let's represent the array of special values using a 16-bit word to represent
each run of special values, and an 8-bit word for each run of nonspecial
values. Each of the two word types uses the first bit (the sign bit) to
represent the run as consisting of special values (positive) or nonspecial
values (negative). Each special value word uses the next 2 bits to describe what
kind of special values are in the run, one of zero, -inf, +inf and NVP.

A complete array will be represented as a collection of these two types of
words, terminated by a special "end of array" word consisting of 16 bits of
zeros (0x0000). For example:
 <special values><non special values><special values><0x0000>

This can be seen in bytes form:
 <2 bytes><1 byte><2 bytes><2 bytes>

And if the special values are zeros (type 0), with runs of 8192 in each of the
two runs with a run of 128 nonspecial values, the bits in this array would be:
 <0001111111111111><11111111><0001111111111111><0000000000000000>
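A hypothetical packing of the 16-bit special-value run word sketched above (Python; the type codes and helper names are illustrative, not the module's actual encoding):

```python
# bit 15 (sign) = 0 for a special-value run, bits 14-13 select the special
# value, bits 12-0 hold the run length (so one word covers up to 8191 values).
SPECIAL = {"zero": 0, "neg_inf": 1, "pos_inf": 2, "nvp": 3}

def pack_special_run(kind, length):
    if not 1 <= length <= 0x1FFF:        # 13-bit run-length field
        raise ValueError("run too long for a single word")
    return (SPECIAL[kind] << 13) | length

def unpack_special_run(word):
    return (word >> 13) & 0x3, word & 0x1FFF

word = pack_special_run("zero", 8191)
assert unpack_special_run(word) == (0, 8191)
# Matches the zero-run bit pattern shown above:
assert format(word, "016b") == "0001111111111111"
```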

Note that if there are typically a large number of zeros followed by only one
nonzero value, then this representation will waste a lot of space in the
nonspecial value words. We could shorten the representation of nonspecial
values to say 5 bits, which would allow for runs of 16 to be represented by
each word and we would save three bits per run. However, the alignment of the
index computations would be less than 1 byte, which would likely reduce the
computational efficiency.

If a given run has more than the maximum representable by the word, words are
concatenated as needed to represent the run.

For matrices we can use the compressed row format, or the compressed column
format - for the purposes of the efficiency discussion let's assume compressed rows.

The storage efficiency of this approach is more complicated to compute, as the
storage will depend on the dispersion of values. We can compute best case and
worst case scenarios however.

A reasonable "best case" scenario will have one nonspecial value per row. In
this case the storage required by the RLE format for a real matrix with N nonzero values,
NI rows and NJ columns is:
 Comp Size = (9 bytes) * N + (2 bytes) *(NJ*(NI/8192) +1)
 = (9 bytes) *NI + (2 bytes) *(NJ*(NI/8192) +1)
 for the common case of NI < 8192:
 = (9 bytes) *NI + (2 bytes) *(NJ +1)

In this case, the storage is O(NI+NJ). For instance, if we have a matrix of
size 2,000 by 2,000, the storage is (8 bytes) * 4,000 = 32KB.

For a worst case scenario, we can assume alternating nonspecial and special
values, in which case the storage is:
 Comp Size = (9 bytes) * N + (2 bytes) * (NJ*(NI/2) +1)
 = (9 bytes) *NJ*(NI/2) + (2 bytes) * (NJ*(NI/2) +1)
 = (11 bytes) *NJ*(NI/2) + (2 bytes)

What's interesting is that this worst case size is actually smaller (by almost
half) than storing the full dense matrix. This is due to the more efficient
representation of the special values in 2 byte format. The overhead is nearly
equivalent to storing the special values in addition to the nonspecial values.

If we have a dense matrix, there is almost zero overhead because the special
value array is blank and we only store the nonspecial data values.

If we compare the MKL approach savings to the RLE approach in the best case
scenario, we have an RLE savings of:
 RLE:
 Savings = (Raw Size) - (Comp Size)
         = (8 bytes)*NI*NJ - (9 bytes)*NI - (2 bytes)*(NJ +1)
 for NI ~ NJ:
         ~ (8 bytes)*NI*NJ - (11 bytes)*NI
         ~ (8 bytes)*NI*(NJ-1)

 MKL:
 Savings = (8 bytes) * ( NI*NJ - 3NI )
         = (8 bytes)*NI*(NJ-3)

The savings in the best case is nearly the same. In the worst case scenario of
a dense matrix, the savings of the RLE approach is nearly 0, which is the best
result possible. For MKL, however, the worst case incurs more than
a 100% storage penalty.

Most cases are somewhere between best case and worst case. We can conclude
that the storage of the RLE scheme is going to be more efficient than MKL in
all cases.
http://git-wip-us.apache.org/repos/asf/incubator-hawq/blob/f58baae0/contrib/gp_sparse_vector/README.devel

diff --git a/contrib/gp_sparse_vector/README.devel b/contrib/gp_sparse_vector/README.devel
deleted file mode 100644
index ed39616..0000000
--- a/contrib/gp_sparse_vector/README.devel
+++ /dev/null
@@ -1,63 +0,0 @@
================================================================================
Structure of the sparse type support
================================================================================

There are two parts to the sparse types:
1) an inmemory data structure and computational library for sparse data called
SparseData.

2) a Greenplum/Postgres datatype for each case, initially a sparse vector which
uses only (double) as its data

Eventually there will be sparse vector support for other base types like float,
complex double, integer, char and bit. There will also be matrix support,
which will use the SparseData structure to compose higher dimension structures.

=========================
Design
=========================

The Sparse Vector type is a variable-length Postgres datatype, which means that
its data is stored in a "varlena". A varlena is a structure that has an integer
that identifies the length of its data area as its first item.

For the Sparse Vector, the varlena has a leading integer that describes the size
or "dimension" of the vector it represents, followed by a serialized SparseData
of type FLOAT8OID.

All of the operators required for the datatype support are wrappers that use the
SparseData operators. In order to use a SparseData operator, the serialized
SparseData inside the Sparse Vector is used to create an in-place accessible
SparseData (no copy required), and then this is passed to the SparseData
functions.

If you want to implement a new operator or function for Sparse Vector, you should
try to implement it inside of operators.c using existing SparseData functions or
building on other Sparse Vector functions first. If something isn't there, you
can build a base level function inside of the SparseData.h as an inline function
or inside SparseData.c as a non-inline function, then wrap it in a Sparse Vector
function in operators.c.

The sparse_vector.c file is for creation and management functions related to the
type.

=========================
Code structure
=========================

SparseData structure and functions:

SparseData.c, SparseData.h

Sparse vector datatype:

sparse_vector.c, sparse_vector.h, operators.c

Functions for text processing (Sparse Feature Vector or SFV):

gp_sfv.c

Definitions for IEEE double and float types including special denormalized
types defined for use herein as "no value present":

float_specials.h
http://gitwipus.apache.org/repos/asf/incubatorhawq/blob/f58baae0/contrib/gp_sparse_vector/README.update

diff git a/contrib/gp_sparse_vector/README.update b/contrib/gp_sparse_vector/README.update
deleted file mode 100644
index c7d84d4..0000000
 a/contrib/gp_sparse_vector/README.update
+++ /dev/null
@@ 1,72 +0,0 @@
On Dynamic Word Size for the RLE index
==========================================================================
I thought we'd use a method where we identify the size of the next word in
the index by leveraging the top two bits of the word to describe how large
the following word is.

This idea works if the word is laid out with the "top bits" inside the
first byte. However, little-endian systems (Intel, AMD) store the topmost
bits of their words as the *last* byte in the word, which means that we
can't tell what size the word is without reading the word. We can't
read the word without knowing the size, so this is a non-starter.

We could encode the size information into the bottom bits of the word,
but then we end up with different code paths for big endian systems and
little endian systems. This is dangerous, and means we would not have
a portable binary format between systems with different endianness.

An alternative strategy would use the entire first byte of an RLE word
to encode the following word size. A modification of this scheme would
encode the word size into the top two bits of the first byte, and if
the word size is a single byte, then the remainder of the byte would
represent the encoded length. This preserves the worst-case compression
of the previous scheme and introduces an additional overhead to the
middle compression cases equal to one byte per RLE word.

==========================================================================
A multiplatform friendly dynamic word size RLE scheme
==========================================================================
We have a sequence of numbers and we want a scheme that will store them in
the most compact format possible using binary types of 8, 16, 32 and 64 bit
widths.

Let's say we have the following sequence: {13, 2000, 2*10^9, 3*10^12}. We
can see that it would require one byte to minimally store the first number,
two to store the second, 4 to store the third and 8 for the fourth.

We can store the sequence using these minimum widths and we will have an
array with sizes (in bytes): {1,2,4,8}. However, to read this array, we
need to be able to determine the width of each word.

Here we will use a "length word" to describe the length of the storage
type for each entry. The size of the length word will always be 1 byte,
though in the case where the stored number is less than 127, we will not
use a length word; we'll just store the negated value in a single byte.
The value of the length word will be one of 1, 2, 4 or 8 to indicate the
size of the following storage width.

So our example would be stored in an array with the following word widths
and contents (in parentheses):
	{1(-13), 1(2), 2(2000), 1(4), 4(2*10^9), 1(8), 8(3*10^12)}

The total size of this array is 18 bytes, which should be compared to the
size required if an 8 byte width was used to store all words, in which
case we would need 32 bytes to store the array. This is a savings of
nearly half in this simple case. In the worst case where we are storing
numbers that all require 8 byte widths, we would be adding one byte per
word, or a 12.5% overhead. If we have lots of 1 byte words interspersed
with an occasional 8 byte word, the savings move toward an 8:1 compression
ratio.

The algorithm for reading this array is as follows:
	Read first byte "B"
	if (B < 0)
		N = (-B)
	else
		Copy a "B"-byte word into N

We have to be careful about alignment when reading the dynamic sized word
from the array  we can't just cast the next "B" bytes into the appropriate
sized type and dereference it. This would cause an unaligned access error,
which will result in an abort on some architectures and poor performance on
others.
http://gitwipus.apache.org/repos/asf/incubatorhawq/blob/f58baae0/contrib/gp_sparse_vector/README_INSTALL

diff git a/contrib/gp_sparse_vector/README_INSTALL b/contrib/gp_sparse_vector/README_INSTALL
deleted file mode 100644
index 0975541..0000000
 a/contrib/gp_sparse_vector/README_INSTALL
+++ /dev/null
@@ 1,9 +0,0 @@
Requirements:
	- Greenplum Database 3.3 or higher
	- A "C" compiler

To install:
	- Make sure that you have the Greenplum binary distribution in your path
	- type 'make install'
	- install the function and type definitions with "psql -d <your_db_name> -f gp_svec.sql"
	- <optional> run tests with "psql -d <your_db_name> -af gp_svec_test.sql | diff - test_output"
http://gitwipus.apache.org/repos/asf/incubatorhawq/blob/f58baae0/contrib/gp_sparse_vector/README_term_proximity

diff git a/contrib/gp_sparse_vector/README_term_proximity b/contrib/gp_sparse_vector/README_term_proximity
deleted file mode 100644
index 792d070..0000000
 a/contrib/gp_sparse_vector/README_term_proximity
+++ /dev/null
@@ 1,118 +0,0 @@
In finding the proximity of several terms, we first need to represent the
positions of the terms in the document. Let's say we use the following to
represent a set of term occurrences:
 doc1 term1:1 term2:2 term3:3
 doc2 term1:30 term2:10 term3:20

If we performed a search query against these two documents using the query
(term1 && term2 && term3), we would like the weight of results to reflect the
proximity of these terms. In document1, the three terms are clustered together
without separation and in document2 they are separated by 10 positions, so we
would expect the weight of search results to reflect this.

One possible weight of the search would take the list of matching terms for each
document and construct a weight like:
	Matching Weight = MAX(positions) - MIN(positions)

Using this metric for our test case, document1 would have a weight of 2 and
document2 would have a weight of 20. We can easily construct a method to
determine the best match.

But what if there are multiple positions at which the terms appear? For
instance, what if the term occurrences in documents 3 and 4 looked like this:
 doc3 term1:1,40,4 term2:2 term3:3
 doc4 term1:1 term2:2,30,50 term3:3,10

Now the proximity grouping of term1,2,3 is obscured by our previous weight:
documents 3 and 4 evaluate to weights of 39 and 49 for this search.

============================================
A sparse matrix method for term proximity
============================================
If we consider the list of positions for each term as a "position vector" that
has an entry for each place the term occurs in a document, then we can write our
terms and frequencies like a matrix for each document where each column is a
position vector with a 1 where the term exists in the document and zeros where
it doesn't exist, like this:
 doc1
 term1 term2 term3 term4
count at pos1 0 1 0 0
count at pos2 1 0 0 0
count at pos3 0 0 1 0
count at pos4 1 0 0 0
...
count at posN 0 0 1 0

The dimensions of this matrix are (Npositions x Nterms).

If we were to look for the query (term1 && term3) in the above document, we
could construct a query vector with a 1 in each of the 1 and 3 term positions
and zeros elsewhere like this:
 query
 term1 1
 term2 0
 term3 1
 term4 0

Now we can get a histogram of term occurrences by evaluating the matrix x vector
product like this:
 histogram = doc1 %*% query

This yields the following vector for this case (assuming we only have 4 positions):
 histogram
 term1 2
 term2 0
 term3 1
 term4 0

We desire a weighting by proximity: now that we have an arithmetic for
calculating intersections of queries with occurrences, perhaps we can amend
the weighting of the positions to produce a proximity weight.

Consider the following two vectors of term positions for terms 1 and 3 in
document 1:
 term1 term3
count at pos1 0 0
count at pos2 0 0
count at pos3 0 1
count at pos4 1 0
...
count at posN 0 1

In this case, we have two occurrences of term3, but only one is close to term1.
If we evaluate the histogram, we'll simply get two matches of term3 and one
match of term1 without any bias or consideration that we have terms close to
each other.
 histogram
 term1 1
 term2 0
 term3 2
 term4 0

What if we were to amend the term positions, allowing them to "bleed" across the
position boundaries like this:
 term1 term3
count at pos1 0 0
count at pos2 0 .1
count at pos3 .1 1
count at pos4 1 .1
count at pos5 .1 0
...
count at posN-1 0 .1
count at posN 0 1

We can then compose a "blended term vector" that combines these weighted term
position vectors (for instance as a position-wise sum) so that overlapping
positions reinforce each other. Where the two terms are separated by one
position, the peak value will be 1.1.

We can extend this blending to provide a distribution of values across a larger
range of positions, for example a distribution like:
	0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2

If we have a value of 1.8, we know that the terms are next to each other, and if
we have 1.2, we know they are 3 positions apart.

With multiple search terms, this becomes a continuous weighting factor that can
be normalized by the number of search terms.
