hawq-commits mailing list archives

From m..@apache.org
Subject [5/5] incubator-hawq git commit: HAWQ-49. Remove madlib related codes and tests.
Date Tue, 16 Feb 2016 01:33:23 GMT
HAWQ-49. Remove madlib related codes and tests.

Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq/commit/f58baae0
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq/tree/f58baae0
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq/diff/f58baae0

Branch: refs/heads/master
Commit: f58baae0194fb94fc83b1dc7b8e3c7f5994ad115
Parents: 5199982
Author: Ming LI <mli@pivotal.io>
Authored: Mon Feb 15 15:39:07 2016 +0800
Committer: Ming LI <mli@pivotal.io>
Committed: Tue Feb 16 09:31:57 2016 +0800

 GNUmakefile.in                                 |    4 -
 contrib/gp_sparse_vector/Makefile              |   38 -
 contrib/gp_sparse_vector/README                |  132 --
 contrib/gp_sparse_vector/README.bitmap         |  259 ---
 contrib/gp_sparse_vector/README.devel          |   63 -
 contrib/gp_sparse_vector/README.update         |   72 -
 contrib/gp_sparse_vector/README_INSTALL        |    9 -
 contrib/gp_sparse_vector/README_term_proximity |  118 --
 contrib/gp_sparse_vector/SparseData.c          | 1572 -------------------
 contrib/gp_sparse_vector/SparseData.h          |  234 ---
 contrib/gp_sparse_vector/bugs                  |  122 --
 contrib/gp_sparse_vector/float_specials.h      |  171 --
 contrib/gp_sparse_vector/operators.c           |  939 -----------
 contrib/gp_sparse_vector/sfv_test_output       |  141 --
 contrib/gp_sparse_vector/sparse_vector.c       |  411 -----
 contrib/gp_sparse_vector/sparse_vector.h       |  175 ---
 contrib/gp_sparse_vector/test_output           |  296 ----
 pom.xml                                        |    2 -
 src/backend/catalog/Makefile                   |    2 +-
 src/backend/catalog/madlib.sql                 |  492 ------
 src/test/regress/expected/madlib_svec_test.out |  236 ---
 src/test/regress/known_good_schedule           |    2 -
 src/test/regress/sql/madlib_svec_test.sql      |   56 -
 src/tools/msvc/Mkvcbuild.pm                    |    2 +-
 24 files changed, 2 insertions(+), 5546 deletions(-)

diff --git a/GNUmakefile.in b/GNUmakefile.in
index 40e887b..f20ee2c 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -14,7 +14,6 @@ all:
 	$(MAKE) -C config all
 	$(MAKE) -C contrib/formatter_fixedwidth all
 	$(MAKE) -C contrib/extprotocol all	
-	$(MAKE) -C contrib/gp_sparse_vector all
 	$(MAKE) -C contrib/orafce all
 	$(MAKE) -C contrib/gp_cancel_query all
 	$(MAKE) -C contrib/hawq-hadoop all
@@ -27,7 +26,6 @@ install:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@
@@ -40,7 +38,6 @@ installdirs uninstall:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@
@@ -60,7 +57,6 @@ clean:
 	$(MAKE) -C config $@
 	$(MAKE) -C contrib/formatter_fixedwidth $@
 	$(MAKE) -C contrib/extprotocol $@
-	$(MAKE) -C contrib/gp_sparse_vector $@
 	$(MAKE) -C contrib/orafce $@
 	$(MAKE) -C contrib/gp_cancel_query $@
 	$(MAKE) -C contrib/hawq-hadoop $@

diff --git a/contrib/gp_sparse_vector/Makefile b/contrib/gp_sparse_vector/Makefile
deleted file mode 100644
index 50468be..0000000
--- a/contrib/gp_sparse_vector/Makefile
+++ /dev/null
@@ -1,38 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#   http://www.apache.org/licenses/LICENSE-2.0
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-#Intel compiler
-ifeq "$(CC)" "icc"
-	USE_ICC = 1
-override CFLAGS+=-std=gnu99
-MODULE_big = gp_svec
-OBJS = sparse_vector.o operators.o SparseData.o
-ifdef USE_PGXS
-	PGXS := $(shell pg_config --pgxs)
-	include $(PGXS)
-	subdir = contrib/gp_sparse_vector
-	top_builddir = ../..
-	include $(top_builddir)/src/Makefile.global
-	include $(top_srcdir)/contrib/contrib-global.mk
-ifdef USE_ICC
-	override CFLAGS=-O3 -Werror -std=c99 -vec-report2 -vec-threshold0

diff --git a/contrib/gp_sparse_vector/README b/contrib/gp_sparse_vector/README
deleted file mode 100644
index a05985e..0000000
--- a/contrib/gp_sparse_vector/README
+++ /dev/null
@@ -1,132 +0,0 @@
-					Sparse Vector Datatype
-When you use arrays of floating point numbers for various calculations, you will often have long runs of zeros.  This is common in many kinds of applications such as scientific computing, retail optimization and text processing.  Each floating point number takes 8 bytes of storage in memory and/or on disk, so storing those zeros is often impractical.  There are also many computations that benefit from skipping over the zeros.
-Say for example that we have the following array of doubles in Postgres stored as a "float8[]":
-    '{0, 33,...40,000 zeros..., 12, 22 }'::float8[]
-This array would occupy slightly more than 320KB of memory/disk, most of it zeros.  Also,
if we perform various operations on this, we'll be doing work on 40,001 fields that aren't
important.  This kind of array arises often in text processing, where the dictionary may have
40-100K terms and each vector will store the number of words in a particular document.
-We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap for null values, but since it is not optimized for float8[] or for long runs of zeros (as opposed to nulls), and its null bitmap is not RLE-compressed, it is a poor choice for this use case.  If we stored the zeros as NULL in the Postgres/GP array, the null bitmap would contain 5KB of bits to mark the nulls, which is not nearly as efficient as we'd like.
-A simple RLE scheme biased toward efficient representation of runs of zeros results in only 6 bytes of bitmap storage for this example.
-We also want to implement vector operators for our type that take advantage of an efficient
sparse storage format to make computations faster.
-What is it?
-The "Sparse Vector Datatype" is named "svec" and implements a vector with compressed storage
of zeros.  You can take the above vector and cast it into an svec like this:
-	('{0, 33,...40,000 zeros..., 12, 22 }'::float8[])::svec
-This is the same vector we input (I only used 5 zeros).
-We can use operators on the svec type such as <, >, *, **, /, =, +, SUM, COUNT_VEC, etc., and they have the meanings associated with typical vector operations.  The plus (+) operator adds each of the terms of two vectors of the same dimension together.  For instance, if we have a = {0,1,5} and b = {4,3,2}, then a+b = {4,4,7}.  We can see this using the svec type like this:
-	lukelonergan=# select ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
- 	float8  
-	---------
- 	{4,4,7}
-A vector dot product (*) between these two will result in a scalar result of type float8.
 The dot product should be (0*4+1*3+5*2)=13, like this:
-	lukelonergan=# select '{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec;
- 	?column? 
-	----------
-       	13
-A scalar product (**) between these two will result in the individual terms being multiplied.
 The scalar product should be {0*4,1*3,5*2}={0,3,10}, like this:
-	lukelonergan=# select ('{0,1,5}'::float8[]::svec ** '{4,3,2}'::float8[]::svec)::float8[];
-  	float8  
-	----------
- 	{0,3,10}
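The operator semantics above can be sketched outside the database.  This is an illustrative Python stand-in for the svec `*` and `**` operators, not the svec implementation itself:

```python
# Illustrative only: plain-Python equivalents of the svec operators above.

def dot(a, b):
    """Vector dot product, like svec '*': returns a scalar."""
    return sum(x * y for x, y in zip(a, b))

def scalar_product(a, b):
    """Element-wise product, like svec '**': returns a vector."""
    return [x * y for x, y in zip(a, b)]

print(dot([0, 1, 5], [4, 3, 2]))             # 13
print(scalar_product([0, 1, 5], [4, 3, 2]))  # [0, 3, 10]
```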
-Special vector aggregate functions are also useful.  SUM is self-explanatory.  COUNT_VEC (as opposed to COUNT) evaluates the count of non-zero terms found at each position across a set of svecs and returns an svec with the counts.  For instance, if we have {0,1,5},{10,0,3},{0,0,3},{0,1,0}, then COUNT_VEC() would return {1,2,3}:
-	lukelonergan=# create table list (a svec);
-	lukelonergan=# insert into list values ('{0,1,5}'::float8[]),('{10,0,3}'::float8[]),('{0,0,3}'::float8[]),('{0,1,0}'::float8[]);
-	INSERT 0 4
-	lukelonergan=# select COUNT_VEC(a)::float8[] from list;
- 	count_vec 
-	-----------
- 	{1,2,3}
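The COUNT_VEC aggregate's behavior can be mimicked in a few lines of Python (a hypothetical sketch of the semantics, not the aggregate's implementation):

```python
def count_vec(vectors):
    """Per-position count of non-zero entries, like the COUNT_VEC aggregate."""
    counts = [0] * len(vectors[0])
    for v in vectors:
        for i, x in enumerate(v):
            if x != 0:
                counts[i] += 1
    return counts

print(count_vec([[0, 1, 5], [10, 0, 3], [0, 0, 3], [0, 1, 0]]))  # [1, 2, 3]
```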
-For a text classification example, let's assume we have a dictionary composed of words in a text array:
-create table features (a text[][]) distributed randomly;
-insert into features values ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}');
-We have a set of documents, also defined as arrays of words:
-	lukelonergan=# create table documents(a int,b text[]);
-	lukelonergan=# insert into documents values (1,'{this,is,one,document,in,the,corpus}');
-	lukelonergan=# insert into documents values (2,'{i,am,the,second,document,in,the,corpus}');
-	lukelonergan=# insert into documents values (3,'{being,third,never,really,bothered,me,until,now}');
-	lukelonergan=# insert into documents values (4,'{the,document,before,me,is,the,third,document}');
-Now that we have a dictionary and some documents, we would like to do some document classification using vector arithmetic on word counts and proportions of dictionary words in each document.
-To start this process, we'll need to find the dictionary words in each document.  We'll prepare what is called a Sparse Feature Vector or SFV for each document.  An SFV is a vector of dimension N, where N is the number of dictionary words, and each entry of the SFV is a count of the corresponding dictionary word in the document.
-Inside the sparse vector toolset, we have a function that will create an SFV from a document,
so we can just do this:
-	lukelonergan=# select gp_extract_feature_histogram((select a from features limit 1),b)::float8[]
from documents;
-             	gp_extract_feature_histogram             
-	-----------------------------------------
- 	{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
- 	{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
- 	{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
- 	{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}
-Note that the output of gp_extract_feature_histogram is an svec for each document containing
the count of each of the dictionary words in the ordinal positions of the dictionary.  This
can more easily be understood by doing this:
-	lukelonergan=# select gp_extract_feature_histogram((select a from features limit 1),b)::float8[],b
from documents;
-             	gp_extract_feature_histogram             |                        b       
-	-----------------------------------------+--------------------------------------------------
- 	{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
- 	{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
- 	{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
- 	{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
-	lukelonergan=# select * from features;
-                                                   	a                                   
-	--------------------------------------------------------------------------------------------------------
- 	{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}
-Now when we look at the first document "i am the second document in the corpus", its SFV is {1,3*0,1,1,1,1,6*0,1,2,3*0}.  The word "am" is the first ordinate in the dictionary and there is 1 instance of it in the SFV.  "before" has no instances in the document, so its value is "0", and so on.
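The SFV construction itself is straightforward.  Here is a hypothetical Python sketch of what gp_extract_feature_histogram computes (the function name and SQL interface are the real ones above; this helper is purely illustrative):

```python
def extract_feature_histogram(dictionary, doc):
    """Count each dictionary word's occurrences in doc, in dictionary order."""
    index = {w: i for i, w in enumerate(dictionary)}
    sfv = [0] * len(dictionary)
    for word in doc:
        if word in index:
            sfv[index[word]] += 1
    return sfv

features = ['am', 'before', 'being', 'bothered', 'corpus', 'document', 'i',
            'in', 'is', 'me', 'never', 'now', 'one', 'really', 'second',
            'the', 'third', 'this', 'until']
doc = ['i', 'am', 'the', 'second', 'document', 'in', 'the', 'corpus']
print(extract_feature_histogram(features, doc))
# [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0]
```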
-gp_extract_feature_histogram is heavily optimized for speed - in essence it is a single-routine version of a hash join - and should be able to process large numbers of documents into their SFVs in parallel at the highest possible speeds.
-The rest of the classification process is all vector math.  As Brian Dolan describes the process:
-	The actual count is hardly ever used.  Instead, it's turned into a weight.  The most common
weight is called tf/idf for "Term Frequency / Inverse Document Frequency".
-	The calculation for a given term in a given document is {#times in this document} * log(#documents / #documents the term appears in).
-	For instance, the term "document" in document A would have weight 1 * log (4/3).  In document
D it would have weight 2 * log (4/3).
-	Terms that appear in every document would have tf/idf weight 0, since log (4/4) = log(1)
= 0.  (Our example has no term like that.)  That usually sends a lot of values to 0.
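The weight calculation above is a one-liner; as a sanity check (illustrative Python, not the svec implementation):

```python
import math

def tf_idf(term_count, n_docs, n_docs_with_term):
    """tf/idf as described above: count * log(#docs / #docs containing the term)."""
    return term_count * math.log(n_docs / n_docs_with_term)

# "document" appears once in document A; 3 of the 4 documents contain it.
print(tf_idf(1, 4, 3))  # 1 * log(4/3) ≈ 0.2876820724517809

# A term appearing in every document gets weight 0, since log(4/4) = 0.
print(tf_idf(2, 4, 4))  # 0.0
```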
-For this part of the processing, we'll need a sparse vector of the dictionary dimension (19) with the values log(#documents / #documents each term appears in).  There will be one such vector for the whole list of documents (aka the "corpus").  The #documents is just a count of all of the documents, in this case 4, but there is one divisor for each dictionary word, and its value is the count of documents in which that word appears.  This single vector for the whole corpus can then be scalar-product multiplied by each document SFV to produce the Term Frequency/Inverse Document Frequency weights.
-This can be done as follows:
-	lukelonergan=# create table corpus as (select a,gp_extract_feature_histogram((select a from
features limit 1),b) sfv from documents);
-	lukelonergan=# SELECT docnum, (a*logidf) tf_idf FROM (SELECT log(count(a)/count_vec(a))
logidf FROM corpus) foo, corpus ORDER BY docnum;
- 	 docnum |                                                                          tf_idf
-	--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
-      	      1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0}
-      	      2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0}
-      	      3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989}
-      	      4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0}
-We can now get the "angular distance" between one document and the rest of the documents
using the ACOS of the dot product of the document vectors:
-	lukelonergan=# CREATE TABLE weights AS (SELECT docnum, (a*logidf) tf_idf FROM (SELECT log(count(a)/count_vec(a))
logidf FROM corpus) foo, corpus ORDER BY docnum) DISTRIBUTED RANDOMLY;
-	Calculate the angular distance between the first document and each other document
-	lukelonergan=# SELECT docnum,180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(svec_l2norm(tf_idf)*svec_l2norm(testdoc))))/3.141592654)
angular_distance FROM weights,(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1)
foo ORDER BY 1;
- 	docnum | angular_distance 
-       --------+------------------
-             1 |                0
-             2 | 78.8235846096986
-             3 | 89.9999999882484
-             4 | 80.0232034288617
-We can see that the angular distance between document 1 and itself is 0 degrees and between
document 1 and 3 is 90 degrees because they share no features at all.
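The angular-distance formula in the query above can be reproduced directly (a hedged Python sketch of the math, with the same clamp as the dmin(1., ...) guard; not svec code):

```python
import math

def angular_distance_deg(u, v):
    """Angle in degrees between two weight vectors, via the normalized dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to 1.0, like dmin(1., ...) above, to guard against rounding past 1.
    return math.degrees(math.acos(min(1.0, dot / norm)))

# Vectors sharing no features are orthogonal: 90 degrees apart.
print(angular_distance_deg([1, 0], [0, 1]))  # ≈ 90
# A document compared to itself is 0 degrees away.
print(angular_distance_deg([1, 2, 3], [1, 2, 3]))  # ≈ 0
```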

diff --git a/contrib/gp_sparse_vector/README.bitmap b/contrib/gp_sparse_vector/README.bitmap
deleted file mode 100644
index 901a3f5..0000000
--- a/contrib/gp_sparse_vector/README.bitmap
+++ /dev/null
@@ -1,259 +0,0 @@
-Revision: 02/03/10
-I’m spending most of my part time on working out the actual mechanics (with examples) of
the matrix/vector implementation.
-What will help is to get some input on likely use-cases that can drive the work – though
I’ve got a good number of examples from scientific and econometric modeling, I could always
benefit from something concrete and immediate.
-Following is more on mechanics.    
-Another finding on indexes, RLE compression and matrices:
-1) input matrices come in various flavors, many of them in non-sorted “not ready for sparse compression” formats
-2) in many circumstances there is no a-priori choice of whether there are zeros or NVP
-Because of (1), in the input processing and sorting, it is necessary to create an uncompressed index that is identical to the single RLE index I’ve described previously, sans the special
-I’ve described the input processing here in a program I wrote to process matrices from the Matrix Market at NIST:
-                /* The Matrix Market format does not specify the order in which the
-                 * successive elements are stored.  We need to compress the matrix into
-                 * one of CRS or CCS format (row or column compression), so we need to
-                 * sort the matrix into one of those forms.
-                 *
-                 * Here we'll use quicksort to sort the matrix elements into row
-                 * contiguous storage.
-                 *
-                 * If we desire row contiguous storage, we'll also need the column order
-                 * within the row preserved.  We'll do this by computing the index into
-                 * the matrix using a "columns progressive" ordering, which is computed
-                 * as NJ * I + J. Then we'll sort based on that index.  The index value
-                 * will range from 0 to NJ*NI-1. For any index value, you can compute I
-                 * and J using:
-                 *      I = index/NJ
-                 *      J = index - NJ * I
-                 *
-                 * If we sort the matrix by this index, we'll have things in row/column
-                 * order, suitable for CRS compression.
-                 *
-                 * Note that if we want CCS, we can construct the index using:
-                 *      index = NI * J + I
-                 */
-In the Yale/MKL formats, we deconstruct the single index into a set of two values for each
run and that is stored as the compression index.
-I’m thinking now that for the RLE index we can do something along these lines:
-Treat all values symmetrically, that is: all runs of the various types we’re discussing
(inf,-inf,0,NVP,value) are described by the same kind of word 
-Each RLE word is composed of a leading set of descriptor bits, followed by a payload describing
how many of the value are found.
-So if we pick the descriptor bits = 3, we can represent 8 “things”, of which 5 things
must be “what kind of value is this?”, leaving 3 things.  A good use of the 3 things can
be word length, one of 8, 16 or 32.
-So if we go down this path, we’re left with 5, 13 and 29 bits to represent a run and can
dynamically choose which one to use based on the data content.  That’s 32, 8192 and 536,870,912
values in each run, which is perfect for capturing various kinds of sparsity.
-If we expand to one more descriptor bit, then we can represent 8 more things, which could
be used to recognize additional types of values to compress out.  That would leave us with
16, 4096 and 268,435,456 values in each run, which also works pretty well and leaves us room
for the future in case we find there are more values than {inf,-inf,0,NVP,value}.
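The run-length arithmetic above is easy to verify (an illustrative check, not code from the proposal):

```python
# Descriptor bits come off the top of each RLE word; the remaining payload
# bits bound the maximum run length that one word can represent.
def max_run(word_bits, descriptor_bits):
    return 2 ** (word_bits - descriptor_bits)

print([max_run(w, 3) for w in (8, 16, 32)])  # [32, 8192, 536870912]
print([max_run(w, 4) for w in (8, 16, 32)])  # [16, 4096, 268435456]
```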
-Some additional results to share (see table below) – I did a computation of the various
storage and compute requirements for the MKL/Yale index and the RLE index.  I used 5 matrix
examples from NIST’s “Matrix Market”, read them in and translated them from Matrix Market
format to MKL/RLE which required sorting them into CRS.  From this, I calculated the relevant
stats on each in terms of storage and number of words in each of the RLE and MKL indices.
-Bottom line is that the results corroborated my earlier calculations.  The RLE index is a
lot smaller than MKL always, and is more computationally efficient for linear scans through
the compressed matrix.  This means we should use only RLE for most use-cases and calculate
MKL only when we need to use the algorithms in the matrix solver libraries.  We can optionally
store the MKL index if we calculate it – we can provide a marker to indicate the presence/absence
of it in the type. 
-I’ve also thought through the compatibility with all the sparse solver libs – seems like
we have easy compatibility with a CRS-based approach, so I think that should be the baseline.
 We can provide another marker to indicate what is being stored – CRS, CCS, CDS, etc.
-I believe I have enough information to produce a draft design of the matrix type that we
can review and get some other feedback about.  Here is a sketch of how I think it should work:
-Two separate types, one for vectors and one for matrices – but they share the same format
and support routines.  The separate types allow definition of vector- and matrix-specific operators.
-The format stores the following metadata about the datum: 
-    nD, (n(i),i=1,nD): dimension and rank – for a vector, this is {1,N}; for a matrix it’s
-    nZ: number of values (non-special values) 
-    valType: type of value, one of {bit, byte, int4, int8, real4, complex4, real8, complex8}

-    storType: type of storage, one of {CRS, CCS, CDS, ... Up to maybe 32 total with some
-    specValList: a list of special values to be used in the compression, up to 8, e.g. {inf,-inf,NVP,0}

-    hasMKL: if true, the MKL index is stored 
-The format stores the following data: 
-    values: a dense array of values, size is (nZ*sizeof(valType)) 
-    rleIndex: an RLE index containing location of all values (including specials) 
-    sizeofRleIndex: the size of the RLE index for allocation purposes 
-    if (hasMKL) mklIndex: the MKL/Yale index, size is ((nZ+NI)*sizeof(int))
-In order to efficiently compress the storage of multiple data types, we are
-using a bitmap that describes the positions of four special 64-bit floating
-point values: zero, negative infinity, positive infinity, and "no value present"
-or NVP.  Of these, the first three are all standard floating point values that
-can be the natural result of performing calculations. NVP is a special case
-often used in statistics and other analytical processing but is not
-representable as a standard floating point number.
-Here we're using two patterns for sparse storage and computation: 
-1) recognition of compressible types and subsequent bitmap compression
-2) optimized computational usage of previously compressed values
-(1) recognizes that we're going to have many situations where special values
-arise as a natural consequence of computation.  In these cases, we will
-automatically locate and compress special values into a special bitmap storage
-(2) leverages the pre-existing compressed special value storage for optimized
-computation.  For example, if we have a vector of values V and a large fraction
-of the values in V are zero, then computing "W = 1/V" will produce a
-correspondingly large number of positive infinity values in W.  With our
-compressed bitmap storage of zeros, we can loop over the compressed zero
-elements and change their storage type from "zero" to "infinity", thereby
-saving the computation, memory and re-compression overheads.
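Pattern (2) can be sketched with a toy run-list representation (hypothetical names and structure, purely to illustrate the relabeling trick described above):

```python
# Toy runs: each entry is either ("zero", run_length) for a compressed run of
# zeros, or ("values", [...]) for a packed list of non-special values.

def reciprocal_runs(runs):
    """Compute W = 1/V on compressed runs: zero runs become inf runs in place,
    with no per-element work; only non-special values are reciprocated."""
    out = []
    for kind, payload in runs:
        if kind == "zero":
            out.append(("inf", payload))
        else:
            out.append(("values", [1.0 / x for x in payload]))
    return out

v = [("zero", 8192), ("values", [2.0, 4.0]), ("zero", 8192)]
print(reciprocal_runs(v))
# [('inf', 8192), ('values', [0.5, 0.25]), ('inf', 8192)]
```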
-It's important to have a bitmap format that is both storage efficient and
-optimal for computation.  There are many existing formats for sparse vector and
-matrix storage, most of them developed for scientific computation.  The Intel
-Math Kernel Library (MKL) specifies the most common sparse matrix storage
-format, derived from the Yale Sparse Matrix format:
- "The Intel MKL direct sparse solvers use a row-major upper triangular storage
-format: the matrix is compressed row-by-row and for symmetric matrices only non-
-zero elements in the upper triangular half of the matrix are stored."
-  The Intel MKL sparse matrix storage format for direct sparse solvers is
-specified by three arrays: values, columns, and rowIndex. The following table
-describes the arrays in terms of the values, row, and column positions of the
-non-zero elements in a sparse matrix. 
-	values 	 - A real or complex array that contains the non-zero elements
-		   of a sparse matrix. The non-zero elements are mapped into the
-		   values array using the row-major upper triangular storage
-		   mapping described above. 
-	columns  - Element I of the integer array columns is the number of the
-		   column that contains the I-th element in the values array. 
-	rowIndex - Element j of the integer array rowIndex gives the index of
-		   the element in the values array that is first non-zero
-		   element in a row j. 
-  The length of the values and columns arrays is equal to the number of non-zero
-elements in the matrix."
-- Intel Math Kernel Library Reference Manual, March 2009, Document Number:
-  630813-031US
-The MKL format stores the non-zero values compactly in a sequential array of
-either complex or double precision entries and maintains two index arrays that
-identify the location of the nonzero values in a two dimensional matrix.  The
-advantages of the MKL format are its efficiency for computation and the fact
-that it's in widespread usage in numerical libraries.
-The uncompressed size of a real matrix with NI rows and NJ columns would be:
-	Raw size = (8 bytes) * ( NI*NJ )
-The storage required by the MKL format for a real matrix with N nonzero values
-and NI number of rows is:
-	Comp Size = (8 bytes) * ( 2N  + NI )
-For a complex matrix it's:
-	Comp Size = (8 bytes) * ( 3N  + NI )
-Another way of representing the storage is to assess the "overhead" associated
-with the representation of the zero entries, which is the same for real and
-complex matrices:
-	Overhead = (8 bytes) * ( N + NI )
-Now we can calculate the "savings" associated with using the MKL compressed
-format for a real matrix:
-	Savings = (Raw Size) - (Comp Size)
-		  (8 bytes) ( NI*NJ - 2N - NI )
-If we have a dense matrix, then N will be NI*NJ and the savings is equal to the
-negative overhead, which means we'll have a worst case savings of:
-	Dense (worst case) savings = -(8 bytes)(NI*NJ + NI)
-	                           = -(8 bytes)*NI*(NJ+1)
-For a dense matrix, this is a 100%+ storage penalty.
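The MKL storage formulas above can be checked numerically (an illustrative sketch using the formulas as stated, not MKL code):

```python
def mkl_sizes(N, NI, NJ):
    """Raw size, MKL compressed size, and savings in bytes for a real matrix,
    per the formulas above: Raw = 8*NI*NJ, Comp = 8*(2N + NI)."""
    raw = 8 * NI * NJ
    comp = 8 * (2 * N + NI)
    return raw, comp, raw - comp

# Dense worst case: N = NI*NJ, so savings = -(8 bytes)*NI*(NJ+1).
NI, NJ = 100, 100
raw, comp, savings = mkl_sizes(NI * NJ, NI, NJ)
print(raw, comp, savings)  # 80000 160800 -80800
assert savings == -8 * NI * (NJ + 1)
```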
-There is one important shortcoming of the MKL format: it can only represent one
-type of special value - zero.  This is a consequence of the implicit nature of
-the approach, the location of the zero values is not stored, it is implied by
-the lack of a non-zero value in that matrix position.  There is no place to
-provide an attribute of the special value so there can be only one type of
-special value.
-We need an alternative format that can represent different types of special
-values. It will need to refer to the special values explicitly in an efficient
-manner. We can take advantage of the nature of most all sparse storage
-circumstances: the special values occur in relatively large runs compared to
-the non-zero values.  We can use a "run length encoding" or RLE approach to
-compose a compressed special value array, in addition to storing a regular 
-packed array of the non-special values.  The index location of the non-special
-values can be inferred from the RLE array of special values during computation.
-Let's represent the array of special values using a 16-bit word to represent
-each run of special values, and an 8 bit word for each run of non-special
-values. Each of the two word types uses the first bit (the sign bit) to
-represent the run as consisting of special values (positive) or non-special
-values (negative). Each special value word uses the next 2 bits to describe what
-kind of special values are in the run, one of zero, inf, -inf and NVP.
-A complete array will be represented as a collection of these two types of
-words, terminated by a special "end of array" word consisting of 16 bits of 
-zeros (0x0000). For example:
-	<special values><non special values><special values><0x0000>
-This can be seen in bytes form:
-	<2 bytes><1 byte><2 bytes><2 bytes>
-And if the special values are zeros (type 0), with runs of 8192 in each of the
-two runs with a run of 128 non-special values, the bits in this array would be:
-	<0001111111111111><11111111><0001111111111111><0000000000000000>
-Note that if there are typically a large number of zeros followed by only one
-non-zero value, then this representation will waste a lot of space in the
-non-special value words.  We could shorten the representation of non-special
-values to say 5 bits, which would allow for runs of 16 to be represented by
-each word and we would save three bits per run.  However, the alignment of the
-index computations would be less than 1 byte, which would likely reduce the
-computational efficiency.
-If a given run has more than the maximum representable by the word, words are
-concatenated as needed to represent the run.
-For matrices we can use the compressed row format, or the compressed column
-format - for purposes of the efficiency discussion let's assume compressed rows.
-The storage efficiency of this approach is more complicated to compute, as the
-storage will depend on the dispersion of values.  We can compute best case and
-worst case scenarios however.
-A reasonable "best case" scenario will have one non-special value per row. In
-this case the storage required by the RLE format for a real matrix with N nonzero values,
NI rows and NJ columns is:
-	Comp Size = (9 bytes) * N + (2 bytes) *(NJ*(NI/8192) +1)
-		  = (9 bytes) *NI + (2 bytes) *(NJ*(NI/8192) +1)
-			for the common case of NI < 8192:
-		  = (9 bytes) *NI + (2 bytes) *(NJ +1)
-In this case, the storage is O(NI+NJ).  For instance, if we have a matrix of
-size 2,000 by 2,000, the storage is (9 bytes)*2,000 + (2 bytes)*2,001, or about 22KB.
-For a worst case scenario, we can assume alternating non-special and special
-values, in which case the storage is:
-	Comp Size = (9 bytes) * N + (2 bytes) * (NJ*(NI/2) +1)
-		  = (9 bytes) *NJ*(NI/2) + (2 bytes) * (NJ*(NI/2) +1)
-		  = (11 bytes) *NJ*(NI/2) + (2 bytes)
-What's interesting is that this worst case size is still smaller (5.5 vs 8
-bytes per element) than storing the full dense matrix.  This is due to the more efficient
-representation of the special values in 2 byte format. The overhead is nearly
-equivalent to storing the special values in addition to the non-special values.
-If we have a dense matrix, there is almost zero overhead because the special
-value array is blank and we only store the non-special data values.
-If we compare the MKL approach savings to the RLE approach in the best case
-scenario, we have an RLE savings of:
-	Savings = (Raw Size) - (Comp Size)
-		= (8 bytes) *NI*NJ - (9 bytes)*NI - (2 bytes)*(NJ +1)
-			for NI ~ NJ:
-		~ (8 bytes) *NI*NJ - (11 bytes)*NI
-		~ (8 bytes) *NI*(NJ-1)
-For MKL in the same best case scenario (N = NI), the savings is:
-	Savings = (8 bytes) * ( NI*NJ - 3NI )
-	        = (8 bytes) *NI*(NJ-3)
-The savings in the best case is nearly the same.  In the worst case scenario of
-a dense matrix, the savings of the RLE approach is nearly 0, which is the best
-result possible.  For MKL however, the savings in the worst case is more than
-a 100% storage penalty.
-Most cases are somewhere between best case and worst case.  We can conclude
-that the storage of the RLE scheme is going to be more efficient than MKL in
-all cases.
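The best-case comparison can be checked numerically.  A sketch using the formulas above, with the common-case simplification NI < 8192 and one non-special value per row (so N = NI for both formats):

```python
def rle_best_case_bytes(NI, NJ):
    """RLE best case per the formula above: (9 bytes)*NI + (2 bytes)*(NJ + 1)."""
    return 9 * NI + 2 * (NJ + 1)

def mkl_bytes(N, NI):
    """MKL compressed size per the formula above: (8 bytes)*(2N + NI)."""
    return 8 * (2 * N + NI)

NI = NJ = 2000
print(rle_best_case_bytes(NI, NJ))  # 22002 bytes, O(NI+NJ)
print(mkl_bytes(NI, NI))            # 48000 bytes for MKL with N = NI
```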

diff --git a/contrib/gp_sparse_vector/README.devel b/contrib/gp_sparse_vector/README.devel
deleted file mode 100644
index ed39616..0000000
--- a/contrib/gp_sparse_vector/README.devel
+++ /dev/null
@@ -1,63 +0,0 @@
-Structure of the sparse type support
-There are two parts to the sparse types:
-1) an in-memory data structure and computational library for sparse data
-called SparseData
-2) a Greenplum/Postgres datatype for each case, initially a sparse vector
-which uses only (double) as its data type
-Eventually there will be sparse vector support for other base types like float,
-complex double, integer, char and bit.  There will also be matrix support,
-which will use the SparseData structure to compose higher dimension structures.
-The Sparse Vector type is a variable length Postgres datatype, which means that
-its data is stored in a "varlena".  A varlena is a structure whose first item
-is an integer that identifies the length of its data area.
-For the Sparse Vector, the varlena has a leading integer that describes the size
-or "dimension" of the vector it represents, followed by a serialized SparseData
-of type FLOAT8OID.
-All of the operators required for the datatype support are wrappers that use the
-SparseData operators.  In order to use a SparseData operator, the serialized
-SparseData inside the Sparse Vector is used to create an in-place accessible
-SparseData (no copy required), and then this is passed to the SparseData
-operator.
-If you want to implement a new operator or function for Sparse Vector, you should
-try to implement it inside of operators.c using existing SparseData functions or
-building on other Sparse Vector functions first.  If something isn't there, you
-can build a base level function inside of the SparseData.h as an inline function
-or inside SparseData.c as a non-inline function, then wrap it in a Sparse Vector
-function in operators.c.
-The sparse_vector.c file is for creation and management functions related to
-the Sparse Vector datatype.
-Code structure
-SparseData structure and functions:
-SparseData.c, SparseData.h
-Sparse vector datatype:
-sparse_vector.c, sparse_vector.h, operators.c
-Functions for text processing (Sparse Feature Vector or SFV):
-Definitions for IEEE double and float types including special denormalized
-types defined for use herein as "no value present":
-float_specials.h

diff --git a/contrib/gp_sparse_vector/README.update b/contrib/gp_sparse_vector/README.update
deleted file mode 100644
index c7d84d4..0000000
--- a/contrib/gp_sparse_vector/README.update
+++ /dev/null
@@ -1,72 +0,0 @@
-On Dynamic Word Size for the RLE index
-I thought we'd use a method where we identify the size of the next word in
-the index by leveraging the top two bits of the word to describe how large
-the following word is.
-This idea works if the word is laid out with the "top bits" inside the
-first byte.  However, little endian systems (Intel, AMD) store the topmost
-bits of their words in the *last* byte of the word, which means that we
-can't tell what size the word is without reading the word.  We can't
-read the word without knowing the size, so this is a non-starter.
-We could encode the size information into the bottom bits of the word,
-but then we end up with different code paths for big endian systems and
-little endian systems.  This is dangerous, and means we would not have
-a portable binary format between systems with different endian-ness.
-An alternative strategy would use the entire first byte of an RLE word
-to encode the following word size.  A modification of this scheme would
-encode the word size into the top two bits of the first byte, and if
-the word size is a single byte, then the remainder of the byte would
-represent the encoded length. This preserves the worst-case compression
-of the previous scheme and introduces an additional overhead to the 
-middle compression cases equal to one byte per RLE word.
-A multi-platform friendly dynamic word size RLE scheme
-We have a sequence of numbers and we want a scheme that will store them in
-the most compact format possible using binary types of 8, 16, 32 and 64 bit
-widths.
-Let's say we have the following sequence: {13, 2000, 2*10^9, 3*10^12}. We
-can see that it would require one byte to minimally store the first number,
-two to store the second, four to store the third and eight for the fourth.
-We can store the sequence using these minimum widths and we will have an
-array with sizes (in bytes): {1,2,4,8}.  However, to read this array, we
-need to be able to determine the width of each word.
-Here we will use a "length word" to describe the length of the storage
-type for each entry.  The size of the length word will always be 1 byte,
-though in the case where the stored number is less than 127, we will not
-use a length word; we'll just store the negated value in a single byte.
-Otherwise, the value of the length word will be one of 1, 2, 4 or 8 to
-indicate the width of the following storage word.
-So our example would be stored in an array with the following word widths
-and contents (in parentheses):
-	{1(-13), 1(2), 2(2000), 1(4), 4(2*10^9), 1(8), 8(3*10^12)}
-The total size of this array is 18 bytes, which should be compared to the
-size required if an 8 byte width was used to store all words, in which
-case we would need 32 bytes to store the array.  This is a savings of
-nearly half in this simple case.  In the worst case where we are storing
-numbers that all require 8 byte widths, we would be adding one byte per
-word, or a 12.5% overhead.  If we have lots of 1 byte words interspersed
-with an occasional 8 byte word, the savings move toward an 8:1 compression
-ratio.
-The algorithm for reading this array is as follows:
-	Read first byte "B"
-	if (B < 0)
-		N = -(B)
-	else
-		Copy "B" byte word into N
-We have to be careful about alignment when reading the dynamic sized word
-from the array - we can't just cast the next "B" bytes into the appropriate
-sized type and dereference it.  This would cause an unaligned access,
-which will result in an abort on some architectures and poor performance on
-others.
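The read algorithm above, together with a matching encoder, can be sketched as follows (a hedged illustration in Python rather than the library's C; the function names and the little-endian byte order are our assumptions, and the example values are chosen so each one actually fits its stated width):

```python
import struct

# Width -> struct format; assumption: little-endian on-disk layout.
_FMT = {1: "b", 2: "<h", 4: "<i", 8: "<q"}

def encode(values):
    out = bytearray()
    for n in values:
        if 0 <= n <= 127:
            out += struct.pack("b", -n)       # small value: one negated byte
        else:
            width = next(w for w in (2, 4, 8) if n < 1 << (8 * w - 1))
            out += struct.pack("b", width)    # 1-byte length word
            out += struct.pack(_FMT[width], n)
    return bytes(out)

def decode(data):
    values, i = [], 0
    while i < len(data):
        b = struct.unpack_from("b", data, i)[0]  # read first byte "B"
        i += 1
        if b <= 0:
            values.append(-b)                    # B < 0: N = -B
        else:
            # B is the width; copy the next B bytes into N.
            values.append(struct.unpack_from(_FMT[b], data, i)[0])
            i += b

    return values

seq = [13, 2000, 2 * 10**9, 3 * 10**12]
blob = encode(seq)
print(len(blob))            # 1 + (1+2) + (1+4) + (1+8) = 18 bytes
print(decode(blob) == seq)  # True
```

Note that `struct.unpack_from` performs a byte-wise copy, which sidesteps the unaligned-access concern raised above; a C implementation would need an explicit `memcpy` into an aligned temporary.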

diff --git a/contrib/gp_sparse_vector/README_INSTALL b/contrib/gp_sparse_vector/README_INSTALL
deleted file mode 100644
index 0975541..0000000
--- a/contrib/gp_sparse_vector/README_INSTALL
+++ /dev/null
@@ -1,9 +0,0 @@
-	- Greenplum Database 3.3 or higher
-	- A "C" compiler
-To install:
-	- Make sure that you have the Greenplum binary distribution in your path
-	- type 'make install'
-	- install the function and type definitions with "psql -d <your_db_name> -f gp_svec.sql"
-	- <optional> run tests with "psql -d <your_db_name> -af gp_svec_test.sql | diff - test_output"

diff --git a/contrib/gp_sparse_vector/README_term_proximity b/contrib/gp_sparse_vector/README_term_proximity
deleted file mode 100644
index 792d070..0000000
--- a/contrib/gp_sparse_vector/README_term_proximity
+++ /dev/null
@@ -1,118 +0,0 @@
-In finding the proximity of several terms, we first need to represent the
-positions of the terms in the document.  Let's say we use the following to
-represent a set of term occurrences:
-	doc1	term1:1		term2:2		term3:3
-	doc2	term1:30	term2:10	term3:20
-If we performed a search query against these two documents using the query:
-(term1 && term2 && term3), we would like the weight of results to reflect
-proximity of these terms.  In document1, the three terms are clustered together
-without separation and in document2 they are separated by 10 positions, so we
-would expect the weight of search results to reflect this.
-One possible weight of the search would take the list of matching term
-positions for each document and construct a weight like:
-	Matching Weight = MAX(positions) - MIN(positions)
-Using this metric for our test case, document1 would have a weight of 2 and
-document2 would have a weight of 20.  We can easily construct a method to
-determine the best match.
-But what if there are multiple positions at which the terms appear? For
-instance, what if the term occurrences in documents 3 and 4 looked like this:
-	doc3	term1:1,40,4	term2:2		term3:3
-	doc4	term1:1     	term2:2,30,50	term3:3,10
-Now the proximity grouping of term1,2,3 is obscured by our previous weight:
-documents 3 and 4 evaluate to weights of 39 and 49 for this search.
-A sparse matrix method for term proximity
-If we consider the list of positions for each term as a "position vector" that
-has an entry for each place the term occurs in a document, then we can write our
-terms and frequencies like a matrix for each document where each column is a
-position vector with a 1 where the term exists in the document and zeros where
-it doesn't exist, like this:
-	doc1
-	    	term1		term2		term3		term4
-count at pos1	0		1		0		0
-count at pos2	1		0		0		0
-count at pos3	0		0		1		0
-count at pos4	1		0		0		0
-count at posN	0		0		1		0
-The dimensions of this matrix are (Npositions x Nterms).
-If we were to look for the query (term1 && term3) in the above document, we
-could construct a query vector with a 1 in each of the 1 and 3 term positions
-and zeros elsewhere like this:
-	query
-		term1	1
-		term2	0
-		term3	1
-		term4	0
-Now we can get a histogram of term occurrences by evaluating the matrix x vector
-product like this:
-	histogram 	= |doc1| %*% query
-This yields the following vector for this case (assuming we only have 4 positions):
-	histogram
-		term1	2
-		term2	0
-		term3	1
-		term4	0
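The counts above can be reproduced numerically; this is a hedged sketch (assuming numpy, with rows as positions and columns as terms) in which the term-indexed histogram is computed as per-term occurrence counts masked by the query vector:

```python
import numpy as np

# First 4 positions only, per "assuming we only have 4 positions" above.
doc1 = np.array([[0, 1, 0, 0],   # pos1: term2
                 [1, 0, 0, 0],   # pos2: term1
                 [0, 0, 1, 0],   # pos3: term3
                 [1, 0, 0, 0]])  # pos4: term1
query = np.array([1, 0, 1, 0])   # (term1 && term3)

# Per-term occurrence counts, masked by the query terms.
histogram = doc1.sum(axis=0) * query
print(histogram.tolist())  # [2, 0, 1, 0]
```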
-We desire a weighting by proximity: now that we have an arithmetic for
-calculating intersections of queries with occurrences, perhaps we can amend
-the weighting of the positions to produce a proximity weight.
-Consider the following two vectors of term positions for terms 1 and 3 in
-document 1:
-	    	term1		term3
-count at pos1	0		0
-count at pos2	0		0
-count at pos3	0		1
-count at pos4	1		0
-count at posN	0		1
-In this case, we have two occurrences of term3, but only one is close to term1.
-If we evaluate the histogram, we'll simply get two matches of term3 and one
-match of term1 without any bias or consideration that we have terms close to
-each other.
-	histogram
-		term1	1
-		term2	0
-		term3	2
-		term4	0
-What if we were to amend the term positions, allowing them to "bleed" across the
-position boundaries like this:
-	    	term1	term3
-count at pos1	 0	 0
-count at pos2	 0	.1
-count at pos3	.1	 1
-count at pos4	 1	.1
-count at pos5	.1	 0
-count at posN-1	 0	.1
-count at posN	 0	 1
-We can then compose a "blended term vector" that is the elementwise sum of
-these weighted term position vectors, so that overlapping and adjacent
-positions reinforce each other.  Where the two terms are separated by one
-position, there will be a value of 1.1.
-We can extend this blending to provide a distribution of values across a larger
-range of positions.  For example, if we use a distribution like:
-	0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2
-then a value of 1.8 tells us the terms are next to each other, and a value
-of 1.2 tells us they are 3 positions apart.
-With multiple search terms, this becomes a continuous weighting factor that can
-be normalized by the number of search terms.
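The blending above can be sketched as a convolution of each term's position vector with the distribution (the convolution framing and the numpy usage are our assumptions; the README does not name the operation), with the 1.8 and 1.2 scores arising from the combined peaks:

```python
import numpy as np

# The distribution from the text, centered on the term's position.
kernel = np.array([0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2])

def blend(positions, n):
    # Build a 0/1 position vector and "bleed" it across neighbors.
    vec = np.zeros(n)
    vec[positions] = 1.0
    return np.convolve(vec, kernel, mode="same")

n = 20
score_adjacent = (blend([5], n) + blend([6], n)).max()
score_apart3   = (blend([5], n) + blend([8], n)).max()
print(score_adjacent)  # ~1.8: terms next to each other
print(score_apart3)    # ~1.2: terms 3 positions apart
```

With multiple terms, the maximum of the summed blended vectors gives a single proximity score that can then be normalized by the number of search terms.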
