spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingjie Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
Date Wed, 01 Mar 2017 00:19:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889180#comment-15889180
] 

Mingjie Tang commented on SPARK-19771:
--------------------------------------

If we follow the AND-OR framework, one optimization is here: 

At first, suppose we use BucketedRandomProjectionLSH, and  setNumHashTables(3) and setNumHashFunctions(5).

For one input vector with sparse format e.g., [(6,[0,1,2],[1.0,1.0,1.0])], 
we can compute its mapped hash vectors is
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], [-1.0,-1.0,-1.0,-1.0,0.0])]

For the similarity-join, we map this computed vector into
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[0,(6,[0,1,2],[1....|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1....|    1|[-1.0,0.0,0.0,-1....|
|[0,(6,[0,1,2],[1....|    2|[-1.0,-1.0,-1.0,-...|

Then, based on AND-OR principle, we using the entry and hashValue for equip-join with other
table . When we look at the the sql, it need to iterate through the hashValue vector for equal
join. Thus, this computation and memory usage cost is very high. For example, if we apply
the nest-loop for comparing two vectors with length of NumHashFunctions, the computation cost
is (NumHashFunctions* NumHashFunctions) and memory overhead is (N* NumHashFunctions). 

Therefore, we can apply one optimization technique here. that is, we do not need to store
the hash value for each hash table like [-1.0,-1.0,-1.0,0.0,0.0], instead, we use a simple
hash function to improve this. 

Suppose h_l= {h_0, h_1., ... h_k}, where k is the number of setNumHashFunctions. 
then, the mapped hash function is 
     H(h_l)=sum_{i =0...k} (h_i*r_i) mod prime.
where r_i is a integer  and the prime can be set as 2^32-5 for less hash collision.
Then, we only need to store the hash value H(h_l) for AND operation.  

The similar approach also can be applied for the approxNeasetNeighbors searching. 

If the multi-probe approach does not need to read the hash value for one hash function (e.g.,
h_l mentioned above), we can applied this H(h_l) to the preprocessed data to save memory.
I will double check the multi-probe approach and see whether it is work.  Then, I can submit
a patch to improve the current way. Note, this hash function to reduce the vector-to-vector
comparing is widely used in different places for example, the LSH implementation by Andoni
. 
 
[~diefunction] [~mlnick] 


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to discuss
the following questions before we goes to implementations:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message