datafu-dev mailing list archives

From "Mohammad Amin" <ma...@linkedin.com>
Subject Re: Review Request 25049: DATAFU-67
Date Wed, 03 Sep 2014 20:58:09 GMT


> On Aug. 30, 2014, 4:30 a.m., wang jian wrote:
> > datafu-pig/src/main/java/datafu/pig/hash/SimHash.java, line 78
> > <https://reviews.apache.org/r/25049/diff/1/?file=668674#file668674line78>
> >
> >     Could you please share the tutorial that describes the algorithm? Are there
> >     any other SimHash algorithms we could also support?

The tutorial is here: http://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
There are other variants, which we can iterate on as SimHash v2.
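
For readers following the thread, here is a minimal sketch of one common SimHash formulation (per-bit voting over hashed shingles). It is an illustration only and an assumption on my part: it is not the code in this patch, and the linked tutorial may describe a different variant. The class name SimHashSketch, the choice of 64-bit FNV-1a as the feature hash, and the method names are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration; not the SimHash.java under review.
final class SimHashSketch {

    // 64-bit FNV-1a, used here only as a convenient feature hash (assumption).
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV prime
        }
        return h;
    }

    // Each shingle votes +1 or -1 on every bit position according to its hash;
    // the result has a 1 wherever the positive votes win.
    static long simHash(List<String> shingles) {
        int[] votes = new int[64];
        for (String shingle : shingles) {
            long h = fnv1a64(shingle);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= (1L << bit);
            }
        }
        return result;
    }

    // Number of differing bits; small distances suggest near-duplicate inputs.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

With this formulation, two lines that share most of their shingles tend to produce hashes with a small Hamming distance, which is what makes it usable for near-duplicate detection.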


> On Aug. 30, 2014, 4:30 a.m., wang jian wrote:
> > datafu-pig/src/main/java/datafu/pig/hash/SimHash.java, line 93
> > <https://reviews.apache.org/r/25049/diff/1/?file=668674#file668674line93>
> >
> >     It seems that only tri-grams are generated here rather than n-grams, and the
> >     input parameter "n" is not used in this function. Should we use some kind of
> >     sliding window to implement this?
> >     
> >     private List<String> computeNGramShingles(String line, int n) {
> >     
> >          List<String> result = new ArrayList<String>(n);
> >     
> >          String[] circularQueue = new String[n];
> >          StringTokenizer st = new StringTokenizer(line);
> >     
> >          int index = 0;
> >          int circularQueueSize = 0;
> >     
> >          StringBuffer strBuf = new StringBuffer();
> >     
> >          while (st.hasMoreElements()) {
> >              String token = st.nextToken();
> >              if (circularQueueSize == n)
> >              {
> >                  strBuf.setLength(0);
> >                  for(int pn = 0; pn < n; pn++)
> >                  {
> >                     if (pn > 0)
> >                     {
> >                         strBuf.append(" ");
> >                     }
> >                     strBuf.append(circularQueue[(index + pn) % n]);
> >                  }
> >                  result.add(strBuf.toString());
> >                  index = (index + 1) % n;
> >                  circularQueueSize--;
> >              }
> >              circularQueue[(index + circularQueueSize) % n] = token;
> >              if (circularQueueSize < n)
> >              {
> >                  circularQueueSize++;
> >              }
> >          }
> >     
> >          if (circularQueueSize == n)
> >          {
> >              strBuf.setLength(0);
> >              for(int pn = 0; pn < n; pn++)
> >              {
> >                  if (pn > 0)
> >                  {
> >                     strBuf.append(" ");
> >                  }
> >                  strBuf.append(circularQueue[(index + pn) % n]);
> >              }
> >              result.add(strBuf.toString());
> >          }
> >     
> >          return result;
> >     }
> >     
> >     The complete test class: https://github.com/king821221/coding/blob/master/NGram.java

Added an NGram function to replace the 3-gram one.
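
As a rough illustration of the sliding-window idea discussed above (not the code in the updated diff), a compact version could look like the sketch below. The class and method names are hypothetical, and it assumes whitespace-separated tokens, as StringTokenizer's default delimiters would give for ordinary text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the function added in the patch.
final class NGramSketch {

    // Returns every run of n consecutive whitespace-separated tokens,
    // joined by single spaces. Lines with fewer than n tokens yield no n-grams.
    static List<String> wordNGrams(String line, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<String>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of width n across the token array.
        for (int start = 0; start + n <= tokens.length; start++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
        }
        return result;
    }
}
```

This keeps all tokens in memory, which is simpler than the circular queue in the review comment; the queue version avoids that at the cost of more bookkeeping.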


- Mohammad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25049/#review51486
-----------------------------------------------------------


On Sept. 3, 2014, 8:55 p.m., Mohammad Amin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25049/
> -----------------------------------------------------------
> 
> (Updated Sept. 3, 2014, 8:55 p.m.)
> 
> 
> Review request for DataFu.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> DATAFU-67. Adding Simple SimHash to compute near duplicates.
> https://issues.apache.org/jira/browse/DATAFU-67
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/hash/SimHash.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/hash/HashTests.java 7ff8fb9 
> 
> Diff: https://reviews.apache.org/r/25049/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed.
> 
> 
> Thanks,
> 
> Mohammad Amin
> 
>

