datafu-dev mailing list archives

From "Eyal Allweil (JIRA)" <>
Subject [jira] [Updated] (DATAFU-47) UDF for Murmur3 (and other) Hash functions
Date Tue, 05 Dec 2017 14:45:00 GMT


Eyal Allweil updated DATAFU-47:
    Attachment: DATAFU-47-new.patch

I looked at the review board for this issue, and fixed the merge conflicts in HashTests and
addressed the comments that were left. It depends on DATAFU-50, which was reopened, but I
put a new patch there so that we can proceed with both.

Since I didn't create the review, I can't upload a new diff there, but I've attached it to
the Jira issue, and commented in the review board where appropriate.

Tests pass, and I've run the content of "hasherTest" on a cluster using the assembled DataFu
jar to make sure that the autojarring of the new Guava version works properly.

I'll respond to the review board comments later.

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>                 Key: DATAFU-47
>                 URL:
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>         Attachments: 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch,
0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good statistical properties),
SipHash-2-4 (a fast cryptographically secure hash), crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a murmur3
hash of the given length. Murmur3 is fast and has exceptionally
good statistical properties; it's a good choice if all you need is good mixing of the inputs.
It is _not_ cryptographically secure; that is, given an output value from murmur3, there
are efficient algorithms to find an input yielding the same output value. Supply the seed
as a string that Integer.decode
can handle.
> * 'sip24', [optional seed]: Returns a 64-bit SipHash-2-4.
SipHash is competitive in performance with Murmur3, and is simpler and faster than the cryptographic
algorithms below. When used with a seed, it can be considered cryptographically secure: given
the output from a sip24 instance but not the seed used, we cannot efficiently craft a message
yielding the same output from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to Java's Adler32 Checksum.
> * 'crc32':   Returns a CRC-32 checksum (32 hash bits) by delegating to Java's CRC32 Checksum.
> * 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
> * 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose, non-cryptographic-strength,
streaming hash function that produces hash codes of length at least minimumBits. Users without
specific compatibility requirements and who do not persist the hash codes are encouraged to
choose this hash function. (Cryptographers, like dieticians and fashionistas, occasionally
realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you track What
the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this Fall.) Values for this
hash will change from run to run.
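
The output widths listed above for the JDK-backed hashes can be checked with the standard library alone. The following is a minimal sketch, not code from the patch: the class name HashWidths and its helper methods are hypothetical, and it covers only the checksums and digests that ship with the JDK (murmur3 and sip24 come from Guava and are not shown).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper class for illustration; not part of the DATAFU-47 patch.
class HashWidths {
    // Output width, in bits, of a JDK MessageDigest algorithm.
    static int digestBits(String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            return md.digest("datafu".getBytes(StandardCharsets.UTF_8)).length * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Feed a UTF-8 string through a java.util.zip checksum and return its value.
    static long checksum(Checksum c, String input) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(alg + ": " + digestBits(alg) + " bits");
        }
        System.out.println("crc32:   " + Long.toHexString(checksum(new CRC32(), "datafu")));
        System.out.println("adler32: " + Long.toHexString(checksum(new Adler32(), "datafu")));
    }
}
```

Both checksum values fit in 32 bits even though getValue() returns a long.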
> Examples: 
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH       datafu.pig.hash.Hasher('good-32');
>   define BetterH     datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H        datafu.pig.hash.Hasher('md5');
>   define SHA1H       datafu.pig.hash.Hasher('sha1');
>   define SHA256H     datafu.pig.hash.Hasher('sha256');
>   define SHA512H     datafu.pig.hash.Hasher('sha512');
>   data_in = LOAD 'input' as (val:chararray);
>   data_out = FOREACH data_in GENERATE
>     DefaultH(val),   GoodH(val),       BetterH(val),
>     MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
>     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>     SHA1H(val),       SHA256H(val),    SHA512H(val),
>     MD5H(val)
>     ;
>   STORE data_out INTO 'output';
> {code}
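
The seed strings used in the examples above ('0x0', '0x56789abc', '-12345678') are in the forms that Integer.decode accepts. Here is a small sketch of how such strings parse, assuming the UDF passes the seed straight to Integer.decode as the javadoc suggests; the class SeedDemo and the method parseSeed are hypothetical names, not taken from the patch.

```java
// Hypothetical illustration of seed parsing; not code from the patch.
class SeedDemo {
    // Integer.decode accepts decimal, hex (0x, 0X, #) and octal (leading 0)
    // literals, each with an optional leading sign.
    static int parseSeed(String seed) {
        return Integer.decode(seed);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0x0", "0x56789abc", "-12345678", "010"}) {
            System.out.println(s + " -> " + parseSeed(s));
        }
        try {
            // The magnitude must fit in a signed int, so this is rejected.
            parseSeed("0xffffffff");
        } catch (NumberFormatException e) {
            System.out.println("0xffffffff -> NumberFormatException");
        }
    }
}
```

Two things to watch for: a leading zero makes the seed octal, and a hex seed with the top bit set has to be written in signed form (for example '-0x1') because Integer.decode rejects magnitudes above Integer.MAX_VALUE.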
> In practice: 
> {code}
>   -- Consistent shuffle of large dataset with only one full-table reduce step. 
>   -- Every pig run with the same seed will generate sorted output in the same order
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   -- Force each file to go in whole to a single mapper (or in the LOAD use -tagSplit,
>   -- to be added in future Pig version)
>   SET mapred.max.split.size 1099511627776;
>   -- -tagPath option labels each file
>   data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS (path:chararray, val:chararray);
>   data_numbered = RANK data_in;
>   data_ided = FOREACH data_numbered GENERATE
>     MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
>     val AS val;
>   data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
>   STORE data_shuffled INTO 'data_shuffled';
> {code}
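
The idea behind the script above is that sorting records by a hash of (path, rank) yields an order that looks random but is reproducible from run to run. Below is a rough single-machine sketch of that idea. It makes one loud substitution: murmur3-32 lives in Guava, so the sketch uses the first four bytes of an MD5 digest as a stand-in hash to stay within the JDK. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; MD5 is a stand-in for the murmur3-32 hash in the Pig script.
class ConsistentShuffle {
    // Hash "path#rank" to a 32-bit key, mirroring CONCAT(path, '#', rank) in the script.
    static int shuffleKey(String path, long rank) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest((path + "#" + rank).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getInt(); // first 4 bytes of the digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Order the values by the hash of (path, 1-based position), like ORDER ... BY shuffle_key.
    static List<String> shuffle(String path, List<String> vals) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < vals.size(); i++) {
            idx.add(i);
        }
        idx.sort(Comparator.comparingInt(i -> shuffleKey(path, i + 1)));
        List<String> out = new ArrayList<>();
        for (int i : idx) {
            out.add(vals.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> vals = List.of("a", "b", "c", "d", "e", "f");
        System.out.println(shuffle("part-00000", vals));
        System.out.println(shuffle("part-00000", vals)); // same order on every run
    }
}
```

Because the key depends only on the file path and the record's position, rerunning the job over unchanged input reproduces the same shuffled order.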
> Important notes about this patch:
> * It should be applied _after_ the patches for DATAFU-46 and DATAFU-48.
> * -(It expands the dependence on Guava. Does pull req 75
mean there's momentum to de-Guava datafu?)- 
> * -(The patch has (commented out) code that shows what life would be like if the sip24,
crc32 and adler32 hashes were available. On your advice, I will either (a) put in a patch
removing the spurious comments or (b) file a separate bug to update guava, push in a patch
for that, and put in a patch restoring to glory the extra hashes.)-

This message was sent by Atlassian JIRA
