datafu-dev mailing list archives

From: Matthew Hayes
Subject: Apache DataFu (incubating) 1.3.0 released
Date: Tue, 17 Nov 2015 17:59:33 GMT

Hi all,

I'd like to announce the release of Apache DataFu (incubating) 1.3.0.  This
is the first release since entering the Apache incubator.  Thanks to all
who contributed!

Apache DataFu is a collection of libraries for working with large-scale
data in Hadoop. The project was inspired by the need for stable,
well-tested libraries for data mining and statistics.  It consists of two
libraries: Apache DataFu Pig, a collection of user-defined functions for
Apache Pig, and Apache DataFu Hourglass, an incremental processing
framework for Apache Hadoop in MapReduce.

You can obtain the source release from:

Please follow the README for instructions on building.  A summary of
changes for 1.3.0 appears below.

Additions:

* New UDFs for entropy and weighted sampling algorithms (DATAFU-2,
* Updated SimpleRandomSample to be consistent with
SimpleRandomSampleWithReplacement (DATAFU-5)
* Created OpenNLP UDF wrappers (DATAFU-8)
* Created RandomUUID UDF (DATAFU-18)
* Added LSH implementation (DATAFU-37)
* Added Base64Encode/Decode (DATAFU-52)
* Created SelectFieldByName UDF (DATAFU-69)
* Added generic BagJoin that supports inner, left, and full outer joins
* Added ZipBags UDF which can zip an arbitrary number of bags into one
* Hadoop 2.0 compatibility (DATAFU-58)
* Created file (DATAFU-92)

Improvements:

* Simplified BagGroup output (DATAFU-42)

Changes:

* StagedOutputJob no longer writes counters by default (DATAFU-35)

Fixes:

* ReservoirSample does not behave as expected when grouping by a key other
than ALL (DATAFU-11)
* DistinctBy does not work correctly on strings containing minuses
* Hourglass does not honor "fail on missing" in all cases (DATAFU-35)
* Hash UDFs return zero-padded strings of uniform length even when leading
bits are zero (DATAFU-46)
* UDF examples work again (DATAFU-49)
* SampleByKey can throw NullPointerException (DATAFU-68)
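
To illustrate what "zero-padded strings of uniform length" means for the
hash item above (DATAFU-46), here is a minimal sketch in Python.  It is not
the DataFu implementation (DataFu is written in Java), and the function
names and the sample digest are hypothetical.  It only shows how rendering
a digest through its integer value drops leading zeros, and how padding to
two hex characters per byte keeps the length uniform.

```python
import hashlib


def to_hex_unpadded(digest: bytes) -> str:
    """Render a digest as hex via its integer value; leading zeros are lost."""
    return format(int.from_bytes(digest, "big"), "x")


def to_hex_padded(digest: bytes) -> str:
    """Render a digest as hex, zero-padded to two characters per byte."""
    width = 2 * len(digest)
    return format(int.from_bytes(digest, "big"), "0{}x".format(width))


# Hypothetical 16-byte digest whose leading bits are zero (illustration only).
sample_digest = bytes([0x00, 0x0A]) + bytes(range(2, 16))

unpadded = to_hex_unpadded(sample_digest)  # shorter than 32 characters
padded = to_hex_padded(sample_digest)      # always 32 characters

# A real MD5 digest is also rendered at a fixed width of 32 characters.
md5_hex = to_hex_padded(hashlib.md5(b"datafu").digest())
```

With the padded form, every 16-byte digest renders as exactly 32
characters, so hash strings can be compared or joined on without
length-dependent surprises.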

Build system:

* Removed legacy checked in jars (DATAFU-55)
* Updated to use Pig 0.12.1 (DATAFU-10)
* Switched from Ant to Gradle 1.12 (DATAFU-27, DATAFU-44, DATAFU-43,
* Removed checked in jars, download where necessary (DATAFU-55)
* Fixed to use gradlew (DATAFU-77)

Release related:

* NOTICE updated with dependencies used or shipped with DataFu.
* Apache license headers added to all necessary files (DATAFU-4, DATAFU-75)
* Added doap file (DATAFU-36)
* Source tarball generation, gradle bootstrapping, and release instructions
* Removed author tags (DATAFU-74)
* Resolved issues with build-plugin directory (DATAFU-76)
* Used Apache RAT to verify correct file headers (DATAFU-73, DATAFU-84)

Documentation related:

* New website (DATAFU-20, etc.)
* StreamingQuantile PDF link is broken (DATAFU-29)
* README file updated

