Matthew Hayes closed DATAFU2.

> UDFs for entropy and weighted sampling algorithms
> 
>
> Key: DATAFU2
> URL: https://issues.apache.org/jira/browse/DATAFU2
> Project: DataFu
> Issue Type: Task
> Reporter: Matthew Hayes
> Assignee: Matthew Hayes
> Fix For: 1.3.0
>
> Attachments: 0001createinitialversionofentroyUDFs.patch, 0002updateafewcommentsanderrormessages.patch,
0003fixabuginEntropy.accumulatetousegetFreqmetho.patch, 0004updateentropyimplementationfollowingcodereview.patch,
0005updatejavadocs.patch, 0006updatejavadocs.patch, 0007updatethejavadocsofstreamingempiricalentropya.patch,
0008updateentropyudfsbasedoncodereview.patch, 0009Implementandexperimentwithdifferentweightedsam.patch,
0010updateweightedreservoirsamplerconstructorunitt.patch, 0011updatelicenceheadersandmovestreamingentropyto.patch,
0012addmissinglicenceheader.patch
>
>
> Jian Wang has suggested that we add UDFs for entropy and weighted random sampling and
has implementations for each of these ready.
> In Jian's words:
> "In the real world, there are occasions when we need to calculate the entropy of discrete
random variables, for instance, to calculate the mutual information between variables X and
Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities).
I would suggest implementing a UDF to calculate the entropy of given input samples, following
the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
> This is the reference paper I use to learn about the weighted sampling algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> The present WeightedSample.java implements Algorithm D.
> We may try Algorithm A, A-Res and A-ExpJ, since they can be used in data stream and
distributed environments. These algorithms could be implemented on top of ReservoirSample.java (inherit
from this class?), since they also need a reservoir to store the selected items."
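
For readers unfamiliar with the entropy-based formula mentioned above, here is a minimal sketch in Python of empirical entropy and of mutual information computed as I(X;Y) = H(X) + H(Y) - H(X,Y). It is not the code from the attached patches; the function names `empirical_entropy` and `mutual_information` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def empirical_entropy(samples, base=math.e):
    """Entropy of the empirical distribution of `samples` (hypothetical helper).

    Uses H = -sum(p * log(p)), where p is the observed frequency of each value.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: empty input is treated as zero entropy
    h = 0.0
    for count in counts.values():
        p = count / total
        h -= p * math.log(p)
    return h / math.log(base)  # convert from nats to the requested base


def mutual_information(pairs, base=math.e):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a list of (x, y) samples."""
    pairs = list(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (empirical_entropy(xs, base)
            + empirical_entropy(ys, base)
            - empirical_entropy(pairs, base))
```

Note that this sketch holds all counts in memory, which a streaming or distributed UDF would need to avoid.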

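As a rough illustration of why A-Res suits data streams, the sketch below follows the scheme described in the referenced paper: each item with weight w gets a key u^(1/w), with u drawn uniformly from [0, 1), and the k items with the largest keys form the reservoir. It is written in Python rather than as a Pig UDF and is not the implementation from the attached patches; the function name `weighted_reservoir_sample` and its parameters are hypothetical.

```python
import heapq
import random


def weighted_reservoir_sample(weighted_items, k, rng=None):
    """One-pass weighted sampling without replacement (A-Res style sketch).

    `weighted_items` is an iterable of (item, weight) pairs. Items with a
    non-positive weight are skipped (an assumption made for this sketch).
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    reservoir = []  # min-heap of (key, arrival_index, item)
    for index, (item, weight) in enumerate(weighted_items):
        if weight <= 0:
            continue
        key = rng.random() ** (1.0 / weight)
        if len(reservoir) < k:
            heapq.heappush(reservoir, (key, index, item))
        elif key > reservoir[0][0]:
            # The new key beats the smallest key currently kept; replace it.
            heapq.heapreplace(reservoir, (key, index, item))
    return [item for _, _, item in reservoir]
```

Because only the k largest keys are kept, the reservoir never grows beyond k entries, and reservoirs built on separate partitions could in principle be merged by keeping the k largest keys overall (this sketch discards the keys on return, so a distributed version would need to retain them).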
