datafu-dev mailing list archives

From "Matthew Hayes (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-26) Resolve entropy UDF naming conventions
Date Sun, 16 Feb 2014 18:27:21 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902772#comment-13902772 ]

Matthew Hayes commented on DATAFU-26:
-------------------------------------

EmpiricalCountEntropy seems like an appropriate name to me. 

I think it's fine to rename StreamingEntropy to Entropy without adding some indication in the
name that it requires sorted input.  We should just make this clear in the documentation.  
For Quantile, on the first line of the documentation we put: "Computes quantiles for a sorted
input bag, using type R-2 estimation." where "sorted" is bold.  

Agree that detailed usage scenario explanations would help.

> Resolve entropy UDF naming conventions
> --------------------------------------
>
>                 Key: DATAFU-26
>                 URL: https://issues.apache.org/jira/browse/DATAFU-26
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>            Assignee: jian wang
>             Fix For: 1.3.0
>
>
> There are a couple issues with the naming of entropy UDFs that we should work out before
the next release.
> StreamingEntropy supports multiple estimation methods.  Entropy, however, only supports
empirical estimation.  The supported constructors are also different as a result.  Although Entropy's
documentation states it computes the empirical entropy, I think the name itself may lead to
confusion.  
> StreamingEntropy takes the data in sorted order.  Using this sorted data it computes
counts, which are then used to compute entropy.  Entropy, on the other hand, takes counts directly
and computes entropy.  These counts need to be computed before calling it.  Our convention
in DataFu has been that "Streaming" implies that the data does not need to be sorted.  So
StreamingEntropy is in conflict with this.
> My proposal is:
> 1) Rename Entropy to EmpiricalEntropy
> 2) Rename StreamingEntropy to Entropy
> 3) Clearly document why you would use EmpiricalEntropy over Entropy.  It will be more
efficient in some scenarios and we should explain this.
> One open question I have is whether we should distinguish in the name somehow that EmpiricalEntropy
accepts counts, not the actual items themselves.  EmpiricalCountBasedEntropy seems verbose.
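
To make the distinction in the description above concrete, here is a minimal Python sketch of the two input styles: one function takes precomputed counts, the other takes raw items in sorted order and derives the counts from runs of equal items. This is only an illustration of the idea, not DataFu's implementation; the function names entropy_from_counts and entropy_from_sorted and the base parameter are hypothetical, and only the empirical (plug-in) estimator is shown.

```python
import math
from itertools import groupby


def entropy_from_counts(counts, base=math.e):
    """Empirical (plug-in) entropy computed from per-item occurrence counts.

    Hypothetical helper for illustration; not part of DataFu.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = log(N) - (1/N) * sum(c_i * log(c_i)), where N is the total count
    h = math.log(total) - sum(c * math.log(c) for c in counts) / total
    return h / math.log(base)


def entropy_from_sorted(items, base=math.e):
    """Entropy computed from raw items that are already sorted.

    Sorting guarantees equal items are adjacent, so each run length is the
    count for one distinct item. Unsorted input gives a wrong answer.
    """
    run_lengths = (sum(1 for _ in run) for _, run in groupby(items))
    return entropy_from_counts(run_lengths, base)


if __name__ == "__main__":
    raw = ["a", "b", "a", "c", "a", "b"]
    print(entropy_from_sorted(sorted(raw), base=2))
    print(entropy_from_counts([3, 2, 1], base=2))
```

Both calls describe the same data, so they should print the same value. The counts-based form avoids sorting the raw items, which is presumably where the efficiency difference mentioned in item 3 comes from when counts are already available.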



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
