hivemall-issues mailing list archives

From helenahm <...@git.apache.org>
Subject [GitHub] incubator-hivemall pull request #93: Maximum Entropy Model
Date Sun, 02 Jul 2017 04:21:55 GMT
GitHub user helenahm opened a pull request:

    https://github.com/apache/incubator-hivemall/pull/93

    Maximum Entropy Model

    ## What changes were proposed in this pull request?
    
    A Distributed Max Entropy Model
    
    ## What type of PR is it?
    
    Feature
    
    ## What is the Jira issue?
    
    ?
    
    ## How was this patch tested?
    
    There are two tests at the moment: hivemall.smile.classification.MaxEntUDTFTest.java
    and hivemall.smile.tools.TreePredictUDFTest.java
    
    I have also tested the code on EMR:
    
    add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
    add jar opennlp-maxent-3.0.0.jar;
    source define-all.hive;
    create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
    create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
    drop table tmodel_maxent;
    CREATE TABLE tmodel_maxent 
    STORED AS SEQUENCEFILE 
    AS
    select 
      train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
    from
      t_test_maxent;
    
    create table tmodel_combined as
    select model, attributes, features, klass from t_test_maxent join tmodel_maxent;
    
    create table tmodel_predicted as
    select
    predict_maxent_classifier(model, attributes, features) result, klass from tmodel_combined;
    
    Source table:
    drop table t_test_maxent;
    create table t_test_maxent as select
    array( x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
    cast(tWord(x37) as double),
    cast(tWord(x38) as double),
    cast(tWord(x39) as double),
    cast(tWord(x40) as double),
    cast(tWord(x41) as double),
    cast(tWord(x42) as double),
    cast(tWord(x43) as double),
    cast(tWord(x44) as double),
    cast(contentWord(x45) as double),
    cast(contentWord(x46) as double),
    cast(contentWord(x47) as double),
    cast(contentWord(x48) as double),
    cast(contentWord(x49) as double),
    cast(contentWord(x50) as double),
    cast(contentWord(x51) as double),
    cast(contentWord(x52) as double),
    cast(contentWord(x53) as double),
    cast(presentationWord(x54) as double),
    cast(presentationWord(x55) as double),
    cast(presentationWord(x56) as double),
    cast(presentationWord(x57) as double),
    cast(presentationWord(x58) as double),
    cast(presentationWord(x59) as double),
    cast(presentationWord(x60) as double),
    cast(presentationWord(x61) as double),
    cast(presentationWord(x62) as double),
    x63,x64,x65,x66,x67,x68,x69,x70) features
    , klass from pdfs_and_tiffs_instances_combined_instances where regexp_replace(tp, 'T', '') == '76_698_855_347';
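
The `-attrs` value in the training query has to line up, position by position, with the 70 entries of the `features` array: 36 plain columns (x1 to x36), 26 columns passed through `tWord`, `contentWord`, or `presentationWord` (x37 to x62), and 8 more plain columns (x63 to x70). Typing 70 codes by hand is error-prone, so here is a minimal sketch of a helper that builds the string from run lengths. It assumes `Q` marks a quantitative attribute and `C` a categorical one, as in Hivemall's random forest options; `AttrsOption` and `buildAttrs` are hypothetical names and not part of this pull request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hivemall: builds the "-attrs" option string
// from runs of attribute type codes.
final class AttrsOption {

    /** Repeats types[i] counts[i] times, in order, and joins all codes with commas. */
    static String buildAttrs(String[] types, int[] counts) {
        if (types.length != counts.length) {
            throw new IllegalArgumentException("types and counts must have the same length");
        }
        List<String> codes = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            codes.addAll(Collections.nCopies(counts[i], types[i]));
        }
        return "-attrs " + String.join(",", codes);
    }

    public static void main(String[] args) {
        // Assumption: Q = quantitative, C = categorical.
        // 36 Q, then 26 C, then 8 Q, matching the features array in the source table.
        System.out.println(buildAttrs(new String[] {"Q", "C", "Q"}, new int[] {36, 26, 8}));
    }
}
```

The printed string can be pasted as the third argument of `train_maxent_classifier` in the training query.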
    
    
    ## How to use this feature?
    
    The Maximum Entropy classifier is, from my point of view, the most useful classification
    technique for many NLP tasks and for many tasks unrelated to NLP. It is used for
    part-of-speech tagging, named-entity recognition (NER), and other tasks.
    
    I have been searching for a distributed version of it and found only one article that
    discusses it: "Efficient Large Scale Distributed Training of Conditional Maximum Entropy
    Models" by Mehryar Mohri [quite well-known] and his colleagues at Google. (Please let me
    know how I can send you the article if you cannot find it by googling.) I therefore think
    it is time to implement it, and I plan to use the Mixture Weight Method they describe.
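
For reference, the idea can be written down as follows. This is my reading of the method rather than a quotation from the article, and the symbols (\Phi for the feature map, S_k for the data partitions, \mu_k for the mixture coefficients) are notation chosen here for illustration.

```latex
% Conditional maximum entropy model with weight vector w and feature map \Phi:
p_w(y \mid x) = \frac{\exp\bigl(w \cdot \Phi(x, y)\bigr)}
                     {\sum_{y'} \exp\bigl(w \cdot \Phi(x, y')\bigr)}

% Mixture weight method: split the training data into p partitions S_1, ..., S_p,
% train a separate weight vector w_k on each S_k, then combine the results:
w_\mu = \sum_{k=1}^{p} \mu_k \, w_k,
\qquad \mu_k \ge 0, \qquad \sum_{k=1}^{p} \mu_k = 1

% With uniform coefficients \mu_k = 1/p this reduces to a plain average:
w_\mu = \frac{1}{p} \sum_{k=1}^{p} w_k
```

Each partition can be trained independently, so the only communication needed is the final combination of the weight vectors.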
    
    A final UDAF (the one that collects all the models and averages their weights) is
    still to be implemented; I plan to commit it next week.
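
As a rough illustration of what that averaging step computes, here is a minimal standalone sketch. It is not the UDAF itself: it assumes every partial model has already been decoded into a `double[]` of weights sharing the same feature indexing, and it uses a uniform average. `MixtureWeightAverager` and `average` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the averaging step; not the UDAF from this pull request.
final class MixtureWeightAverager {

    /**
     * Returns the element-wise mean of the given weight vectors.
     * Assumption: index j refers to the same feature in every vector.
     */
    static double[] average(List<double[]> models) {
        if (models.isEmpty()) {
            throw new IllegalArgumentException("no models to average");
        }
        int dim = models.get(0).length;
        double[] mixed = new double[dim];
        for (double[] weights : models) {
            if (weights.length != dim) {
                throw new IllegalArgumentException("weight vectors differ in length");
            }
            for (int j = 0; j < dim; j++) {
                mixed[j] += weights[j];
            }
        }
        for (int j = 0; j < dim; j++) {
            mixed[j] /= models.size();
        }
        return mixed;
    }

    public static void main(String[] args) {
        // Two toy models, as if trained on two separate data partitions.
        List<double[]> models = Arrays.asList(
                new double[] {1.0, -2.0},
                new double[] {3.0, 0.0});
        System.out.println(Arrays.toString(average(models))); // prints [2.0, -1.0]
    }
}
```

A real UDAF would accumulate the running sum and the model count in its partial aggregation state and divide only in the final step, so that partial results from different mappers can be merged.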
    
    Please let me know whether you like the idea and would accept the code. It is based on
    Apache OpenNLP maxent, which is open source and written in a straightforward way.
    
    Regards,
    Elena.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/helenahm/incubator-hivemall master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #93
    
----
commit 45a656aa7278066ce3fc36fcd81fb1eca11f1079
Author: helenahm <helenahm@users.noreply.github.com>
Date:   2017-06-02T05:10:13Z

    Update LDAUDTFTest.java

commit fef9c1ce719d3924a28cc90d71d40728dc5c7563
Author: helenahm <helenahm@users.noreply.github.com>
Date:   2017-06-02T05:22:54Z

    Merge pull request #1 from helenahm/helenahm-patch-1
    
    Update LDAUDTFTest.java

commit e92b13aa3cb4fc193ea3da3fadd8a8fe8a6a073b
Author: AKHMATOVA, Elena <elena.akhmatova@suncorp.com.au>
Date:   2017-07-02T03:41:14Z

    maxent

commit d4031550f80007045353f1e24e58c99244ab3db3
Author: AKHMATOVA, Elena <elena.akhmatova@suncorp.com.au>
Date:   2017-07-02T03:49:16Z

    maxent cont.

commit f921d91fe8a1958cfd198236219c129355ef2fea
Author: AKHMATOVA, Elena <elena.akhmatova@suncorp.com.au>
Date:   2017-07-02T03:54:38Z

    maxent cont.

commit 2a712edfd9bbe765bb2781f84b519e283fe6bd56
Author: helenahm <helenahm@users.noreply.github.com>
Date:   2017-07-02T03:59:59Z

    Update LDAUDTFTest.java

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
