hivemall-issues mailing list archives

From helenahm <...@git.apache.org>
Subject [GitHub] incubator-hivemall issue #93: [WIP][HIVEMALL-126] Maximum Entropy Model usin...
Date Wed, 02 Aug 2017 06:52:51 GMT
Github user helenahm commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/93
  
    It will take some work.
    
    Let me explain.
    
    You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed,
they store data in [][] arrays, and in several copies. Using their code directly causes Java heap
space errors, GC errors, etc. (I tested that on my 97 million rows of data. The newer version of
the code has the same problems.) You were also right about the wonderful CSRMatrix, and DoKMatrix
too. They allow more data to be stored. Thus, more or less, I have changed all the [][] arrays
related to input data to CSRMatrix, and the [][] arrays holding weights to DoKMatrix.
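
    To make the memory argument concrete, here is a minimal sketch of the compressed sparse row idea. TinyCsr and all of its members are hypothetical names made up for illustration; this is not Hivemall's CSRMatrix API. The demo builds from a tiny dense array only to keep the example short, whereas real code would append rows one at a time and never hold the dense form.

```java
import java.util.Arrays;

/** Hypothetical minimal CSR holder for illustration; not Hivemall's CSRMatrix. */
final class TinyCsr {
    final int[] rowPtr;     // entries of row i live at positions rowPtr[i] .. rowPtr[i + 1] - 1
    final int[] colIdx;     // column index of each stored (nonzero) value
    final double[] values;  // the stored values themselves

    TinyCsr(double[][] dense) {
        // First pass: count nonzeros so the flat arrays can be sized exactly.
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) nnz++;
            }
        }
        rowPtr = new int[dense.length + 1];
        colIdx = new int[nnz];
        values = new double[nnz];
        // Second pass: copy only the nonzeros, recording where each row ends.
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    colIdx[k] = j;
                    values[k] = dense[i][j];
                    k++;
                }
            }
            rowPtr[i + 1] = k;
        }
    }

    int storedEntries() {
        return values.length;
    }

    /** Looks up one cell by scanning the stored entries of that row; absent cells are 0. */
    double get(int row, int col) {
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; k++) {
            if (colIdx[k] == col) return values[k];
        }
        return 0.0;
    }

    public static void main(String[] args) {
        double[][] dense = { {0, 0, 3, 0}, {0, 0, 0, 0}, {1, 0, 0, 2} };
        TinyCsr m = new TinyCsr(dense);
        System.out.println("dense cells: 12, stored entries: " + m.storedEntries());
        System.out.println("rowPtr = " + Arrays.toString(m.rowPtr));
    }
}
```

    With sparse feature rows, the three flat arrays grow with the number of nonzeros rather than with rows times columns, which is the property that matters at tens of millions of rows.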
    
    
    To explain that further, it is best to look at the source code of the GISTrainer, in fact
at all 3 of them: the old maxent, the new maxent, and Hivemall's BigGISTrainer. The links are below.
    
    Newer GISTrainer:
    https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java
    
    Older (3.0.0) GISTrainer:
    https://sourceforge.net/projects/maxent/files/ - whole archive
    GISTrainer attached:
    [GISTrainer.txt](https://github.com/apache/incubator-hivemall/files/1192806/GISTrainer.txt)
    
    Hivemall GISTrainer:
    https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java
    
    Notice how trainModel of BigGISTrainer gets a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java),
which contains references to a Matrix and to the outcomes. This Matrix is a CSRMatrix.
    
    And row data is collected from the CSRMatrix in MatrixForTraining instead of the double[][].

    
    This happens when the following line runs:
    ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), di.getOMap());
    (They use this convenience Event object to work with a row of data. Instead of storing
a List of Events in memory, the modified code builds an event when it is needed.)
    
    The results are then stored in
    Matrix predCount = new DoKMatrix(numPreds, numOutcomes);
    again instead of a [][] array.
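
    As a rough illustration of why a dictionary-of-keys layout suits predCount, the sketch below accumulates (predicate, outcome) counts in a hash map, so only the pairs that actually occur take memory. TinyDokCounts, countEvent, and the other names are hypothetical and chosen for this example; they are not Hivemall's DoKMatrix API, and the sketch is not the BigGISTrainer code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dictionary-of-keys count table for illustration; not Hivemall's DoKMatrix. */
final class TinyDokCounts {
    private final int numOutcomes;
    // Only cells that were touched get an entry; everything else is implicitly 0.
    private final Map<Long, Double> cells = new HashMap<>();

    TinyDokCounts(int numOutcomes) {
        this.numOutcomes = numOutcomes;
    }

    /** Packs (predicate, outcome) into a single long so it can serve as the map key. */
    private long key(int pred, int outcome) {
        return (long) pred * numOutcomes + outcome;
    }

    void add(int pred, int outcome, double delta) {
        cells.merge(key(pred, outcome), delta, Double::sum);
    }

    double get(int pred, int outcome) {
        return cells.getOrDefault(key(pred, outcome), 0.0);
    }

    int storedCells() {
        return cells.size();
    }

    /** Adds one training row: its active predicate ids, its outcome id, and how often it was seen. */
    void countEvent(int[] activePreds, int outcome, int timesSeen) {
        for (int p : activePreds) {
            add(p, outcome, timesSeen);
        }
    }

    public static void main(String[] args) {
        TinyDokCounts predCount = new TinyDokCounts(2);
        // Rows are handled one at a time, so no list of events has to stay in memory.
        predCount.countEvent(new int[] {0, 3}, 1, 2);
        predCount.countEvent(new int[] {3}, 1, 1);
        predCount.countEvent(new int[] {2}, 0, 1);
        System.out.println("count(3, 1) = " + predCount.get(3, 1));
        System.out.println("stored cells = " + predCount.storedCells());
    }
}
```

    The trade-off is per-entry overhead from boxing and hashing, so this layout pays off only when most (predicate, outcome) pairs never occur.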
    
    GISTrainer did not change very dramatically. If 3.0.0 training is reliable enough, I would,
of course, consider the existing version as 1.0, and make the effort to adapt GISTrainer
later on. It makes sense to do that, I totally agree. And perhaps it makes sense to continue
after that by understanding the training process in greater detail, and perhaps to write a newer
comparable trainer that is independent of OpenNLP.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
