mahout-user mailing list archives

From Loek Cleophas <loek.cleop...@kalooga.com>
Subject Issues with memory use and inconsistent or state-influenced results when using CBayesAlgorithm
Date Tue, 30 Mar 2010 17:15:39 GMT
Hi,

After my initial experiments with Mahout's Bayes and CBayes
implementations on my company's dataset, we're now trying to integrate
Mahout to classify our data in a production environment. However,
after successfully training a classifier (using CBayes), we are
running into two odd issues.

We're loading the trained model into an InMemoryBayesDatastore, and
are able to get classification results (i.e. categories plus weights).
The two issues are:

1) The classifier's memory use increases each time it classifies a
document; as a result, after classifying a number of documents, we
run into memory issues.
2) Classification is not consistent: e.g. if we classify texts 1, 2,
3, and then 1 again, the second time text 1 is fed in we get slightly
different weights. The difference is small, but too large to dismiss
as floating-point rounding. If we classify text 1 and then 1 again
without classifying any other texts in between, the weights do not
change.

My colleagues and I have looked at the Mahout code, and it seems that
the memory use increase is due to getLabelID in InMemoryBayesDatastore,
which adds a label to a dictionary if it's not in there yet, but
never seems to remove any labels from the dictionary. Could this be
the source of the memory issue? I can imagine that adding words that
were not in the model but occur in the text being classified would
increase memory use, but that probably shouldn't be happening (as
it's classification, not training).
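
To illustrate the pattern we suspect, here is a simplified sketch of a
dictionary whose lookup inserts on a miss, next to a read-only lookup.
This is not Mahout's actual code; GrowingDictionary, getOrAddId and
lookupId are hypothetical names, and the -1 sentinel is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a string-to-id dictionary; not Mahout's code.
class GrowingDictionary {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    // Lookup that inserts on a miss: every previously unseen string adds
    // an entry that is never removed, so the map grows without bound.
    int getOrAddId(String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size();
            ids.put(key, id);
        }
        return id;
    }

    // Read-only lookup: unknown strings return -1 and the map is untouched.
    int lookupId(String key) {
        Integer id = ids.get(key);
        return id == null ? -1 : id;
    }

    int size() {
        return ids.size();
    }
}
```

With the first method, feeding texts that contain unseen strings makes
size() grow on every miss; with the second it stays constant, which is
what one would expect during classification.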

Any thoughts on these two issues, whether they're related, and what to  
do about them?

Robin, I suspect/hope you're able to help here?

Regards,
Loek
