mahout-user mailing list archives

From Chris Schilling <chris.schill...@gmail.com>
Subject Re: sgd.TrainNewsGroups error
Date Thu, 30 Dec 2010 23:11:27 GMT
Hello Ivek,

I wanted to compare the evaluation output against some test samples.  I reworked TrainNewsGroups
to simplify it and remove all the target leaks, and I added this simple function so that I can
compare its accuracy with the results from the exponentially weighted averaging used to evaluate
during training.

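	// Scores the model on files 2501..5000 and prints the fraction classified correctly.
	// Assumes files has at least 5001 entries.
	private static void testClassifier(List<File> files, CrossFoldLearner model) throws IOException {
		int first = 2501;
		int last = 5000;
		int ncorrect = 0;
		for (int i = first; i <= last; ++i) {
			File test = files.get(i);
			Vector instance = encodeFeatureVector(test);
			// classifyFull returns one score per category
			Vector testV = model.classifyFull(instance);
			int nmax = testV.maxValueIndex();
			//System.out.println(testV.maxValue());
			String classified = newsGroups.values().get(nmax);
			// the true label is the name of the directory containing the file
			String target = test.getParentFile().getName();
			if (target.equals(classified)) ++ncorrect;
		}
		// divide by the number of files actually tested rather than a hard-coded count
		System.out.println(ncorrect / (double) (last - first + 1));
	}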

You can adapt to your needs...
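
For anyone adapting this without a Mahout build at hand, here is a minimal, library-free sketch of the same comparison: plain accuracy on a test range next to an exponentially weighted running average of per-example correctness. The class name, the method names and the smoothing rule (avg += alpha * (hit - avg)) are assumptions made for illustration; they are not taken from Mahout and may differ from what CrossFoldLearner tracks internally.

```java
public final class HoldOutAccuracy {

    /** Index of the largest score, i.e. the predicted class. */
    public static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Fraction of examples whose arg-max score matches the true label. */
    public static double accuracy(double[][] scores, int[] labels) {
        if (scores.length != labels.length || labels.length == 0) {
            throw new IllegalArgumentException("need one label per example, and at least one example");
        }
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (argMax(scores[i]) == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    /**
     * Exponentially weighted running average of correctness (1 = hit, 0 = miss).
     * The update rule is an assumption chosen for illustration.
     */
    public static double ewmaAccuracy(double[][] scores, int[] labels, double alpha) {
        if (scores.length != labels.length || labels.length == 0) {
            throw new IllegalArgumentException("need one label per example, and at least one example");
        }
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        // Seed the average with the first example, then smooth the rest.
        double avg = argMax(scores[0]) == labels[0] ? 1.0 : 0.0;
        for (int i = 1; i < labels.length; i++) {
            double hit = argMax(scores[i]) == labels[i] ? 1.0 : 0.0;
            avg += alpha * (hit - avg);
        }
        return avg;
    }

    public static void main(String[] args) {
        // Made-up scores for four examples and two classes.
        double[][] scores = {{0.1, 0.9}, {0.8, 0.2}, {0.3, 0.7}, {0.6, 0.4}};
        int[] labels = {1, 0, 0, 0};
        System.out.println("accuracy = " + accuracy(scores, labels));
        System.out.println("ewma     = " + ewmaAccuracy(scores, labels, 0.5));
    }
}
```

The running average weights recent examples more heavily, so it can differ from the plain accuracy on a fixed test range even when both are computed from the same predictions.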


On Dec 22, 2010, at 12:02 PM, ivek gimmick wrote:

> Ted,
> 
>   Is there a sample program to test the model that we generate using
> TrainNewsGroups.java?
> 
> 
> On Fri, Dec 10, 2010 at 11:50 AM, ivek gimmick <gimmickivek@gmail.com> wrote:
> 
>> Oops, sorry for not posting the stack trace.  And yeah, I know the results
>> will be nonsense, I just wanted to get the hang of what is happening with the
>> print statements :)
>> 
>> and here you go!
>> 
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> at java.util.LinkedList.addBefore(LinkedList.java:778)
>> at java.util.LinkedList.add(LinkedList.java:198)
>> at com.google.gson.JsonArray.add(JsonArray.java:51)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:223)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:212)
>> at com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:445)
>> at com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:431)
>> at com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at com.google.gson.JsonSerializationVisitor.getJsonElementForChild(JsonSerializationVisitor.java:117)
>> at com.google.gson.JsonSerializationVisitor.addAsChildOfObject(JsonSerializationVisitor.java:95)
>> at com.google.gson.JsonSerializationVisitor.visitObjectField(JsonSerializationVisitor.java:90)
>> at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:40)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:335)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:289)
>> at com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:377)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:341)
>> at com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at org.apache.mahout.classifier.sgd.ModelSerializer$AdaptiveLogisticRegressionTypeAdapter.serialize(ModelSerializer.java:191)
>> 
>> 
>> On Fri, Dec 10, 2010 at 11:33 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> 
>>> Running with only two files (aka two documents) is likely to lead to
>>> nonsense, but shouldn't lead to a crash.
>>> 
>>> On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gimmickivek@gmail.com> wrote:
>>> 
>>>> I am trying to understand the flow of TrainNewsGroups.java.  To do this, I
>>>> just used 2 files from TwentyNewsGroups as input files.
>>>> 
>>>> The code runs and prints "exiting main", after which it takes a loooot of
>>>> time and errors out saying java heap space error.
>>>> 
>>> 
>>> The problem here is twofold:
>>> 
>>> - first, without seeing these errors I am shooting in the dark.  If you were
>>> to include them, I could say more.
>>> 
>>> - second, I used GSON to serialize the model.  Big mistake.  I have since
>>> implemented a bunch of changes to allow SGD models
>>> and all related classes to be considered writables.  I also extended
>>> ModelSerializer to handle that case.  I need to check to see
>>> if I have committed those changes.  That said, you shouldn't have seen
>>> errors or excessive heap space requirements writing the model, just reading
>>> it back in.
>>> 
>>> It is also possible that since you haven't filled the high level buffer in
>>> the AdaptiveLogisticRegression, the lower level learners may be having some
>>> problems producing a model since they haven't seen any data yet.
>>> 
>>>> Is there a bug somewhere?
>>>> 
>>> 
>>> Well, I consider my use of GSON for a large data structure to be a mistake.
>>> :-)
>>> 
>> 
>> 
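
On the serialization point in the quoted thread: the stack trace shows the heap running out while a matrix was being added element by element to a JsonArray, that is, while an in-memory JSON tree was being built. The sketch below illustrates the alternative idea of streaming a dense matrix as raw binary through DataOutput, in the spirit of the Writable approach mentioned above. It is an illustration only; the class name, the method names and the on-disk layout (two ints followed by row-major doubles) are assumptions and are not Mahout's actual format or API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public final class BinaryMatrixIO {

    /** Streams rows, cols, then every value in row-major order; no intermediate tree is built. */
    public static void write(double[][] m, DataOutput out) throws IOException {
        int rows = m.length;
        int cols = rows == 0 ? 0 : m[0].length;
        out.writeInt(rows);
        out.writeInt(cols);
        for (double[] row : m) {
            if (row.length != cols) {
                throw new IllegalArgumentException("ragged matrix");
            }
            for (double v : row) {
                out.writeDouble(v);
            }
        }
    }

    /** Reads back a matrix written by write(). */
    public static double[][] read(DataInput in) throws IOException {
        int rows = in.readInt();
        int cols = in.readInt();
        double[][] m = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                m[r][c] = in.readDouble();
            }
        }
        return m;
    }

    /** Convenience wrapper for in-memory round trips. */
    public static byte[] toBytes(double[][] m) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            write(m, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Convenience wrapper for in-memory round trips. */
    public static double[][] fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][] m = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};
        byte[] bytes = toBytes(m);
        double[][] back = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, back[1][2] = " + back[1][2]);
    }
}
```

With this layout the encoded size is 8 bytes of header plus 8 bytes per value, and writing to a file or socket stream needs only constant extra memory, whereas a tree-based JSON encoder allocates at least one object per matrix element before anything is written.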

