mahout-commits mailing list archives

From apalu...@apache.org
Subject mahout git commit: MAHOUT-1564: Naive Bayes Classifier for New Text Documents closes apache/mahout#91
Date Wed, 01 Apr 2015 21:22:36 GMT
Repository: mahout
Updated Branches:
  refs/heads/master 3eb9fdf92 -> 441460e77


MAHOUT-1564: Naive Bayes Classifier for New Text Documents closes apache/mahout#91


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/441460e7
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/441460e7
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/441460e7

Branch: refs/heads/master
Commit: 441460e77cd38acc684cb2351dad5f0e6156c1f0
Parents: 3eb9fdf
Author: Andrew Palumbo <apalumbo@apache.org>
Authored: Wed Apr 1 17:20:39 2015 -0400
Committer: Andrew Palumbo <apalumbo@apache.org>
Committed: Wed Apr 1 17:21:47 2015 -0400

----------------------------------------------------------------------
 CHANGELOG                                     |   6 +-
 examples/bin/spark-document-classifier.mscala | 195 +++++++++++++++++++++
 2 files changed, 200 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mahout/blob/441460e7/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
index 3c29278..45869e4 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,7 +1,11 @@
 Mahout Change Log
 
 Release 0.10.0 - unreleased
-
+  
+  MAHOUT-1564: Naive Bayes Classifier for New Text Documents (apalumbo)
+  
+  MAHOUT-1524: Script to auto-generate and view the Mahout website on a local machine (Saleem Ansari via apalumbo)
+  
   MAHOUT-1589: Deprecate mahout.cmd due to lack of support
 
  MAHOUT-1655: Refactors mr-legacy into mahout-hdfs and mahout-mr, Spark now depends on much reduced mahout-hdfs

http://git-wip-us.apache.org/repos/asf/mahout/blob/441460e7/examples/bin/spark-document-classifier.mscala
----------------------------------------------------------------------
diff --git a/examples/bin/spark-document-classifier.mscala b/examples/bin/spark-document-classifier.mscala
new file mode 100644
index 0000000..9700253
--- /dev/null
+++ b/examples/bin/spark-document-classifier.mscala
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+*/
+ 
+/*
+ * Binary Naive Bayes classifier (United States, United Kingdom) example for an out-of-sample document based
+ * on a model trained on the wikipedia xml dump: 
+ *
+ * NOTE: As of version 0.10.0 Mahout uses MapReduce seq2sparse to vectorize large text corpora.
+ * 
+ * To run this example first run :
+ *    $MAHOUT_HOME/examples/bin/classify-wikipedia.sh --> option 2 
+ *
+ * then from the mahout spark-shell:
+ *    :load $MAHOUT_HOME/examples/bin/spark-document-classifier.mscala
+*/
+ 
+import org.apache.mahout.classifier.naivebayes._
+import org.apache.mahout.classifier.stats._
+import org.apache.mahout.nlp.tfidf._
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.io.IntWritable
+import org.apache.hadoop.io.LongWritable
+
+val pathToData = "/tmp/mahout-work-wiki/"
+
+// read in our full set as vectorized by seq2sparse in classify-wikipedia.sh
+val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
+//val trainData = drmDfsRead(pathToData + "training")
+//val testData = drmDfsRead(pathToData + "testing")
+
+// build a standard NaiveBayes model using the full dataset (training + testing)
+val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
+val model = NaiveBayes.train(aggregatedObservations, labelIndex, false) 
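+// (a note on the two calls above: extractLabelsAndAggregateObservations reads the
+// category label out of each row key and sums the term-count rows per label, and
+// train turns those aggregates into per-label weights; the `false` argument requests
+// standard rather than complementary training)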
+
+// self test on the full set 
+val resAnalyzer = NaiveBayes.test(model, fullData, false)
+
+// display the confusion matrix
+println(resAnalyzer)
+
+// read in the dictionary and document frequency count
+val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0", classOf[Text], classOf[IntWritable])
+val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count", classOf[IntWritable], classOf[LongWritable])
+
+// set up the dictionary and document frequency count as maps
+val dictionaryRDD = dictionary.map { case (wKey, wVal) => wKey.asInstanceOf[Text].toString() -> wVal.get() }
+val documentFrequencyCountRDD = documentFrequencyCount.map { case (wKey, wVal) => wKey.asInstanceOf[IntWritable].get() -> wVal.get() }
+
+val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
+val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
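+
+// (optional sanity check, with a hypothetical term) confirm the two maps line up:
+// dictionaryMap.get("united").foreach(i => println(s"'united' -> $i, df = ${dfCountMap(i)}"))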
+
+// for this simple example, tokenize our document into unigrams using native string methods and
+// vectorize using our dictionary and document frequencies. You could also use a Lucene analyzer
+// for bigrams, trigrams, etc.
+def vectorizeDocument(document: String,
+		      dictionaryMap: Map[String,Int],
+		      dfMap: Map[Int,Long]): Vector = {
+  
+  val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ").toLowerCase.split(" ").groupBy(identity).mapValues(_.length)
+
+  val vec = new RandomAccessSparseVector(dictionaryMap.size) 
+  
+  val totalDFSize = dfMap(-1) // seq2sparse stores the corpus document count under key -1
+  val docSize = wordCounts.size // number of distinct terms in this document
+  
+  for (word <- wordCounts) {    
+    val term = word._1
+    if (dictionaryMap.contains(term)) {
+      val tfidf: TFIDF = new TFIDF()
+      val termFreq = word._2
+      val dictIndex = dictionaryMap(term)
+      val docFreq = dfCountMap(dictIndex)
+      val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)
+      vec.setQuick(dictIndex, currentTfIdf)
+    }
+  }
+  vec
+}
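+
+// a quick smoke test of vectorizeDocument (the input string is arbitrary); only terms
+// found in the dictionary get a nonzero TF-IDF weight
+// val toyVec = vectorizeDocument("football game in the united states", dictionaryMap, dfCountMap)
+// println("nonzero entries: " + toyVec.getNumNondefaultElements)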
+
+val labelMap = model.labelIndex
+val numLabels = model.numLabels
+val reverseLabelMap = labelMap.map(x => x._2 -> x._1)
+
+// instantiate the correct type of classifier
+val classifier = model.isComplementary match {
+  case true => new ComplementaryNBClassifier(model) 
+  case _ => new StandardNBClassifier(model) 
+}
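+
+// (model.isComplementary reflects the training flag passed to NaiveBayes.train above,
+// so with `false` we expect the StandardNBClassifier branch here)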
+    
+// the label with the highest score wins the classification for a given document
+def argmax(v: Vector): (Int, Double) = {
+  var bestIdx: Int = Integer.MIN_VALUE
+  var bestScore: Double = Integer.MIN_VALUE.toDouble
+  for(i <- 0 until v.size) {
+    if(v(i) > bestScore){
+      bestScore = v(i)
+      bestIdx = i
+    }
+  }
+  (bestIdx, bestScore)
+}
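+
+// e.g., on a small dense vector (values arbitrary, assuming the spark-shell's default
+// org.apache.mahout.math imports): argmax(new DenseVector(Array(0.3, 0.9, 0.1))) == (1, 0.9)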
+  
+// our final classifier
+def classifyDocument(clvec: Vector): String = {
+  val cvec = classifier.classifyFull(clvec)
+  val (bestIdx, bestScore) = argmax(cvec)
+  reverseLabelMap(bestIdx)
+}   
+
+// A random United States football article
+//http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
+val UStextToClassify = new String("(Reuters) - Super Bowl security officials acknowledge the NFL championship game represents" +
+  " a high profile target on a world stage but are unaware of any specific credible threats against" +
+  " Sunday's showcase. In advance of one of the world's biggest single day sporting events, Homeland" +
+  " Security Secretary Jeh Johnson was in Glendale on Wednesday to review security preparations and" +
+  " tour University of Phoenix Stadium where the Seattle Seahawks and New England Patriots will battle." +
+  " Deadly shootings in Paris and arrest of suspects in Belgium, Greece and Germany heightened fears of" +
+  " more attacks around the world and social media accounts linked to Middle East militant groups have" +
+  " carried a number of threats to attack high-profile U.S. events. There is no specific credible" +
+  " threat, said Johnson, who has appointed a federal coordination team to work with local, state and" +
+  " federal agencies to ensure safety of fans, players and other workers associated with the Super Bowl." +
+  " I'm confident we will have a safe and secure and successful event. Sunday's game has been given a" +
+  " Special Event Assessment Rating (SEAR) 1 rating, the same as in previous years, except for the year" +
+  " after the Sept. 11, 2001 attacks, when a higher level was declared. But security will be tight and" +
+  " visible around Super Bowl-related events as well as during the game itself. All fans will pass through" +
+  " metal detectors and pat downs. Over 4,000 private security personnel will be deployed and the almost" +
+  " 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear device sniffing teams will be" +
+  " deployed and a network of Bio-Watch detectors will be set up to provide a warning in the event of " +
+  " a biological attack. The Department of Homeland Security (DHS) said in a press release it had held " +
+  " special cyber-security and anti-sniper training sessions. A U.S. official said the Transportation " +
+  " Security Administration, which is responsible for screening airline passengers, will add screeners " +
+  " and checkpoint lanes at airports. Federal air marshals, behavior detection officers and dog teams " +
+  " will help to secure transportation systems in the area. We will be ramping it (security) up on Sunday," +
+  " there is no doubt about that, said Federal Coordinator Matthew Allen, the DHS point of contact for " +
+  " planning and support. I have every confidence the public safety agencies that represented in the " +
+  " planning process are going to have their best and brightest out there this weekend and we will have" +
+  " a very safe Super Bowl.")
+
+// A random United Kingdom football article
+// http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
+val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship deal with online financial trading company" +
+  " Swissquote, expanding the commercial partnerships that have helped to make the English club one of" +
+  " the richest teams in world soccer. United did not give a value for the deal, the club's first in the" +
+  " sector, but said on Monday it was a multi-year agreement. The Premier League club, 20 times English" +
+  " champions, claim to have 659 million followers around the globe, making the United name attractive to" +
+  " major brands like Chevrolet cars and sportswear group Adidas. Swissquote said the global deal would" +
+  " allow it to use United's popularity in Asia to help it meet its targets for expansion in China. Among" +
+  " benefits from the deal, Swissquote's clients will have a chance to meet United players and get behind" +
+  " the scenes at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that allows" +
+  " retail investors to buy and sell foreign exchange, equities, bonds and other asset classes. Like other" +
+  " retail FX brokers, Swissquote was left nursing losses on the Swiss franc after Switzerland's central bank" +
+  " stunned markets this month by abandoning its cap on the currency. The fallout from the abrupt move put rival" +
+  " and West Ham United shirt sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 " +
+  " million Swiss francs ($28 million) provision for its clients who were left out of pocket following the" +
+  " franc's surge. United's ability to grow revenues off the pitch has made them the second richest club in" +
+  " the world behind Spain's Real Madrid, despite a downturn in their playing fortunes. United Managing" +
+  " Director Richard Arnold said there was still lots of scope for United to develop sponsorships in" +
+  " other areas of business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
+  " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a huge potential" +
+  " still for the club, and the other thing we have seen is there is very significant growth even within" +
+  " categories. United have endured a tricky transition following the retirement of manager Alex Ferguson" +
+  " in 2013, finishing seventh in the Premier League last season and missing out on a place in the lucrative" +
+  " Champions League. ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima Kelly;" +
+  " editing by Keith Weir)")
+
+val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
+val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
+
+println("Classifing the news article about the superbowl (united states)")
+classifyDocument(usVec)
+
+println("Classifing the news article about the Manchester United (united kingdom)")
+classifyDocument(ukVec)
+
+// to classify new text, simply run this method on a string
+def classifyText(txt: String): String = {
+  val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
+  classifyDocument(v)
+}
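+
+// example usage, on any ad-hoc string:
+// classifyText("Manchester United played at Old Trafford on Sunday")
+// classifyText("The Super Bowl was held in Glendale, Arizona")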
+  
+