svn commit: r948822 - in /websites/staging/mahout/trunk/content: ./ users/environment/classify-a-doc-from-the-shell.html

Author: buildbot
Date: Thu Apr 23 00:57:55 2015
New Revision: 948822

Log: Staging update by buildbot for mahout

Added:    websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
Modified: websites/staging/mahout/trunk/content/ (props changed)
Propchange: cms:source-revision changed from 1675192 to 1675527

Apache Mahout: Scalable machine learning and data mining

Classifying a Document with the Mahout Shell


This tutorial assumes that you have Spark configured and can start the Mahout spark-shell; see Playing with Mahout's Shell. We also assume that Mahout is running in cluster mode (i.e. with the MAHOUT_LOCAL environment variable unset), so that all output is written to HDFS.


Downloading and Vectorizing the Wikipedia Dataset


As of Mahout v0.10.0, we are still reliant on the MapReduce versions of mahout seqwiki and mahout seq2sparse to extract and vectorize our text. A Spark implementation of seq2sparse is in the works for Mahout v0.11. However, to download the Wikipedia dataset, extract the bodies of the documents, label each document, and vectorize the text into TF-IDF vectors, we can simply run the classify-wikipedia.sh example.

Please select a number to choose the corresponding task to run
1. CBayes (may require increased heap space on yarn)
2. BinaryCBayes
3. clean -- cleans up the work area in /tmp/mahout-work-wiki
Enter your choice :

Enter (2). This will download a large recent XML dump of the Wikipedia database into a /tmp/mahout-work-wiki directory, unzip it, and place it into HDFS. It will run a MapReduce job to parse the Wikipedia set, extracting and labeling only pages with category tags for [United States] and [United Kingdom]. It will then run mahout seq2sparse to convert the documents into TF-IDF vectors. The script will also build and test a Naive Bayes model using MapReduce. When it completes, you should see a confusion matrix on your screen. For this tutorial, we will ignore the MapReduce model and build a new model using Spark, based on the vectorized data created by seq2sparse.


Getting Started


Launch the Mahout spark-shell. There is an example script: spark-document-classifier.mscala (.mscala denotes a Mahout-Scala script, which can be run in a manner similar to an R script). We will be walking through this script in this tutorial, but if you wanted to simply run it, you could issue the command:

mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala

For now, let's take the script apart piece by piece.


Imports


Our Mahout Naive Bayes imports:

import org.apache.mahout.classifier.naivebayes._
import org.apache.mahout.classifier.stats._
import org.apache.mahout.nlp.tfidf._

Hadoop imports needed to read our dictionary:

import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable

read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh

val pathToData = "/tmp/mahout-work-wiki/"
val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
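
Before moving on, it can be worth a quick sanity check that the DRM was actually read. A minimal check of its shape (the exact counts depend on the Wikipedia dump and the categories selected) could be:

// rows are documents, columns are dictionary terms in the TF-IDF matrix
println("documents: " + fullData.nrow + ", dictionary terms: " + fullData.ncol)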

extract the category of each observation and aggregate those observations by category

val (labelIndex, aggregatedObservations) =
    SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
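
If you want to confirm which categories were extracted, you can print the label index and the shape of the aggregated matrix (the exact label strings depend on how classify-wikipedia.sh labeled the documents):

// one aggregated row per category; labelIndex maps each label to its row index
println(labelIndex)
println("aggregated label rows: " + aggregatedObservations.nrow)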

build a Multinomial Naive Bayes model and self-test on the training set

val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
println(resAnalyzer)

printing the result analyzer will display the confusion matrix


read in the dictionary and document frequency count from HDFS

val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
                                  classOf[Text],
                                  classOf[IntWritable])
val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count",
                                              classOf[IntWritable],
                                              classOf[LongWritable])

// set up the dictionary and document frequency count as maps
val dictionaryRDD = dictionary.map {
    case (wKey, wVal) => wKey.asInstanceOf[Text].toString() -> wVal.get()
}

val documentFrequencyCountRDD = documentFrequencyCount.map {
    case (wKey, wVal) => wKey.asInstanceOf[IntWritable].get() -> wVal.get()
}

val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
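
As a quick spot check that the two maps line up, you can look up a term you expect to occur in the corpus; "football" below is only an assumed example and may not be present in your dictionary:

// hypothetical spot check: dictionary index and document frequency for one term
dictionaryMap.get("football") match {
    case Some(idx) => println("football -> index " + idx + ", df = " + dfCountMap(idx))
    case None      => println("'football' is not in the dictionary")
}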

define a function to tokenize and vectorize new text using our current dictionary


For this simple example, our function vectorizeDocument(...) will tokenize a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a Lucene analyzer for bigrams, trigrams, etc. (see the sketch after the function below), and integrate Apache Tika to extract text from different document types (PDF, PPT, XLS, etc.). Here, however, we will keep it simple and split our text using regexes and native String methods.

def vectorizeDocument(document: String,
                      dictionaryMap: Map[String,Int],
                      dfMap: Map[Int,Long]): Vector = {
    val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                             .toLowerCase
                             .split(" ")
                             .groupBy(identity)
                             .mapValues(_.length)
    val vec = new RandomAccessSparseVector(dictionaryMap.size)
    val totalDFSize = dfMap(-1)
    val docSize = wordCounts.size
    for (word <- wordCounts) {
        val term = word._1
        if (dictionaryMap.contains(term)) {
            val tfidf: TermWeight = new TFIDF()
            val termFreq = word._2
            val dictIndex = dictionaryMap(term)
            val docFreq = dfMap(dictIndex)   // use the dfMap argument rather than the outer dfCountMap
            val currentTfIdf = tfidf.calculate(termFreq,
                                               docFreq.toInt,
                                               docSize,
                                               totalDFSize.toInt)
            vec.setQuick(dictIndex, currentTfIdf)
        }
    }
    vec
}
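
As a rough illustration of the Lucene route mentioned above, a tokenizer based on Lucene's StandardAnalyzer could look like the sketch below. This is not part of the example script; it assumes a Lucene jar is on the shell's classpath, and older Lucene 4.x versions may require passing a Version (e.g. Version.LUCENE_46) to the analyzer constructor. The resulting unigrams could be fed into the same word-count logic used above.

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ArrayBuffer

// tokenize a document with Lucene instead of String.split
def luceneTokenize(text: String): Seq[String] = {
    val analyzer = new StandardAnalyzer()
    val stream = analyzer.tokenStream("body", text)
    val termAttr = stream.addAttribute(classOf[CharTermAttribute])
    val tokens = ArrayBuffer[String]()
    stream.reset()
    while (stream.incrementToken()) {
        tokens += termAttr.toString
    }
    stream.end()
    stream.close()
    analyzer.close()
    tokens
}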

set up our classifier

val labelMap = model.labelIndex
val numLabels = model.numLabels
val reverseLabelMap = labelMap.map(x => x._2 -> x._1)

// instantiate the correct type of classifier
val classifier = model.isComplementary match {
    case true => new ComplementaryNBClassifier(model)
    case _ => new StandardNBClassifier(model)
}

define an argmax function


The label with the highest score wins the classification for a given document.

def argmax(v: Vector): (Int, Double) = {
    var bestIdx: Int = Integer.MIN_VALUE
    var bestScore: Double = Integer.MIN_VALUE.asInstanceOf[Int].toDouble
    for (i <- 0 until v.size) {
        if (v(i) > bestScore) {
            bestScore = v(i)
            bestIdx = i
        }
    }
    (bestIdx, bestScore)
}
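
A tiny check of argmax on a hand-built vector (assuming the shell's default Mahout math/scalabindings imports are in scope) would be:

// index 2 holds the largest score, so this prints (2,0.7)
val scores = dvec(0.1, 0.2, 0.7)
println(argmax(scores))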

define our final TF(-IDF) vector classifier

def classifyDocument(clvec: Vector): String = {
    val cvec = classifier.classifyFull(clvec)
    val (bestIdx, bestScore) = argmax(cvec)
    reverseLabelMap(bestIdx)
}

Two sample news articles: United States Football and United Kingdom Football

// A random United States football article
// http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
val UStextToClassify = new String("(Reuters) - Super Bowl security officials acknowledge" +
    " the NFL championship game represents a high profile target on a world stage but are" +
    " unaware of any specific credible threats against Sunday's showcase. In advance of" +
    " one of the world's biggest single day sporting events, Homeland Security Secretary" +
    " Jeh Johnson was in Glendale on Wednesday to review security preparations and tour" +
    " University of Phoenix Stadium where the Seattle Seahawks and New England Patriots" +
    " will battle. Deadly shootings in Paris and arrest of suspects in Belgium, Greece and" +
    " Germany heightened fears of more attacks around the world and social media accounts" +
    " linked to Middle East militant groups have carried a number of threats to attack" +
    " high-profile U.S. events. There is no specific credible threat, said Johnson, who" +
    " has appointed a federal coordination team to work with local, state and federal" +
    " agencies to ensure safety of fans, players and other workers associated with the" +
    " Super Bowl. I'm confident we will have a safe and secure and successful event." +
    " Sunday's game has been given a Special Event Assessment Rating (SEAR) 1 rating, the" +
    " same as in previous years, except for the year after the Sept. 11, 2001 attacks, when" +
    " a higher level was declared. But security will be tight and visible around Super" +
    " Bowl-related events as well as during the game itself. All fans will pass through" +
    " metal detectors and pat downs. Over 4,000 private security personnel will be deployed" +
    " and the almost 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear" +
    " device sniffing teams will be deployed and a network of Bio-Watch detectors will be" +
    " set up to provide a warning in the event of a biological attack. The Department of" +
    " Homeland Security (DHS) said in a press release it had held special cyber-security" +
    " and anti-sniper training sessions. A U.S. official said the Transportation Security" +
    " Administration, which is responsible for screening airline passengers, will add" +
    " screeners and checkpoint lanes at airports. Federal air marshals, behavior detection" +
    " officers and dog teams will help to secure transportation systems in the area. We" +
    " will be ramping it (security) up on Sunday, there is no doubt about that, said Federal" +
    " Coordinator Matthew Allen, the DHS point of contact for planning and support. I have" +
    " every confidence the public safety agencies that represented in the planning process" +
    " are going to have their best and brightest out there this weekend and we will have" +
    " a very safe Super Bowl.")

// A random United Kingdom football article
// http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship" +
    " deal with online financial trading company Swissquote, expanding the commercial" +
    " partnerships that have helped to make the English club one of the richest teams in" +
    " world soccer. United did not give a value for the deal, the club's first in the sector," +
    " but said on Monday it was a multi-year agreement. The Premier League club, 20 times" +
    " English champions, claim to have 659 million followers around the globe, making the" +
    " United name attractive to major brands like Chevrolet cars and sportswear group Adidas." +
    " Swissquote said the global deal would allow it to use United's popularity in Asia to" +
    " help it meet its targets for expansion in China. Among benefits from the deal," +
    " Swissquote's clients will have a chance to meet United players and get behind the scenes" +
    " at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that" +
    " allows retail investors to buy and sell foreign exchange, equities, bonds and other asset" +
    " classes. Like other retail FX brokers, Swissquote was left nursing losses on the Swiss" +
    " franc after Switzerland's central bank stunned markets this month by abandoning its cap" +
    " on the currency. The fallout from the abrupt move put rival and West Ham United shirt" +
    " sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 million" +
    " Swiss francs ($28 million) provision for its clients who were left out of pocket" +
    " following the franc's surge. United's ability to grow revenues off the pitch has made" +
    " them the second richest club in the world behind Spain's Real Madrid, despite a" +
    " downturn in their playing fortunes. United Managing Director Richard Arnold said" +
    " there was still lots of scope for United to develop sponsorships in other areas of" +
    " business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
    " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a" +
    " huge potential still for the club, and the other thing we have seen is there is very" +
    " significant growth even within categories. United have endured a tricky transition" +
    " following the retirement of manager Alex Ferguson in 2013, finishing seventh in the" +
    " Premier League last season and missing out on a place in the lucrative Champions League." +
    " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" +
    " Kelly; editing by Keith Weir)")

vectorize and classify our documents

val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)

println("Classifying the news article about superbowl security (united states)")
classifyDocument(usVec)

println("Classifying the news article about Manchester United (united kingdom)")
classifyDocument(ukVec)

tie everything together in a new method to classify new text

def classifyText(txt: String): String = {
    val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
    classifyDocument(v)
}

now we can simply call our classifyText method on any string

classifyText("Hello world from Queens")
classifyText("Hello world from London")

Copyright © 2014 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
