mahout-user mailing list archives

From Stamatis Rapanakis <stamrapana...@gmail.com>
Subject Extracting the topics of documents (LDA, Mahout 0.7)
Date Thu, 06 Feb 2014 13:56:22 GMT
  I am trying to run the LDA algorithm. I can create meaningful topics but
the document/topic assignment is of very bad quality.

  I have assigned 30 tweets to each of the following 10 topics:

/grammy awards
/greek crisis
/greek islands
/premier inn
/premier league
/rihanna
/syria
/terrorism
/winter olympics
/winter sales

  I have a total of 300 tweets, and my purpose is to run the LDA algorithm
to see how well it assigns them. For example, if the number-of-topics
parameter is set to 10, how closely do the LDA assignments match the
original assignment?
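
Because LDA numbers its topics arbitrarily, one possible way to quantify the match is cluster purity: map each LDA topic to the original label most common among its tweets, then count the tweets that agree. The following is only a minimal sketch in Python; the function name `purity` is hypothetical and the sample labels are made up for illustration, not taken from the real run.

```python
from collections import Counter, defaultdict

def purity(true_labels, lda_topics):
    """Fraction of documents whose own label is the majority label of their LDA topic."""
    by_topic = defaultdict(Counter)
    for label, topic in zip(true_labels, lda_topics):
        by_topic[topic][label] += 1
    # For each LDA topic, count the documents carrying its most common label.
    agree = sum(counts.most_common(1)[0][1] for counts in by_topic.values())
    return agree / len(true_labels)

# Illustrative data only (assumption, not real results).
true_labels = ["rihanna", "rihanna", "syria", "syria", "syria", "terrorism"]
lda_topics = [8, 8, 6, 6, 7, 7]
print(purity(true_labels, lda_topics))
```

A purity of 1.0 would mean every LDA topic contains tweets from a single original category; lower values indicate mixing.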

1. I start by creating a file that contains the tweets in random order
(*tweets.tsv*). This file will be used to check the final topic assignment
of the tweets.

2. I remove stopwords, URLs and replies, and create a file with the tweet
text only (*tweets_no_stopwords.tsv*), one tweet (document) per line. This
will be the LDA input file.
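
The cleaning in step 2 could look roughly like the sketch below. It is not the code actually used: the stopword list is a tiny illustrative one, "replies" is assumed to mean @mentions, and the name `clean_tweet` is hypothetical.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+")
REPLY_RE = re.compile(r"@\w+")  # assumes "replies" means @mentions

def clean_tweet(text):
    """Strip URLs and @mentions, then drop stopwords (case-insensitive)."""
    text = URL_RE.sub(" ", text)
    text = REPLY_RE.sub(" ", text)
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)

print(clean_tweet("@bob The Grammy Awards hairstyles http://example.com/x are in the news"))
```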

3. I use some Java code to create a sequence file from
*tweets_no_stopwords.tsv*. I use a SequenceFile.Writer object with an
integer as the key and the tweet text as the value (extract the attached
tweets_no_stopwords.rar, which contains a chunk-0 file).

 By executing the command: *mahout seqdumper -i tweets_no_stopwords/chunk-0*
the chunk-0 file contents appear correctly:

*Key: 1: Value: #nowplaying Rihanna - Unfaithful !! 💙 trop belle !!*
*Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments*
*...*
*Key: 299: Value: team scored goal matches! (Man City)*
*Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards*

4. I convert the data to vectors:

bin/mahout seq2sparse -i tweets_no_stopwords -o tweets_no_stopwords-vectors
-ow

(I review the file with the command: *bin/mahout seqdumper -i
tweets_no_stopwords-vectors/tf-vectors/part-r-00000*)

5. I convert keys to IntWritables

bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/ -o
tweets_no_stopwords-vectors/tf-vectors-cvb

The created tf-vectors-cvb/docIndex, tf-vectors-cvb/matrix files have keys
from 0 - 299 (300 instances).
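
One way to check which original document each new integer key refers to is to dump the docIndex file with seqdumper and read it into a lookup table. The sketch below assumes the dump uses the same "Key: ...: Value: ..." line format shown earlier and that each value is the original document key; the sample lines are hypothetical and the real mapping must be read from the actual file.

```python
import re

LINE_RE = re.compile(r"^Key: (\d+): Value: (.*)$")

def parse_doc_index(dump_lines):
    """Map each rowid-assigned integer key to the original document key."""
    mapping = {}
    for line in dump_lines:
        m = LINE_RE.match(line.strip())
        if m:
            mapping[int(m.group(1))] = m.group(2)
    return mapping

# Hypothetical dump lines (assumption); replace with the real docIndex dump.
sample = ["Key: 0: Value: 7", "Key: 1: Value: 42", "Key: 2: Value: 3"]
doc_index = parse_doc_index(sample)
print(doc_index[1])
```

Looking up a docTopics key in such a table shows whether key 0 really is the first tweet of the input file.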

6. Finally I run the LDA algorithm:

*bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/ -o
lda_output/topicterm -mt lda_output/models -dt lda_output/docTopics -k 10
-x 40 -dict tweets_no_stopwords-vectors/dictionary.file-0*

Note: I have to press Ctrl+C to stop the command (after it has finished
and the message "Program took XXXX ms" appears). The folders are still
created as expected.

The topics created (lda_output/topicterm) seem fine. I execute the command:

*bin/mahout vectordump -i lda_output/topicterm -d
tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile -c csv -p
true -o p_term_topic.txt -sort lda_output/topicterm -vs 10*

and follow the steps described in this link (
http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html)
to create a file *p_term_topic.txt* and show a report with the output.

*Topic 0*: winter, sales, olympics, love, played, people, big, photo, sale, trail
*Topic 1*: terrorism, grammy, awards, blaindianexus, 56th, balochistan, bla, rock, 2014, photos
*Topic 2*: islands, greek, greece, travel, find, book, make, kea, days, holiday
*Topic 3*: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
*Topic 4*: grammys, found, style, red, hairdressers, room, mata, good, ty, walks
*Topic 5*: sochi, team, time, all, usa, war, free, syria, sending, check
*Topic 6*: syria, city, manchester, united, back, hit, watching, chelsea, week, matchday
*Topic 7*: syria, support, olympic, economy, video, today, competition, arab, u.s, inn's
*Topic 8*: rihanna, time, watch, unapologetic, follow, great, euro, congrats, bet, hotels
*Topic 9*: premier, inn, league, stay, season, β, year, home, goals, won



These results are good if you keep in mind the 10 categories the tweets
belong to:

/grammy awards
/greek crisis
/greek islands
/premier inn
/premier league
/rihanna
/syria
/terrorism
/winter olympics
/winter sales

But the results in the folder *lda_output/docTopics* are really bad!

bin/mahout seqdumper -i lda_output/docTopics/part-m-00000  (Display the
results)

Key: 0: Value:
{0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,3:0.16986766822778876,4:*0.5144069716184998*,5:6.134281324000599E-5,6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,9:7.909325497553191E-5}
Key: 1: Value:
{0:0.004101560509130678,1:0.02531905947518225,2:0.14528444920763148,3:*0.32904199007739116*,4:0.06024210378042988,5:0.15510210839789676,6:0.0364093686560865,7:0.13256015086012124,8:0.0613456311044372,9:0.05059357793169288}
Key: 2: Value:
{0:2.093051210521087E-4,1:0.0242076645518674,2:0.12014785226603218,3:0.15589333731396188,4:0.022516226489811282,5:0.015141667919690474,6:0.08494844406302673,7:0.150039462386397,8:0.15927498562672762,9:*0.2676210542614334*}
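
The highlighted value in each vector is the largest probability, i.e. the topic the document is assigned to. A minimal Python sketch for reading that off a dumped vector follows; the function name `best_topic` is hypothetical, and the sample string is the key 0 vector from the dump with the emphasis asterisks removed.

```python
import re

def best_topic(vector_text):
    """Return (topic_id, probability) of the largest entry in a '{id:prob,...}' dump."""
    pairs = re.findall(r"(\d+):([0-9.eE+-]+)", vector_text)
    probs = {int(t): float(p) for t, p in pairs}
    topic = max(probs, key=probs.get)
    return topic, probs[topic]

# Vector for key 0 as dumped by seqdumper.
doc0 = (
    "{0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,"
    "3:0.16986766822778876,4:0.5144069716184998,5:6.134281324000599E-5,"
    "6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,"
    "9:7.909325497553191E-5}"
)
print(best_topic(doc0))  # prints (4, 0.5144069716184998)
```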


*Tweet*  *Topic*  *Tweet text*
1        4        #nowplaying Rihanna Unfaithful !! 💙 trop belle !!
2        3        Grammy Awards Hairstyles: Memorable Moments
3        9        Preeminent #terrorism research center website. Check out: cc


 Am I missing something? Doesn't key 0 correspond to the first tweet
(document), key 1 to the second tweet and so on?

  Thank you in advance for your responses.
