mahout-user mailing list archives

From Daniel McEnnis <dmcen...@gmail.com>
Subject Re: Classification with data from Lucene
Date Tue, 05 Apr 2011 21:10:23 GMT
David,

It's actually not raw text that the Bayes classifier takes as input, but
tokenized words: no punctuation, with tokens separated by a space. One
document per line, with the classification label starting every line. I hope
this helps...
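
A minimal sketch of producing one such line from a term-frequency map,
which is what a Lucene term vector gives you when the original text is
gone. The helper names (clean_token, training_line) are hypothetical, the
tab between label and tokens is assumed from the <label>\t<text> form
quoted below, and repeating each token once per occurrence is only one
possible way to carry the frequencies. Word order is lost, so this would
only suit unigram features.

```python
def clean_token(term: str) -> str:
    """Lowercase a term and drop every character that is not a letter or digit."""
    return "".join(ch for ch in term.lower() if ch.isalnum())


def training_line(label: str, term_freqs: dict) -> str:
    """Build one classifier input line: label, a tab, then space-separated tokens.

    term_freqs maps a term to its frequency in a single document.
    Each token is repeated once per occurrence (an assumption, see above).
    """
    tokens = []
    for term in sorted(term_freqs):  # sorted only to make the output deterministic
        token = clean_token(term)
        if token:  # skip terms that were pure punctuation
            tokens.extend([token] * term_freqs[term])
    return label + "\t" + " ".join(tokens)
```

For example, training_line("categoryX", {"Mahout": 2, "index,": 1}) gives
the label, a tab, and then "mahout mahout index". Writing one such line
per document yields a file in the shape described above.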

Daniel.

On Tue, Apr 5, 2011 at 4:49 PM, David Croley <dcroley@renewdata.com> wrote:
> I'm not too worried about splitting the data into test and train sets. My main issue
> is that the classifier examples I can find all take as input a file with the form (at least
> for text):
>
> <label>\t<text to classify...>
>
> However, I don't have the original content of the files, only the index with term frequency
> vectors. I know the first step for the Bayesian algorithms is creating a TF-IDF vector, but
> it seems the existing code cannot take TF-IDF vectors like the cluster algorithms do, or even
> some variant of the term frequency vectors I can get from Lucene.
>
> At this point, I am going to try to write code to dump the words and frequencies from
> the index, add a label, and modify the BayesFeatureDriver class to take my input.
>
> David
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Tuesday, April 05, 2011 3:19 PM
> To: user@mahout.apache.org
> Subject: Re: Classification with data from Lucene
>
> The Lucene intake does not support searches on the index.
>
> If you can make copies of the index, here's a trick: delete the
> documents you don't want, then optimize the index. You will need a
> Lucene program to do this.
> Use this to separate the big index into training and test indexes.
>
> On Mon, Apr 4, 2011 at 6:51 PM, David Croley <dcroley@renewdata.com> wrote:
>> I have a large Lucene index (with TermFreq vectors). I do not have easy
>> access to the original source docs that the index was made from. I have
>> identified a set of docs in the index as Category X. Is there a way to
>> run Mahout's Bayesian classification algorithm, trained on the docs in
>> Category X, on the remaining docs in the index to better identify
>> category matches?
>>
>>
>>
>> I have also exported the Lucene data into a Vector file in prep to run
>> some clustering experiments (as per the wiki examples) and also wondered
>> if that data could be used to feed the CBayes code. From what I can
>> tell, the classification code in Mahout takes a completely different
>> form of input compared to the clustering algorithms.
>>
>>
>>
>> Thanks for any pointers.
>>
>>
>>
>>
>>
>> David Croley
>>
>> Lead Engineer
>>
>> RenewData
>>
>> Confidentiality Notice: This electronic communication contained in this e-mail from
dcroley@renewdata.com (including any attachments) may contain privileged and/or confidential
information. This communication is intended only for the use of indicated e-mail addressees.
Please be advised that any disclosure, dissemination, distribution, copying, or other use
of this communication or any attached document other than for the purpose intended by the
sender is strictly prohibited. If you have received this communication in error, please notify
the sender immediately by reply e-mail and promptly destroy all electronic and printed copies
of this communication and any attached document. Thank you in advance for your cooperation.
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>
