Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (nike.apache.org: domain of taner.diler@gmail.com
 designates 209.85.212.47 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAOJiTsujtGnDq7FuaJtuSdyjpCmjy8B0vGFT5XrbFxiA++-7mQ@mail.gmail.com>
References: 
 <CAOJiTssL+bLEH-6wLdHncusZcZXuG3zTOdauWx_s5oWV2jjr7g@mail.gmail.com>
	<1378052583.14550.YahooMailNeo@web163503.mail.gq1.yahoo.com>
	<CAFj5n5EiA0TBGhD+_=dtoxtjpZ+jumvsaUQ6AtNU8RJ5i+48og@mail.gmail.com>
	<CAOJiTsujtGnDq7FuaJtuSdyjpCmjy8B0vGFT5XrbFxiA++-7mQ@mail.gmail.com>
Date: Wed, 4 Sep 2013 12:45:25 +0300
Message-ID: 
 <CAOJiTstY2sA605MbgpZy6todgM=QcAf-j7JedGaszG1UHpZv4g@mail.gmail.com>
Subject: Re: TFIDFConverter generates empty tfidf-vectors
From: Taner Diler <taner.diler@gmail.com>
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=bcaec548587c2ed28704e58baad6

--bcaec548587c2ed28704e58baad6
Content-Type: text/plain; charset=ISO-8859-1

mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200 -wt
tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq

This command works well.

Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same:
empty tfidf-vectors.
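
To make the comparison between the command line and the Java parameters
concrete, here is a minimal sketch that parses the seq2sparse command above and
checks it against the original Java settings quoted further down (with
minLLRValue still at 50). It does not call Mahout at all. The pairing of each
flag with a Java field is an assumption based on the option names, and the
-lnorm and -nv flags do not appear anywhere in this thread; they are assumed to
be the switches for logNormalize and namedVectors. The class and method names
are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Seq2SparseFlags {

    // Parses "-flag value" pairs; a flag with no value (such as -seq) maps to "true".
    static Map<String, String> parse(String commandLine) {
        Map<String, String> flags = new LinkedHashMap<>();
        String[] tokens = commandLine.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (!token.startsWith("-")) {
                continue; // skips "mahout" and "seq2sparse"
            }
            boolean hasValue = i + 1 < tokens.length && !tokens[i + 1].startsWith("-");
            if (hasValue) {
                flags.put(token, tokens[i + 1]);
                i++; // the value has been consumed
            } else {
                flags.put(token, "true");
            }
        }
        return flags;
    }

    // Java-side settings from the quoted code, keyed by the flag assumed to match.
    static Map<String, String> javaSettings() {
        Map<String, String> s = new LinkedHashMap<>();
        s.put("-s", "2");        // minSupport
        s.put("-ng", "2");       // maxNGramSize
        s.put("-ml", "50");      // minLLRValue
        s.put("-n", "2");        // normPower
        s.put("-lnorm", "true"); // logNormalize (flag name is an assumption)
        s.put("-chunk", "200");  // chunkSizeInMegabytes
        s.put("-seq", "true");   // sequentialAccess
        s.put("-nv", "false");   // namedVectors (flag name is an assumption)
        s.put("-md", "5");       // minDf
        s.put("-x", "95");       // maxDF
        return s;
    }

    // Lists settings whose values differ; a missing flag counts as boolean "false".
    static List<String> mismatches(Map<String, String> cli, Map<String, String> java) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : java.entrySet()) {
            String cliValue = cli.get(e.getKey());
            boolean same = cliValue == null
                    ? "false".equals(e.getValue())
                    : cliValue.equals(e.getValue());
            if (!same) {
                out.add(e.getKey() + ": command line="
                        + (cliValue == null ? "(not set)" : cliValue)
                        + ", Java=" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String command = "mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try "
                + "-chunk 200 -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq";
        for (String line : mismatches(parse(command), javaSettings())) {
            System.out.println(line);
        }
    }
}
```

Under these assumptions the only difference it reports is logNormalize, which
is true in the Java code but not set on the command line. Whether that
difference explains the empty tfidf-vectors is not established here; it is
simply the one setting that does not line up.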


On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler <taner.diler@gmail.com> wrote:

> Gokhan, I tried it from the command line and it works. I will send the
> command so we can compare its parameters with the TFIDFConverter params.
>
> Suneel, I had checked the seqfiles. I didn't see any problem with the other
> generated seqfiles, but I will check again and send samples from each of them.
>
>
> On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan <gkhncpn@gmail.com> wrote:
>
>> Suneel is right indeed. I assumed that everything performed prior to vector
>> generation is done correctly.
>>
>> By the way, if the suggestions do not work, could you try running
>> seq2sparse from the command line with the same arguments and see if that
>> works well?
>>
>> On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi <suneel_marthi@yahoo.com
>> >wrote:
>>
>> > I would first check to see if the input 'seqfiles' for TFIDFGenerator have
>> > any meat in them.
>> > This could also happen if the input seqfiles are empty.
>>
>>
>> >
>> >
>> > ________________________________
>> >  From: Taner Diler <taner.diler@gmail.com>
>> > To: user@mahout.apache.org
>> > Sent: Sunday, September 1, 2013 2:24 AM
>> > Subject: TFIDFConverter generates empty tfidf-vectors
>> >
>> >
>> > Hi all,
>> >
>> > I am trying to run the Reuters KMeans example in Java, but TFIDFConverter
>> > generates empty tfidf-vectors. How can I fix that?
>> >
>> >     private static int minSupport = 2;
>> >     private static int maxNGramSize = 2;
>> >     private static float minLLRValue = 50;
>> >     private static float normPower = 2;
>> >     private static boolean logNormalize = true;
>> >     private static int numReducers = 1;
>> >     private static int chunkSizeInMegabytes = 200;
>> >     private static boolean sequentialAccess = true;
>> >     private static boolean namedVectors = false;
>> >     private static int minDf = 5;
>> >     private static long maxDF = 95;
>> >
>> >         Path inputDir = new Path("reuters-seqfiles");
>> >         String outputDir = "reuters-kmeans-try";
>> >         HadoopUtil.delete(conf, new Path(outputDir));
>> >         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
>> >         Path tokenizedPath = new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>> >         DocumentProcessor.tokenizeDocuments(inputDir,
>> >                 analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>> >
>> >         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
>> >                 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
>> >                 minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
>> >                 numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);
>> >
>> >         Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
>> >                 new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>> >                 new Path(outputDir), conf, chunkSizeInMegabytes);
>> >         TFIDFConverter.processTfIdf(
>> >                 new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>> >                 new Path(outputDir), conf, features, minDf, maxDF, normPower, logNormalize,
>> >                 sequentialAccess, false, numReducers);
>> >
>>
>
>

--bcaec548587c2ed28704e58baad6--