References: <1378052583.14550.YahooMailNeo@web163503.mail.gq1.yahoo.com>
Date: Wed, 4 Sep 2013 12:45:25 +0300
Subject: Re: TFIDFConverter generates empty tfidf-vectors
From: Taner Diler
To: user@mahout.apache.org

mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200 -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq

This command works well.

Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same: empty tfidf-vectors.

On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler wrote:

> Gokhan, I tried it from the command line and it works. I will send the
> command so we can compare the command-line parameters to the
> TFIDFConverter params.
>
> Suneel, I had checked the seqfiles. I didn't see any problem in the other
> generated seqfiles, but I will check again and send samples from each
> seqfile.
>
> On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan wrote:
>
>> Suneel is right indeed. I assumed that everything performed prior to
>> vector generation is done correctly.
>>
>> By the way, if the suggestions do not work, could you try running
>> seq2sparse from the command line with the same arguments and see if that
>> works well?
>>
>> On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi wrote:
>>
>> > I would first check to see if the input 'seqfiles' for TFIDFGenerator
>> > have any meat in them.
>> > This could also happen if the input seqfiles are empty.
>> >
>> > ________________________________
>> > From: Taner Diler
>> > To: user@mahout.apache.org
>> > Sent: Sunday, September 1, 2013 2:24 AM
>> > Subject: TFIDFConverter generates empty tfidf-vectors
>> >
>> > Hi all,
>> >
>> > I am trying to run the Reuters KMeans example in Java, but
>> > TFIDFConverter generates empty tfidf-vectors. How can I fix that?
>> >
>> >     private static int minSupport = 2;
>> >     private static int maxNGramSize = 2;
>> >     private static float minLLRValue = 50;
>> >     private static float normPower = 2;
>> >     private static boolean logNormalize = true;
>> >     private static int numReducers = 1;
>> >     private static int chunkSizeInMegabytes = 200;
>> >     private static boolean sequentialAccess = true;
>> >     private static boolean namedVectors = false;
>> >     private static int minDf = 5;
>> >     private static long maxDF = 95;
>> >
>> >     Path inputDir = new Path("reuters-seqfiles");
>> >     String outputDir = "reuters-kmeans-try";
>> >     HadoopUtil.delete(conf, new Path(outputDir));
>> >     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
>> >     Path tokenizedPath = new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>> >     DocumentProcessor.tokenizeDocuments(inputDir,
>> >         analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>> >
>> >     DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>> >         new Path(outputDir),
>> >         DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
>> >         minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
>> >         numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);
>> >
>> >     Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
>> >         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>> >         new Path(outputDir), conf, chunkSizeInMegabytes);
>> >     TFIDFConverter.processTfIdf(
>> >         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>> >         new Path(outputDir), conf, features, minDf, maxDF, normPower,
>> >         logNormalize, sequentialAccess, false, numReducers);
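
For the comparison of command-line parameters against the Java parameters mentioned above, here is a rough standalone sketch. It does not call Mahout at all; it only parses the seq2sparse command quoted in this thread and checks each flag against the values from the Java snippet. The class name Seq2SparseFlagCheck and its methods are hypothetical, the pairing of flags with Java fields is my reading of the thread, and the switch names -lnorm and -nv (neither appears in the command) are assumed to be the seq2sparse switches for logNormalize and namedVectors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Seq2SparseFlagCheck {

    // The command quoted in this thread.
    public static final String COMMAND =
        "mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200 "
        + "-wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq";

    // Parse "-flag value" pairs. A flag followed by another flag, or by
    // nothing, is treated as a boolean switch with the value "true".
    // Note: negative numeric values would be misread as flags in this sketch.
    public static Map<String, String> parseFlags(String command) {
        Map<String, String> flags = new LinkedHashMap<>();
        String[] tokens = command.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].startsWith("-")) {
                continue; // skips "mahout" and "seq2sparse"
            }
            boolean hasValue = i + 1 < tokens.length && !tokens[i + 1].startsWith("-");
            flags.put(tokens[i], hasValue ? tokens[++i] : "true");
        }
        return flags;
    }

    // Values from the Java snippet in the original post, keyed by the flag
    // assumed to correspond to each field (the pairing is an assumption).
    public static Map<String, String> javaSettings() {
        Map<String, String> s = new LinkedHashMap<>();
        s.put("-s", "2");        // minSupport
        s.put("-ng", "2");       // maxNGramSize
        s.put("-ml", "50");      // minLLRValue
        s.put("-n", "2");        // normPower
        s.put("-lnorm", "true"); // logNormalize (assumed switch name)
        s.put("-chunk", "200");  // chunkSizeInMegabytes
        s.put("-seq", "true");   // sequentialAccess
        s.put("-nv", "false");   // namedVectors (assumed switch name)
        s.put("-md", "5");       // minDf
        s.put("-x", "95");       // maxDF
        return s;
    }

    // Report every flag whose command-line value differs from the Java value.
    // A flag missing from the command line is treated as a switch that is off.
    public static List<String> mismatches(Map<String, String> cli, Map<String, String> java) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : java.entrySet()) {
            String cliValue = cli.get(e.getKey());
            if (cliValue == null) {
                cliValue = "false";
            }
            if (!cliValue.equals(e.getValue())) {
                result.add(e.getKey() + ": command line=" + cliValue + ", Java=" + e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : mismatches(parseFlags(COMMAND), javaSettings())) {
            System.out.println(line);
        }
    }
}
```

With the values quoted in this thread, the only difference reported is for -lnorm: the Java code sets logNormalize = true, while the command does not pass that switch. Whether this has anything to do with the empty tfidf-vectors is not established here; it is only the one parameter that is not identical between the two runs.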