Date: Fri, 15 Jul 2011 22:55:58 +0800
Subject: Re: File format question about Random forest.
From: Xiaobo Gu
To: dev@mahout.apache.org

Do the -p and -f options of org.apache.mahout.df.tools.Describe have to be
HDFS URLs, or can they be local file system paths?

On Fri, Jul 15, 2011 at 9:28 PM, Xiaobo Gu wrote:
> Can we make the file descriptor as follows:
>
> 1. Make a small CSV file with the same format as the actual dataset,
> say a CSV file with a header and only one record.
> 2. Use java weka.core.converters.CSVLoader filename.csv > filename.arff
> to convert the small CSV into an ARFF file, see
> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
>
> The only concern here is: is the small CSV file with one record
> sufficient to generate the ARFF file header, or do we have to
> use the whole file to avoid losing information?
>
> Xiaobo Gu
>
> On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu wrote:
>> But if we use CSV files, how can we generate descriptors for datasets?
>>
>> Cheers
>>
>> Xiaobo Gu
>>
>> On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim wrote:
>>> I guess yes, as long as you don't use quotes or double quotes to embed
>>> the fields.
>>>
>>> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu wrote:
>>>
>>>> So for simple datasets, which only have numeric and character
>>>> label (without blanks) category columns, can we just use CSV tools to
>>>> save it as a standard CSV file without a header?
>>>>
>>>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim wrote:
>>>> > The current implementation doesn't support the ARFF format
>>>> > out-of-the-box. As described in the wiki, you need to remove the
>>>> > header of the file and leave only the data. Actually, this
>>>> > implementation is fully compatible with UCI's datasets, which are
>>>> > comma-separated text files. You'll also need to call the dataset
>>>> > description tool (see the wiki) in order to generate a proper
>>>> > description file (it contains the nature of each attribute:
>>>> > Numerical or Categorical).
>>>> >
>>>> > Yes, you can use BuildForest and TestForest to generate and use
>>>> > Random forest models from the command line.
>>>> >
>>>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> The Random Forest partial implementation in
>>>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
>>>> >> uses the ARFF file format. Is ARFF the only supported file format
>>>> >> when using the BuildForest and TestForest programs, and are the
>>>> >> BuildForest and TestForest programs the official tools to build
>>>> >> Random Forest models from the command line?
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >> Xiaobo Gu
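
On the question raised in this thread of whether a one-record CSV is enough to describe a dataset, here is a minimal Python sketch that guesses whether each column of a headerless CSV is numerical or categorical. It is only an illustration, not how the Describe tool works: the function names are hypothetical, the letters N, C and L are placeholders chosen here for Numerical, Categorical and label, and the last column is assumed to be the label.

```python
import csv
import io


def infer_column_kinds(rows):
    """Return 'N' for columns whose values all parse as floats, else 'C'."""
    kinds = []
    for column in zip(*rows):
        try:
            for value in column:
                float(value)
            kinds.append("N")
        except ValueError:
            kinds.append("C")
    return kinds


def describe(csv_text, label_index=-1):
    """Build a space-separated kind string for a headerless CSV.

    Assumption: the column at label_index is the label and is marked 'L'.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    kinds = infer_column_kinds(rows)
    kinds[label_index] = "L"
    return " ".join(kinds)


# Hypothetical data: the third column looks numeric in the first record only.
sample = "1.5,red,7,yes\n2.0,blue,x,no\n"
first_record_only = sample.splitlines()[0] + "\n"

print(describe(sample))             # inferred from every record
print(describe(first_record_only))  # inferred from one record
```

With this made-up sample the two calls disagree on the third column, which suggests that a single record may not carry enough information and that inferring from the whole file is the safer choice.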