mahout-dev mailing list archives

From deneche abdelhakim <adene...@gmail.com>
Subject Re: File format question about Random forest.
Date Sat, 16 Jul 2011 06:24:55 GMT
Use the Describe tool; the partial implementation Wiki page explains how
to use it. And yes, the descriptor file must be supplied.
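
To make the idea of a descriptor concrete, here is a rough Python sketch that guesses, for a header-less CSV, whether each column is Numerical (N) or Categorical (C) and marks the last column as the label (L). This is not the Describe tool's code. The function names (infer_kinds, compress) are hypothetical, and the "count before token" shorthand such as "2 N C L" is my assumption about the notation shown on the Wiki, so check it against the page before relying on it.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_kinds(csv_text, label_index=-1):
    """Guess one kind per column: 'N', 'C', or 'L' for the label column.

    Assumes a header-less CSV whose rows all have the same number of fields.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("no data rows")
    n_cols = len(rows[0])
    label = label_index % n_cols
    kinds = []
    for col in range(n_cols):
        if col == label:
            kinds.append("L")
        elif all(_is_number(row[col]) for row in rows):
            kinds.append("N")
        else:
            kinds.append("C")
    return kinds


def compress(kinds):
    """Collapse runs of equal kinds, e.g. ['N', 'N', 'C', 'L'] -> '2 N C L'."""
    parts = []
    i = 0
    while i < len(kinds):
        j = i
        while j < len(kinds) and kinds[j] == kinds[i]:
            j += 1
        run = j - i
        parts.append(kinds[i] if run == 1 else f"{run} {kinds[i]}")
        i = j
    return " ".join(parts)


if __name__ == "__main__":
    sample = "5.1,3.5,red,yes\n4.9,3.0,blue,no\n"
    print(compress(infer_kinds(sample)))
```

A string like the one printed here is what you would then hand to the Describe tool as the attribute description; the tool itself still has to be run to produce the actual descriptor file.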

On Sat, Jul 16, 2011 at 5:57 AM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:

> But if I just use a CSV file, how can I generate the descriptor file?
> Must the descriptor file be supplied for BuildForest and TestForest?
>
>
> On Sat, Jul 16, 2011 at 5:39 AM, deneche abdelhakim <adeneche@gmail.com> wrote:
> > You don't need to convert the CSV file to ARFF; you can use it right away.
> >
> > You can use a small dataset as long as all values of the categorical attributes
> > are available in the dataset.
> >
> > On Fri, Jul 15, 2011 at 2:28 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> >
> >> Can we make the descriptor file as follows:
> >>
> >> 1. Make a small CSV file with the same format as the actual dataset,
> >> say a CSV file with a header and only one record.
> >> 2. Use java weka.core.converters.CSVLoader filename.csv >
> >> filename.arff to convert the small CSV into an ARFF file, see
> >> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> >> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
> >>
> >>
> >> The only concern here is: is the small CSV file with one record
> >> sufficient to generate the ARFF file header, or do we have to
> >> use the whole file to avoid losing information?
> >>
> >>
> >> Xiaobo Gu
> >>
> >>
> >>
> >>
> >> On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> >> > But if we use CSV files, how can we generate descriptors for datasets?
> >> >
> >> > Cheers
> >> >
> >> > Xiaobo Gu
> >> >
> >> > On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <adeneche@gmail.com> wrote:
> >> >> I guess yes, as long as you don't use quotes or double quotes to embed the
> >> >> fields.
> >> >>
> >> >> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> >> >>
> >> >>> So for simple datasets, which only have numeric columns and category
> >> >>> columns with character labels (without blanks), can we just use CSV tools to
> >> >>> save them as a standard CSV file without a header?
> >> >>>
> >> >>>
> >> >>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <adeneche@gmail.com> wrote:
> >> >>> > The current implementation doesn't support the ARFF format out of the box;
> >> >>> > as described in the Wiki, you need to remove the header of the file and leave
> >> >>> > only the data. Actually, this implementation is fully compatible with UCI's
> >> >>> > datasets, which are comma-separated text files. You'll also need to call the
> >> >>> > dataset description tool (see the wiki) in order to generate a proper
> >> >>> > description file (it contains the nature of each attribute: Numerical or
> >> >>> > Categorical).
> >> >>> >
> >> >>> > Yes, you can use BuildForest and TestForest to generate and use Random Forest
> >> >>> > models from the command line.
> >> >>> >
> >> >>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> >> >>> >
> >> >>> >> Hi,
> >> >>> >>
> >> >>> >> The Random Forest partial implementation described at
> >> >>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> >> >>> >> uses the ARFF file format. Is ARFF the only supported file format when
> >> >>> >> using the BuildForest and TestForest programs, and are the BuildForest and
> >> >>> >> TestForest programs the official tools for building Random Forest models
> >> >>> >> from the command line?
> >> >>> >>
> >> >>> >> Regards,
> >> >>> >>
> >> >>> >> Xiaobo Gu
> >> >>> >>
> >> >>> >
> >> >>>
> >> >>
> >> >
> >>
> >
>
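
On the point made earlier in the thread that a small dataset is fine as long as every value of the categorical attributes appears in it, here is a small Python sketch for checking that before generating a descriptor from a sample. It is only an illustration: missing_categories is a hypothetical helper, not part of Mahout, and it assumes header-less CSV text plus a list of column indices that you already know to be categorical.

```python
import csv
import io


def missing_categories(sample_text, full_text, categorical_cols):
    """Return {column index: values seen in the full data but not in the sample}."""

    def seen_values(text):
        seen = {col: set() for col in categorical_cols}
        for row in csv.reader(io.StringIO(text)):
            if not row:
                continue  # skip blank lines
            for col in categorical_cols:
                seen[col].add(row[col])
        return seen

    sample, full = seen_values(sample_text), seen_values(full_text)
    return {
        col: full[col] - sample[col]
        for col in categorical_cols
        if full[col] - sample[col]
    }


if __name__ == "__main__":
    full = "1,red\n2,blue\n3,red\n"
    one_record = "1,red\n"
    # A non-empty result means the sample is too small to describe the data.
    print(missing_categories(one_record, full, [1]))
```

An empty result suggests the sample covers the categorical values of the full file; a one-record sample will usually fail this check for any attribute with more than one value.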
