mahout-dev mailing list archives

From Xiaobo Gu <guxiaobo1...@gmail.com>
Subject Re: File format question about Random forest.
Date Fri, 15 Jul 2011 14:55:58 GMT
Do the -p and -f options of org.apache.mahout.df.tools.Describe have to
be HDFS URLs, or can they be local file system paths?


On Fri, Jul 15, 2011 at 9:28 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> Can we make the file descriptor as follows:
>
> 1. Make a small CSV file with the same format as the actual dataset,
> say a CSV file with a header and only one record.
> 2. Use java weka.core.converters.CSVLoader filename.csv > filename.arff
> to convert the small CSV into an ARFF file, see
> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
>
>
> The only concern here is: is the small CSV file with one record
> sufficient to generate the ARFF file header, or do we have to
> use the whole file to avoid losing information?
>
>
> Xiaobo Gu
>
>
>
>
> On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
>> But if we use CSV files, how can we generate descriptors for datasets?
>>
>> Cheers
>>
>> Xiaobo Gu
>>
>> On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <adeneche@gmail.com> wrote:
>>> I guess yes, as long as you don't use single or double quotes to enclose
>>> the fields.
>>>
>>> On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
>>>
>>>> So for simple datasets, which only have numeric columns and category
>>>> columns with character labels (without blanks), can we just use CSV
>>>> tools to save them as standard CSV files without a header?
>>>>
>>>>
>>>> On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <adeneche@gmail.com> wrote:
>>>> > the current implementation doesn't support the ARFF format
>>>> > out-of-the-box, as described in the Wiki you need to remove the header
>>>> > of the file and leave only the data. Actually, this implementation is
>>>> > fully compatible with UCI's datasets which are comma separated text
>>>> > files. You'll also need to call the dataset description tool (see the
>>>> > wiki) in order to generate a proper description file (contains the
>>>> > nature of each attribute: Numerical or Categorical).
>>>> >
>>>> > Yes you can use BuildForest and TestForest to generate and use Random
>>>> > forest models from the command line
>>>> >
>>>> > On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> The Random Forest partial implementation described at
>>>> >> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
>>>> >> uses the ARFF file format. Is ARFF the only supported file format when
>>>> >> using the BuildForest and TestForest programs, and are BuildForest and
>>>> >> TestForest the official tools for building Random Forest models
>>>> >> from the command line?
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >> Xiaobo Gu
>>>> >>
>>>> >
>>>>
>>>
>>
>
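
On the one-record concern raised in the thread: the following is a minimal sketch, not Mahout's or Weka's actual code, of what can happen when attribute types and category values are inferred from the data itself. The helper names (infer_column_types, _is_number) and the sample rows are hypothetical and only serve as an illustration. Under these assumptions, a single record can still yield the right Numerical/Categorical label per column, but the set of category values it reveals is limited to that one record.

```python
import csv
import io


def _is_number(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_column_types(rows):
    """Label each column 'N' (numerical) or 'C' (categorical).

    Also returns, per column, the set of distinct values seen
    (left empty for numerical columns).
    """
    types, values = [], []
    for col in range(len(rows[0])):
        cells = [row[col] for row in rows]
        if all(_is_number(c) for c in cells):
            types.append("N")
            values.append(set())
        else:
            types.append("C")
            values.append(set(cells))
    return types, values


# Hypothetical header-less CSV data: one numeric and two categorical columns.
FULL = "5.1,red,yes\n4.9,blue,no\n6.0,green,yes\n"
full_rows = list(csv.reader(io.StringIO(FULL)))
one_row = full_rows[:1]  # the "small file with only one record"

full_types, full_values = infer_column_types(full_rows)
one_types, one_values = infer_column_types(one_row)

print("full file :", full_types, full_values)
print("one record:", one_types, one_values)
```

In this toy case the column types agree between the full data and the single record, while the categorical columns of the one-record sample each show only one value. Whether that matters for a given tool depends on whether it reads category values from the sample or from the full dataset.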
