mahout-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Re: SequenceFilesFromDirectory
Date Mon, 06 Jun 2011 17:30:59 GMT
Thanks

On 6/6/11 10:28 AM, Robin Anil wrote:
> Mark, you need to write your own tool to convert the data into sequence files.
> It's pretty easy: instantiate a SequenceFile.Writer with both key and value as
> Text and write your data to the file.
>
> If your data is very large, you might want to consider writing a map-only
> MapReduce job that reads your input and writes <Text,Text> output using
> SequenceFileOutputFormat.
>
> Robin
>
> On Mon, Jun 6, 2011 at 10:53 PM, Mark <static.void.dev@gmail.com> wrote:
>
>> I am looking to perform clustering on these documents, which I
>> thought (I could be wrong) requires sequence files. Is this not the case?
>>
>> Thanks
>>
>>
>> On 6/6/11 10:11 AM, Daniel McEnnis wrote:
>>
>>> Mark,
>>>
>>> Generally speaking, Mahout has pretty good performance over log files
>>> like the ones you're describing, so they typically don't get converted
>>> into sequence files.  You'll need to write a converter yourself if you
>>> really need sequence files (such as for key management).
>>>
>>> Daniel.
>>>
>>> On Mon, Jun 6, 2011 at 12:04 PM, Mark <static.void.dev@gmail.com> wrote:
>>>
>>>> I've been running through the examples as described in the Mahout in Action
>>>> book and I have some questions regarding the SequenceFilesFromDirectory.java
>>>> class.
>>>>
>>>> This class expects a directory of files that contains 1 document per file.
>>>> Is there another Mahout class, or some options I can supply to
>>>> SequenceFilesFromDirectory.java, to parse multiple documents per file? For
>>>> example, my files contain 1 document per line. I would like to parse each
>>>> line of each file and create a sequence file from this. Is this possible
>>>> with SequenceFilesFromDirectory or would I have to write this myself?
>>>>
>>>> Thanks
>>>>
>>>>
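
For reference, Robin's first suggestion (a SequenceFile.Writer with Text keys
and Text values) might look roughly like the sketch below. This is only an
illustration, not code from the thread: the class name LinesToSequenceFile,
the "/<file name>/<line number>" key scheme, the UTF-8 input encoding, and the
two command-line arguments are all assumptions. It uses the
SequenceFile.createWriter factory rather than calling a constructor directly,
assumes a Hadoop 2.x or later client on the classpath, and does not recurse
into subdirectories.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {

  // Hypothetical key scheme (not from the thread): "/<file name>/<line number>".
  static String keyFor(String fileName, long lineNumber) {
    return "/" + fileName + "/" + lineNumber;
  }

  public static void main(String[] args) throws IOException {
    Path inputDir = new Path(args[0]); // directory of one-document-per-line files
    Path output = new Path(args[1]);   // sequence file to create
    Configuration conf = new Configuration();
    FileSystem inFs = inputDir.getFileSystem(conf);
    FileSystem outFs = output.getFileSystem(conf);

    // Older factory signature; newer Hadoop releases deprecate it in favor of
    // the Writer.Option-based overload.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(outFs, conf, output, Text.class, Text.class);
    try {
      Text key = new Text();
      Text value = new Text();
      for (FileStatus status : inFs.listStatus(inputDir)) {
        if (status.isDirectory()) {
          continue; // this sketch does not recurse into subdirectories
        }
        Path file = status.getPath();
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(inFs.open(file), StandardCharsets.UTF_8))) {
          String line;
          long lineNumber = 0;
          while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.trim().isEmpty()) {
              continue; // skip blank lines
            }
            // One record per input line: key identifies the line, value is the document.
            key.set(keyFor(file.getName(), lineNumber));
            value.set(line);
            writer.append(key, value);
          }
        }
      }
    } finally {
      writer.close();
    }
  }
}
```

The output path should live outside the input directory so the tool does not
try to read its own output.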

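For larger inputs, the map-only MapReduce job that Robin mentions could be
sketched along the following lines. Again this is an assumption-laden
illustration rather than anything posted in the thread: the class names, the
"/<file name>/<byte offset>" key scheme, and the argument order are made up
for the example, and it assumes the Hadoop 2.x or later
org.apache.hadoop.mapreduce API. The default TextInputFormat hands the mapper
one line at a time, keyed by its byte offset in the file, so the offset is
used in the key instead of a line number.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LinesToSequenceFileJob {

  // Hypothetical key scheme (not from the thread): "/<file name>/<byte offset>".
  static String keyFor(String fileName, long byteOffset) {
    return "/" + fileName + "/" + byteOffset;
  }

  // Emits one <Text,Text> record per non-empty input line.
  public static class LineMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.getLength() == 0) {
        return; // skip empty lines
      }
      // Assumes a file-based input format, so the split is a FileSplit.
      String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
      outKey.set(keyFor(fileName, offset.get()));
      context.write(outKey, line);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "lines to sequence file");
    job.setJarByClass(LinesToSequenceFileJob.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0); // map-only: mapper output goes straight to the output format
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With zero reducers the job writes one sequence file part per map task into the
output directory. Keys are only unique if the input file names are unique.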