mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: SequenceFilesFromDirectory
Date Mon, 06 Jun 2011 17:28:09 GMT
Mark, you need to write your own tool to convert the data into sequence files.
It's pretty easy: instantiate a SequenceFile.Writer with both key and value as
Text and write your data to the file.
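
A minimal sketch of that approach for the one-document-per-line case. It is
not a Mahout tool: LinesToPairs, PairSink, and the "<file>:<line number>" key
scheme are hypothetical names and choices made up for illustration. The writer
is hidden behind a small interface so the line-splitting logic runs without
Hadoop; the comment in main shows roughly where a SequenceFile.Writer with
Text keys and values would plug in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch only: LinesToPairs and PairSink are hypothetical, not Mahout or Hadoop classes.
class LinesToPairs {

    /** Stand-in for the writer; with Hadoop this would wrap SequenceFile.Writer. */
    interface PairSink {
        void append(String key, String value) throws IOException;
    }

    /**
     * Treats every non-blank line as one document and emits (key, value),
     * where the key is "<sourceName>:<1-based line number>" (an assumed scheme).
     * Returns the number of documents emitted.
     */
    static int convert(BufferedReader in, String sourceName, PairSink sink) throws IOException {
        int lineNo = 0;
        int written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            String doc = line.trim();
            if (doc.isEmpty()) {
                continue; // skip blank lines, but keep counting them for the key
            }
            sink.append(sourceName + ":" + lineNo, doc);
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("first doc\n\nsecond doc\n"));
        // With Hadoop on the classpath, the sink would instead be something like:
        //   SequenceFile.Writer w =
        //       SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        //   sink = (k, v) -> w.append(new Text(k), new Text(v));  // then w.close()
        int n = convert(in, "part-0.txt", (k, v) -> System.out.println(k + "\t" + v));
        System.out.println(n + " documents");
    }
}
```

Any key works as long as it is unique per document; file name plus line number
is just one easy way to get that.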

If your data is very large, you might want to consider writing a map-only
MapReduce job that reads your input and writes <Text,Text> output using
SequenceFileOutputFormat.

Robin

On Mon, Jun 6, 2011 at 10:53 PM, Mark <static.void.dev@gmail.com> wrote:

> I am looking to perform clustering algorithms on these documents, which I
> thought (I could be wrong) requires sequence files. Is this not the case?
>
> Thanks
>
>
> On 6/6/11 10:11 AM, Daniel McEnnis wrote:
>
>> Mark,
>>
>> Generally speaking, Mahout has pretty good performance over log files
>> like the ones you're describing, so they typically don't get changed
>> into sequence files.  You'll need to write a converter yourself if you
>> really need sequence files (such as for key management).
>>
>> Daniel.
>>
>> On Mon, Jun 6, 2011 at 12:04 PM, Mark <static.void.dev@gmail.com> wrote:
>>
>>> I've been running through the examples as described in the Mahout in Action
>>> book and I have some questions regarding the SequenceFilesFromDirectory.java
>>> class.
>>>
>>> This class expects a directory of files that contains 1 document per file.
>>> Is there another Mahout class or some options I can supply to
>>> SequenceFilesFromDirectory.java to parse multiple documents per file? For
>>> example, my files contain 1 document per line. I would like to parse each
>>> line of each file and create a sequence file from this. Is this possible
>>> with SequenceFilesFromDirectory or would I have to write this myself?
>>>
>>> Thanks
>>>
>>>
