mahout-dev mailing list archives

From Drew Farris <drew.far...@gmail.com>
Subject Re: Mahout 0.3 Plan and other changes
Date Thu, 04 Feb 2010 16:59:19 GMT
On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil <robin.anil@gmail.com> wrote:

>>
>> Document Directory -> Document Sequence File
>> Document Sequence File -> Document Token Streams
>> Document Token Streams -> Document Vectors + Dictionary
>>
> Ok I will work on this Job.

FWIW, Ted had proposed something on the order of allowing Documents to
have multiple named Fields, where each field has an independent token
stream. Likewise, Document sequence files could have multiple fields
per Document where each field is a string. What do you think about
something like this? The documents I work with day to day in
production are more frequently field structured than flat and in some
cases fields are tokenized while others are simply untouched.
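
To make the field idea concrete, here is a minimal sketch in Python (chosen only for brevity; it is not Mahout code). The names `Document`, `token_streams`, and `whitespace_tokenizer` are hypothetical, and the convention that a field without a tokenizer passes through as a single untouched token is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A tokenizer turns one field's text into a list of tokens.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


@dataclass
class Document:
    """Hypothetical document with multiple named string fields."""
    doc_id: str
    fields: Dict[str, str] = field(default_factory=dict)


def token_streams(doc: Document,
                  tokenizers: Dict[str, Tokenizer]) -> Dict[str, List[str]]:
    """Build an independent token stream for each named field.

    Fields with no entry in `tokenizers` are left untouched and
    emitted as a single token (an assumption of this sketch).
    """
    streams: Dict[str, List[str]] = {}
    for name, text in doc.fields.items():
        tokenizer = tokenizers.get(name)
        streams[name] = tokenizer(text) if tokenizer else [text]
    return streams
```

With this shape, a document whose `title` field is tokenized and whose `id` field is not would yield two separate streams keyed by field name, which could then be vectorized independently or together.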

> Also partial Vector merger could be reused by colloc when creating ngram
> only vectors. But we need to keep adding to the dictionary file. If you can
> work on a dictionary merger + chunker, it will be great. I think we can do
> this integration quickly

I'll take a closer look at the Dictionary code you've produced and see
what I can come up with -- is the basic idea here to take multiple
dictionaries with potentially overlapping IDs and merge them into a
single dictionary? What needs to happen with regard to chunking?
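
One possible reading of the merge question, sketched in Python for brevity (not Mahout code): each input dictionary maps term to ID, IDs may collide across inputs, and the merge assigns fresh IDs in a single shared space while recording how each old ID maps to its new one. The function name `merge_dictionaries` and the remap tables are hypothetical, and chunking is deliberately left out since its requirements are still open.

```python
from typing import Dict, List, Tuple


def merge_dictionaries(
    dictionaries: List[Dict[str, int]],
) -> Tuple[Dict[str, int], List[Dict[int, int]]]:
    """Merge term->ID dictionaries whose ID ranges may overlap.

    Returns the merged term->ID dictionary plus, for each input
    dictionary, a table mapping its old IDs to the merged IDs.
    """
    merged: Dict[str, int] = {}
    remaps: List[Dict[int, int]] = []
    for dictionary in dictionaries:
        remap: Dict[int, int] = {}
        # Visit terms in old-ID order so new IDs are assigned deterministically.
        for term, old_id in sorted(dictionary.items(), key=lambda kv: kv[1]):
            # Reuse the merged ID if the term is already known,
            # otherwise hand out the next unused ID.
            new_id = merged.setdefault(term, len(merged))
            remap[old_id] = new_id
        remaps.append(remap)
    return merged, remaps
```

The remap tables are what a partial vector merger would need in order to rewrite the indices of vectors built against the original dictionaries.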

Drew
