incubator-hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@googlemail.com>
Subject Re: InputFormats for Hama
Date Sun, 25 Mar 2012 13:10:29 GMT
>
> I can open a JIRA. I need input on which InputFormats make sense and
> their priority. Some we can port from Hadoop.


Yep, you're right. I guess a single JIRA would be enough for the formats
already implemented in Hadoop; for the others we need subclasses.
Formats that I would really like to have:

   - DBInputFormat[1]
   - XMLInputFormat
   - NLineInputFormat
   - CSVInputFormat (we could use OpenCSV for that in conjunction with
   TextInputFormat)
   - JSONInputFormat (for OpenGraph stuff)
   - Graph DB formats (Neo4j and the others, whatever they are called)
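
To make the CSV item a bit more concrete, here is a rough sketch of the
per-line work such a format might do once TextInputFormat hands it a line.
It is plain Java with no Hama, Hadoop or OpenCSV calls; the class name
CsvLineSketch and the method parseLine are made up for illustration, and a
real implementation would more likely delegate the parsing to OpenCSV.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hama or Hadoop: splits one CSV line
// into fields, honouring double-quoted fields and doubled quotes ("").
final class CsvLineSketch {

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                // An unquoted comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing comma.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("1,\"Berlin, DE\",\"say \"\"hi\"\"\""));
    }
}
```

This assumes one record per line, which is why it would pair with
TextInputFormat; quoted fields that span several lines would need more care.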

Anything I missed for a "full" coverage?

> Could you please elaborate on this?


Sure, DMOZ is a kind of database of crawled websites. It is used to test
some PageRank examples; I don't know whether it was in Mahout. We could
use it too, since we have PageRank as well.
CommonCrawl is a new, up-and-coming DMOZ-like database of many crawled
sites, hosted on Amazon S3. We run on EC2 via Whirr, so this could be a
cool example as well.
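
As a rough illustration of the XML-based idea, the sketch below streams
through a document with StAX (javax.xml.stream, part of the JDK) and
collects one record per page. The <page url="..."> and <link> element names
are invented for the example and are not the real DMOZ or CommonCrawl
layout; XmlRecordSketch and readPages are hypothetical names as well, and
an actual format would do this inside a record reader over a file split.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Hama: turns a made-up XML layout into
// (page URL -> outgoing links) records, the shape a PageRank job would want.
final class XmlRecordSketch {

    static Map<String, List<String>> readPages(String xml) {
        Map<String, List<String>> pages = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> currentLinks = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("page")) {
                    // Assumed layout: each <page> carries its URL as an attribute.
                    currentLinks = new ArrayList<>();
                    pages.put(reader.getAttributeValue(null, "url"), currentLinks);
                } else if (name.equals("link") && currentLinks != null) {
                    // Assumed layout: each <link> holds one target URL as text.
                    currentLinks.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return pages;
    }

    public static void main(String[] args) {
        String xml = "<pages>"
                + "<page url=\"a\"><link>b</link><link>c</link></page>"
                + "<page url=\"b\"/>"
                + "</pages>";
        System.out.println(readPages(xml));
    }
}
```

Streaming keeps memory use flat for large dumps, which is the main reason
to prefer it over loading a whole DOM here.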

[1]
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html


On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:

> Thomas et al,
>
> > Would someone please open JIRAs for that?
>
> I can open a JIRA. I need input on which InputFormats make sense and
> their priority. Some we can port from Hadoop.
>
> > Based on XML we can implement a format that parses DMOZ or CommonCrawl on
> Amazon S3.
>
> Could you please elaborate on this?
>
> Praveen
>
>
> On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>
> > As I understand it, many iterative applications don't require key/value
> > input/output and additionally need random access (read/write) to a
> > particular file. An I/O interface like the one in MPI may increase
> > flexibility here.
> >
> > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> >
> > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com> wrote:
> > > Hi,
> > >
> > > For Hama there is only a limited set of input formats:
> > >
> > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > SequenceFileInputFormat, TextInputFormat
> > >
> > > Does it make sense to have more input formats? I was thinking of
> > > InputFormats for graph databases.
> > >
> > > Any feedback for the different input formats is welcome.
> > >
> > > I quickly glanced at Giraph and Hadoop, and they have more InputFormats,
> > > which make it easy to plug them into external systems.
> > >
> > > Praveen
> >
>



-- 
Thomas Jungblut
Berlin <thomas.jungblut@gmail.com>
