incubator-hama-dev mailing list archives

From Praveen Sripati <praveensrip...@gmail.com>
Subject Re: InputFormats for Hama
Date Thu, 29 Mar 2012 16:19:26 GMT
It would be nice to use the Hadoop core classes directly instead of copying
the code into Hama. The same applies to InputFormat and other classes. Hama
would then get upstream updates with no extra effort.

A generic input/output format could apply to MR, BSP, and other distributed
models as well.

Praveen

On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> >
> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
>
>
> Yep, you're right. I guess a single JIRA would be enough for the formats
> already implemented in Hadoop; for the others we need subclasses.
> Formats that I would really like to have:
>
>   - DBInputFormat[1]
>   - XMLInputFormat
>   - NLineInputFormat
>   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>   TextInputFormat)
>   - JSONInputFormat (for OpenGraph stuff)
>   - Graph DB formats (Neo4J and the others, whatever they are called)
>
> Anything I missed for a "full" coverage?
>
> > Could you please elaborate on this?
>
>
> Sure, DMOZ is some kind of database of crawled websites. It is used in some
> PageRank examples for testing; I don't know whether it was in Mahout. We
> could also use it, since we have PageRank as well.
> CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites; it
> is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this could
> be a cool example as well.
>
> [1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>
>
> On 25 March 2012 at 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>
> > Thomas et al,
> >
> > > Would someone please open JIRAs for that?
> >
> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
> >
> > > Based on XML we can implement a format that parses DMOZ or commoncrawl
> > > on Amazon S3.
> >
> > Could you please elaborate on this?
> >
> > Praveen
> >
> >
> > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
> >
> > > As I understand it, many iterative applications don't require key-value
> > > input/output and additionally need random access (read/write) to
> > > particular files. An I/O interface such as MPI's may increase
> > > flexibility here.
> > >
> > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > >
> > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Hama has only a limited set of input formats:
> > > >
> > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > SequenceFileInputFormat, TextInputFormat
> > > >
> > > > Does it make sense to have more input formats? I was thinking of
> > > > InputFormats for graph databases.
> > > >
> > > > Any feedback for the different input formats is welcome.
> > > >
> > > > I quickly glanced at Giraph and Hadoop; they have more InputFormats,
> > > > which makes it easy to plug them into external systems.
> > > >
> > > > Praveen
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
