incubator-hama-dev mailing list archives

From Praveen Sripati <praveensrip...@gmail.com>
Subject Re: InputFormats for Hama
Date Sun, 25 Mar 2012 17:25:26 GMT
I have created the umbrella JIRA HAMA-536 for the
InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
the tasks to myself; let me know if anyone is interested.

Praveen

On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> >
> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
>
>
> Yep, you're right. I guess a single JIRA would be enough for the already
> implemented formats in Hadoop, for the others we need subclasses.
> Formats that I really wanted to have would be:
>
>   - DBInputFormat[1]
>   - XMLInputFormat
>   - NLineInputFormat
>   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>   TextInputFormat)
>   - JSONInputFormat (for OpenGraph stuff)
>   - The graph DB formats (Neo4j and whatever the others are called)
>
> Anything I missed for a "full" coverage?
>
> > Could you please elaborate on this?
>
>
> Sure, DMOZ is some kind of crawled website database. It is used to test
> some PageRank examples; I don't know if it was in Mahout. We could also
> use it, since we have PageRank as well.
> CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites. It
> is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this could
> be a cool example as well.
>
> [1]
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>
>
> On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>
> > Thomas et al,
> >
> > > Would someone please open JIRAs for that?
> >
> > I can open a JIRA. I need input on which InputFormats make sense and
> > their priority. Some we can port from Hadoop.
> >
> > > Based on XML we can implement a format that parses DMOZ or CommonCrawl
> > > on Amazon S3.
> >
> > Could you please elaborate on this?
> >
> > Praveen
> >
> >
> > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com>
> > wrote:
> >
> > > As I understand it, many iterative applications don't require key/value
> > > input/output and additionally need random access (read/write) to a
> > > particular file. An I/O interface like the one in MPI, for example, may
> > > increase flexibility here.
> > >
> > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > >
> > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Hama currently has a limited set of input formats:
> > > >
> > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > SequenceFileInputFormat, TextInputFormat
> > > >
> > > > Does it make sense to have more input formats? I was thinking of
> > > > InputFormats for graph databases.
> > > >
> > > > Any feedback for the different input formats is welcome.
> > > >
> > > > I quickly glanced at Giraph and Hadoop, and they have more InputFormats,
> > > > which makes it easy to plug them into external systems.
> > > >
> > > > Praveen
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
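
Editor's sketch, not part of the original thread: the CSVInputFormat item in the list above proposes reading lines through TextInputFormat and handing each line to a CSV parser. The per-line parsing step could look roughly like the code below. It deliberately uses neither the Hama/Hadoop record reader classes nor OpenCSV, so that it runs on its own; the class name CsvLineSketch and the method parseLine are hypothetical names chosen for illustration, and a real implementation would more likely delegate to OpenCSV as suggested in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper for illustration only. It is not part of Hama,
 * Hadoop, or OpenCSV. It turns one text line (as a line-oriented input
 * format would deliver it) into a list of CSV fields.
 */
final class CsvLineSketch {

    private CsvLineSketch() {}

    /** Splits one line into fields, honouring double-quoted fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                // An unquoted comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing comma, so add it explicitly.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String[] lines = { "1,2,0.5", "3,\"Berlin, DE\",1.0" };
        for (String line : lines) {
            System.out.println(parseLine(line));
        }
    }
}
```

Because the input arrives one line at a time, this sketch cannot handle quoted fields that contain line breaks; supporting those would need a record reader that is aware of quoting across line boundaries.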
