incubator-hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@googlemail.com>
Subject Re: InputFormats for Hama
Date Sun, 25 Mar 2012 17:27:16 GMT
Thanks for your time.
I have tweeted about the graph DB formats. I know some of my followers
work with them, so they might be interested.

On 25 March 2012 at 19:25, Praveen Sripati <praveensripati@gmail.com> wrote:

> I have created an umbrella JIRA, HAMA-536, for creating the
> InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
> the tasks to myself; let me know if anyone is interested.
>
> Praveen
>
> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <thomas.jungblut@googlemail.com> wrote:
>
> > >
> > > I can open a JIRA. I need input on which InputFormats make sense and
> > > their priority. Some we can port from Hadoop.
> >
> >
> > Yep, you're right. I guess a single JIRA would be enough for the formats
> > already implemented in Hadoop; for the others we need subclasses.
> > Formats that I would really like to have:
> >
> >   - DBInputFormat[1]
> >   - XMLInputFormat
> >   - NLineInputFormat
> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
> >   TextInputFormat)
> >   - JSONInputFormat (for OpenGraph stuff)
> >   - The graph DB formats: Neo4j and whatever the others are called
> >
> > Anything I missed for a "full" coverage?
> >
> > > Could you please elaborate on this?
> >
> >
> > Sure, DMOZ is some kind of crawled website database. It is used to test
> > some PageRank examples; I don't know if it was in Mahout. We could also
> > use it, since we have PageRank as well.
> > CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites;
> > it is hosted on Amazon S3. We run on EC2 via Whirr, so this could be a
> > cool example as well.
> >
> > [1]
> > http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
> >
> >
> > On 25 March 2012 at 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
> >
> > > Thomas et al,
> > >
> > > > Would someone please open JIRAs for that?
> > >
> > > I can open a JIRA. I need input on which InputFormats make sense and
> > > their priority. Some we can port from Hadoop.
> > >
> > > > Based on XML we can implement a format that parses DMOZ or
> > > > CommonCrawl on Amazon S3.
> > >
> > > Could you please elaborate on this?
> > >
> > > Praveen
> > >
> > >
> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
> > >
> > > > As I understand it, many iterative applications don't require key-value
> > > > input/output and additionally need random access (read/write) to a
> > > > particular file. An I/O interface such as MPI's may increase
> > > > flexibility here.
> > > >
> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > > >
> > > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > For Hama there are only a limited number of input formats:
> > > > >
> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > > SequenceFileInputFormat, TextInputFormat
> > > > >
> > > > > Does it make sense to have more input formats? I was thinking of
> > > > > InputFormats for graph databases.
> > > > >
> > > > > Any feedback for the different input formats is welcome.
> > > > >
> > > > > I quickly glanced at Giraph and Hadoop, and they have more
> > > > > InputFormats, which makes it easy to plug them into external systems.
> > > > >
> > > > > Praveen
> > > >
> > >
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <thomas.jungblut@gmail.com>
> >
>



-- 
Thomas Jungblut
Berlin <thomas.jungblut@gmail.com>
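
As a rough illustration of the CSVInputFormat idea from the quoted list (a line-oriented text input combined with a CSV parser), the per-line parsing step might look like the sketch below. It deliberately uses no Hama, Hadoop, or OpenCSV API: the class name CsvLineSketch and the method parseLine are hypothetical, and the hand-rolled parser only stands in for what OpenCSV would do. It assumes each record fits on a single line, which is what a TextInputFormat-style reader would hand over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of Hama, Hadoop, or OpenCSV.
public class CsvLineSketch {

    // Splits one line of text into CSV fields.
    // Handles double-quoted fields and doubled quotes ("") inside them.
    // Assumption: a record never spans multiple lines.
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        // Doubled quote inside a quoted field: emit one quote.
                        current.append('"');
                        i++;
                    } else {
                        // Closing quote of the field.
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                // Opening quote of a quoted field.
                inQuotes = true;
            } else if (c == ',') {
                // Field separator outside quotes: finish the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing separator.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // One line as a line-based reader might deliver it as the record value.
        System.out.println(parseLine("1,\"Berlin, Germany\",0.85"));
    }
}
```

In a real input format, a parser like this (or OpenCSV itself) would sit behind the record reader, so that each line read from the split is turned into a list of fields before it reaches the job.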
