hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: InputFormats for Hama
Date Wed, 28 Mar 2012 01:35:38 GMT
Great, Praveen!

On Wed, Mar 28, 2012 at 10:33 AM, Praveen Sripati
<praveensripati@gmail.com> wrote:
> Ed,
>
> After I am done porting the Hadoop formats to Hama, I can work on it.
>
> I have created a sub-task HAMA-544 for HBase InputFormat.
>
> Praveen
>
> On Wed, Mar 28, 2012 at 4:33 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>
>> Nice discussion!
>>
>> BTW, is anyone interested in contributing HBase table input/output formatters?
>>
>> On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut
>> <thomas.jungblut@googlemail.com> wrote:
>> > Thanks for your time.
>> > I have tweeted about the graph db formats, I know some of my followers are
>> > working with them, so they might be interested.
>> >
>> > On 25 March 2012 at 19:25, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >
>> >> I have created the umbrella JIRA HAMA-536 for the
>> >> InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
>> >> the tasks to myself; let me know if anyone is interested.
>> >>
>> >> Praveen
>> >>
>> >> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <thomas.jungblut@googlemail.com> wrote:
>> >>
>> >> > >
>> >> > > I can open a JIRA. I need input on which InputFormats make sense and
>> >> > > their priority. Some we can port from Hadoop.
>> >> >
>> >> >
>> >> > Yep, you're right. I guess a single JIRA would be enough for the already
>> >> > implemented formats in Hadoop, for the others we need subclasses.
>> >> > Formats that I really wanted to have would be:
>> >> >
>> >> >   - DBInputFormat[1]
>> >> >   - XMLInputFormat
>> >> >   - NLineInputFormat
>> >> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>> >> >   TextInputFormat)
>> >> >   - JSONInputFormat (for OpenGraph stuff)
>> >> >   - The graph DB formats: Neo4j and whatever the others are called
>> >> >
>> >> > Anything I missed for a "full" coverage?
>> >> >
>> >> > > Could you please elaborate on this?
>> >> >
>> >> >
>> >> > Sure, DMOZ is a kind of crawled website database. It is used in some
>> >> > PageRank examples for testing; I don't know if it was in Mahout. We
>> >> > could also use it since we have PageRank as well.
>> >> > CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites;
>> >> > it is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this
>> >> > could be a cool example as well.
>> >> >
>> >> > [1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>> >> >
>> >> >
>> >> > On 25 March 2012 at 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >> >
>> >> > > Thomas et al,
>> >> > >
>> >> > > > Would someone please open JIRAs for that?
>> >> > >
>> >> > > I can open a JIRA. I need input on which InputFormats make sense and
>> >> > > their priority. Some we can port from Hadoop.
>> >> > >
>> >> > > > Based on XML we can implement a format that parses DMOZ or CommonCrawl on
>> >> > > > Amazon S3.
>> >> > >
>> >> > > Could you please elaborate on this?
>> >> > >
>> >> > > Praveen
>> >> > >
>> >> > >
>> >> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>> >> > >
>> >> > > > As I understand it, many iterative applications don't require
>> >> > > > key-value input/output and additionally need random access
>> >> > > > (read/write) to a particular file. An I/O interface, e.g. as in
>> >> > > > MPI, may increase flexibility here.
>> >> > > >
>> >> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
>> >> > > >
>> >> > > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >> > > > > Hi,
>> >> > > > >
>> >> > > > > For Hama there are only a limited number of input formats:
>> >> > > > >
>> >> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
>> >> > > > > SequenceFileInputFormat, TextInputFormat
>> >> > > > >
>> >> > > > > Does it make sense to have more input formats? I was thinking of
>> >> > > > > InputFormats for graph databases.
>> >> > > > >
>> >> > > > > Any feedback for the different input formats is welcome.
>> >> > > > >
>> >> > > > > I quickly glanced at Giraph and Hadoop, and they have more
>> >> > > > > InputFormats, which makes it easy to plug them into external systems.
>> >> > > > >
>> >> > > > > Praveen
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thomas Jungblut
>> >> > Berlin <thomas.jungblut@gmail.com>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <thomas.jungblut@gmail.com>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
