Date: Sun, 25 Mar 2012 19:27:16 +0200
Subject: Re: InputFormats for Hama
From: Thomas Jungblut
To: hama-dev@incubator.apache.org

Thanks for your time.
I have tweeted about the graph DB formats. I know some of my followers are
working with them, so they might be interested.

On 25 March 2012 at 19:25, Praveen Sripati wrote:

> I have created umbrella JIRA HAMA-536 for creating the
> InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
> the tasks to myself; let me know if anyone is interested.
>
> Praveen
>
> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
> thomas.jungblut@googlemail.com> wrote:
>
> > > I can open a JIRA. I need input on which InputFormats make sense and
> > > their priority. Some we can port from Hadoop.
> >
> > Yep, you're right. I guess a single JIRA would be enough for the formats
> > already implemented in Hadoop; for the others we need subclasses.
> > Formats that I really wanted to have would be:
> >
> > - DBInputFormat[1]
> > - XMLInputFormat
> > - NLineInputFormat
> > - CSVInputFormat (we could use OpenCSV for that in conjunction with
> >   TextInputFormat)
> > - JSONInputFormat (for OpenGraph stuff)
> > - The graph DB formats: Neo4J and whatever the others are called
> >
> > Anything I missed for "full" coverage?
> >
> > > Could you please elaborate on this?
> >
> > Sure, DMOZ is some kind of crawled website database. It is used in some
> > PageRank examples for testing; I don't know if it was in Mahout. We could
> > also use it, since we have PageRank as well.
> > CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites;
> > it is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this
> > could be a cool example as well.
> >
> > [1]
> > http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
> >
> >
> > On 25 March 2012 at 14:56, Praveen Sripati wrote:
> >
> > > Thomas et al,
> > >
> > > > Would someone please open JIRAs for that?
> > >
> > > I can open a JIRA. I need input on which InputFormats make sense and
> > > their priority. Some we can port from Hadoop.
> > >
> > > > Based on XML we can implement a format that parses DMOZ or
> > > > commoncrawl on Amazon S3.
> > >
> > > Could you please elaborate on this?
> > >
> > > Praveen
> > >
> > >
> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin wrote:
> > >
> > > > As I understand, many iterative applications don't require key/value
> > > > input/output and additionally need random access (read/write) to a
> > > > particular file. An I/O interface, e.g. MPI, may increase
> > > > flexibility here.
> > > >
> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > > >
> > > > On 25 March 2012 10:01, Praveen Sripati wrote:
> > > > > Hi,
> > > > >
> > > > > For Hama there are limited input formats:
> > > > >
> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > > SequenceFileInputFormat, TextInputFormat
> > > > >
> > > > > Does it make sense to have more input formats? I was thinking of
> > > > > InputFormats for graph databases.
> > > > >
> > > > > Any feedback on the different input formats is welcome.
> > > > >
> > > > > I quickly glanced at Giraph and Hadoop, and they have more
> > > > > InputFormats, which makes it easy to plug them into external
> > > > > systems.
> > > > >
> > > > > Praveen
> >
> >
> > --
> > Thomas Jungblut
> > Berlin

--
Thomas Jungblut
Berlin