hadoop-mapreduce-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: InputFormat for some REST api
Date Tue, 19 Feb 2013 17:34:59 GMT
I don't know of any input format that will do this out of the box.  But it should not be that
hard to write one.  There are two big issues here.

 1.  The data you are reading from the API really needs to be static, or you could get some
very odd inconsistencies.  For example, a node dies after a map task has finished but before
all of the reducers have fetched its output, so the map task is rerun, and some of the reducers
end up with old data while others have new data.  This is the main reason to download the
data before processing it.  You can work around this by using the input format in a map-only
job that writes the data out to a file, and then processing that file the rest of the way.
 2.  You need a good way to partition the data from the API.  This can be difficult unless
the REST API provides a logical way to split it up.
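
As a rough illustration of the partitioning point, here is a minimal sketch of how splits could
be planned. It assumes the REST API supports offset/limit paging and can report a total record
count; neither is given in this thread, and the class name RestSplitPlanner, its Range type, and
the offset/limit parameters are all hypothetical. It only computes the ranges. In a real input
format, getSplits() would return one InputSplit per range and the RecordReader would fetch that
range over HTTP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: divides a paged REST resource into independent record ranges.
public class RestSplitPlanner {

    /** Half-open record range [offset, offset + limit) to be read by one map task. */
    public static final class Range {
        public final long offset;
        public final long limit;

        public Range(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }

        // Renders the range as assumed query parameters, e.g. "offset=8&limit=2".
        @Override
        public String toString() {
            return "offset=" + offset + "&limit=" + limit;
        }
    }

    /**
     * Plans one range per split. The total record count is assumed to come from the API,
     * and to stay fixed for the life of the job (see the first point above).
     */
    public static List<Range> plan(long totalRecords, long recordsPerSplit) {
        if (totalRecords < 0 || recordsPerSplit <= 0) {
            throw new IllegalArgumentException("totalRecords must be >= 0 and recordsPerSplit > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < totalRecords; offset += recordsPerSplit) {
            // The last range may be shorter than the others.
            long limit = Math.min(recordsPerSplit, totalRecords - offset);
            ranges.add(new Range(offset, limit));
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : plan(10, 4)) {
            System.out.println(r);
        }
    }
}
```

If the API pages by cursor or by date range instead, the same idea applies, but the ranges
would have to be derived from whatever key the API lets you seek on.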


From: Yaron Gonen <yaron.gonen@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, February 19, 2013 4:49 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: InputFormat for some REST api

Do you know of any InputFormat implemented for some REST API provider?
Usually when one needs to process data that is accessible only through REST, one should try to
download the data first somehow, but what if you cannot download it?

