hadoop-user mailing list archives

From Nick Jones <nick.jo...@amd.com>
Subject Re: Some general questions about DBInputFormat
Date Tue, 11 Sep 2012 21:35:56 GMT
Hi Yaron,

I haven't looked at or used it in a while, but I seem to remember that each
mapper's SQL request was wrapped in a transaction to prevent the number
of rows from changing. DBInputFormat sets
Connection.TRANSACTION_SERIALIZABLE (from java.sql.Connection) so that
the set of rows matched by the WHERE clause cannot change during the read.

The locking behavior I observed may also have been related to how MySQL
was set up at the time.
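As a rough illustration of the pattern above, here is a minimal sketch. It is not DBInputFormat's actual code: the table, column names, and split sizes are made up. It shows how a per-mapper JDBC connection can be pinned to serializable isolation, and how each split becomes a bounded LIMIT/OFFSET query.

```java
import java.sql.Connection;

// Sketch of the per-split read pattern described above; the table and
// column names are hypothetical, not DBInputFormat's actual query.
public class SplitQuerySketch {

    // Pin the isolation level so the rows matched by the WHERE clause
    // cannot change while this mapper is reading its slice.
    static void configure(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    }

    // Build the bounded query one mapper would issue for its split.
    static String splitQuery(long start, long length) {
        return "SELECT id, payload FROM records ORDER BY id"
                + " LIMIT " + length + " OFFSET " + start;
    }

    public static void main(String[] args) {
        // Three mappers over 10 rows: splits of 3, 3, and 4 rows (the
        // last split absorbs the remainder).
        long total = 10, chunks = 3, chunkSize = total / chunks;
        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            long length = (i == chunks - 1) ? total - start : chunkSize;
            System.out.println(splitQuery(start, length));
        }
    }
}
```

Each mapper runs its slice inside its own serializable transaction, which is what can escalate to row or table locks depending on the database and its settings.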

On 09/11/2012 09:25 AM, Yaron Gonen wrote:
> Thanks for the fast response.
> Nick, regarding locking a table: as far as I understood from the code,
> each mapper opens its own connection to the DB. I didn't see any code
> where the job creates a transaction and passes it to the mappers.
> Did I miss something?
> Again, thanks!
> On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <nick.jones@amd.com 
> <mailto:nick.jones@amd.com>> wrote:
>     Hi Yaron
>     Replies inline below.
>     On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>         Hi,
>         After reviewing the class's (not very complicated) code, I
>         have some questions I hope someone can answer:
>           * (A more general question) Are there many use cases for
>             DBInputFormat? Do most Hadoop jobs take their input from
>             files or DBs?
>     Bejoy's right, most jobs utilize data across HDFS or some other
>     distributed architecture to feed M/R at a sufficient rate.
>     DBInputFormat could be helpful in pulling pointers to other
>     sources of data (e.g. file paths for filers where actual binary
>     content is stored).
>           * What happens when the database is updated during the mappers'
>             data retrieval phase? Is there a way to lock the database
>             before the data retrieval phase and release it afterwards?
>     The whole job creates a transaction against the RDBMS that ensures
>     consistent state throughout the job. Depending on the source and
>     settings, this might lock an entire table or only the rows selected
>     by the query.
>           * Since all mappers open a connection to the same DBMS, one
>             cannot use hundreds of mappers. Is there a solution to this
>             problem?
>     It depends on the connection limits and the number of rows requested.
>     In my experience the server ran into other problems well before
>     hitting connection-count limits.
>         Thanks,
>         Yaron
