Return-Path: X-Original-To: apmail-hadoop-user-archive@minotaur.apache.org Delivered-To: apmail-hadoop-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7CF9BDB3E for ; Wed, 12 Sep 2012 13:55:00 +0000 (UTC) Received: (qmail 76148 invoked by uid 500); 12 Sep 2012 13:54:52 -0000 Delivered-To: apmail-hadoop-user-archive@hadoop.apache.org Received: (qmail 76012 invoked by uid 500); 12 Sep 2012 13:54:51 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 76004 invoked by uid 99); 12 Sep 2012 13:54:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Sep 2012 13:54:51 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of yaron.gonen@gmail.com designates 74.125.83.48 as permitted sender) Received: from [74.125.83.48] (HELO mail-ee0-f48.google.com) (74.125.83.48) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Sep 2012 13:54:44 +0000 Received: by eekd41 with SMTP id d41so1489631eek.35 for ; Wed, 12 Sep 2012 06:54:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=wDHOywN89uxfayUKkGtmOYu91G8mfAfl2U6Vwi2gA6U=; b=ys/M3E0hkjFeBKDCHydxhlCKX6dTXLcxiOF3xDRqLCJe2A/8w+TEpVFgEXIbJX3kwS 57ClcTd1IRmZM98HHuS8O+bUlIFWP2mQqq1R0NH94zgU+/sZrRwX7r0qAjAF8qE6M1an g3TdHpxmlhMUyppbJ0sRc/89wAdcYuCNPR2wO6Nf8Z0DqACSRiFy+i8P9OHp9JK9Dnlw 6cd5Pcx/tO5jDObL7aXkn+YdNvBmkG9ldxzOwLJXABqP2fmlm3kRGnvNT0tNvE8Cc2mI WWWG8EB22NUBqHiBBarkLC/JGThEQIKrNiagM7yaNsQooPB9pgQl+MXCi3tSmcC/9kjy IrOw== MIME-Version: 1.0 Received: by 10.204.129.215 with SMTP id p23mr6113204bks.3.1347458064370; Wed, 12 Sep 2012 06:54:24 -0700 (PDT) Received: by 10.204.127.142 with HTTP; Wed, 12 Sep 2012 06:54:24 -0700 (PDT) In-Reply-To: <504FAEBC.1000907@amd.com> References: <504F35EF.8050702@amd.com> <504FAEBC.1000907@amd.com> Date: Wed, 12 Sep 2012 16:54:24 +0300 Message-ID: Subject: Re: Some general questions about DBInputFormat From: Yaron Gonen To: user@hadoop.apache.org Content-Type: multipart/alternative; boundary=000e0cd1161046cc7904c9818787 --000e0cd1161046cc7904c9818787 Content-Type: text/plain; charset=ISO-8859-1 Hi again Nick, DBInputFormat does use Connection.TRANSACTION_SERIALIZABLE, but this a per connection attribute. Since every mapper has its own connection, and every connection is opened in a different time, every connection sees a different snapshot of the DB and it can cause for example two mapper that process the same record (if an insert command was performed). On Wed, Sep 12, 2012 at 12:35 AM, Nick Jones wrote: > Hi Yaron, > > I haven't looked at/used it in awhile but I seem to remember that each > mapper's SQL request was wrapped in a transaction to prevent the number of > rows changing. DBInputFormat uses Connection.TRANSACTION_SERIALIZABLE from > java.sql.Connection to prevent changes in the number of rows selected from > a where clause. > > The locking behavior I observed may have also been related to how MySQL > was setup at the time. > > > On 09/11/2012 09:25 AM, Yaron Gonen wrote: > > Thanks for the fast response. > Nick, regarding locking a table: as far as I understood from the code, > each mapper opens its own connection to the DB. I didn't see any code such > that the job creates a transaction and passes it to the mapper. Did I > miss something? > again, thanks! > > > On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones wrote: > >> Hi Yaron >> >> Replies inline below. >> >> >> On 09/11/2012 07:41 AM, Yaron Gonen wrote: >> >>> Hi, >>> After reviewing the class's (not very complicated) code, I have some >>> questions I hope someone can answer: >>> >>> * (more general question) Are there many use-cases for using >>> >>> DBInputFormat? Do most Hadoop jobs take their input from files or >>> DBs? >>> >>> Bejoy's right, most jobs utilize data across HDFS or some other >> distributed architecture to feed M/R at a sufficient rate. DBInputFormat >> could be helpful in pulling pointers to other sources of data (e.g. file >> paths for filers where actual binary content is stored). >> >>> >>> * What happens when the database is updated during mappers' data >>> >>> retrieval phase? is there a way to lock the database before the >>> data retrieval phase and release it afterwords? >>> >>> The whole job creates a transaction against the RBDMS that ensures >> consistent state throughout the job. Depending on the source and settings, >> this might entirely lock a table or lock the selected rows by the query. >> >>> >>> * Since all mappers open a connection to the same DBS, one cannot >>> >>> use hundreds of mapper. Is there a solution to this problem? >>> >>> Depends on the connection limits and the number of rows requested. >> I've found that the server suffered other problems first before connection >> count limitations. >> >>> >>> Thanks, >>> Yaron >>> >> >> >> > > --000e0cd1161046cc7904c9818787 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Hi again Nick,
D= BInputFormat does use=A0Connec= tion.TRANSACTION_SERIALIZABLE, but this a per connection attribute. = Since every mapper has its own connection, and every connection is opened i= n a different time, every connection sees a different snapshot of the DB an= d it can cause for example two mapper that process the same record (if an i= nsert command was performed).

On Wed, Sep 12, 2012 at 12:35 AM, Nick Jones= <nick.jones@amd.com> wrote:
=20 =20 =20
Hi Yaron,

I haven't looked at/used it in awhile but I seem to remember that each mapper's SQL request was wrapped in a transaction to prevent the number of rows changing.=A0 DBInputFormat uses Connection.TRANSACTION_SERIALIZABLE from java.sql.Connection to prevent changes in the number of rows selected from a where clause.

The locking behavior I observed may have also been related to how MySQL was setup at the time.


On 09/11/2012 09:25 AM, Yaron Gonen wrote:
=20
Thanks for the fast response.
Nick, regarding locking a table: as far as I understood from the code, each mapper opens its own connection to the DB. I didn't see any code such that the job creates a transaction and passes it to the mapper. Did I miss=A0something?
again, thanks!


On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <nick.jones@amd.com> wrote:
Hi Yaron

Replies inline below.


On 09/11/2012 07:41 AM, Yaron Gonen wrote:
Hi,
After reviewing the class's (not very complicated) code, I have some questions I hope someone can answer:
=A0 * (more general question) Are there many use-cases for using

=A0 =A0 DBInputFormat? Do most Hadoop jobs take their input from files or DBs?

Bejoy's right, most jobs utilize data across HDFS or some other distributed architecture to feed M/R at a sufficient rate. DBInputFormat could be helpful in pulling pointers to other sources of data (e.g. file paths for filers where actual binary content is stored).

=A0 * What happens when the database is updated during mappers' data

=A0 =A0 retrieval phase? is there a way to lock the database before the
=A0 =A0 data retrieval phase and release it afterwords?
The whole job creates a transaction against the RBDMS that ensures consistent state throughout the job. =A0Depending on the source and settings, this might entirely lock a table or lock the selected rows by the query.

=A0 * Since all mappers open a connection to the same DBS, one cannot

=A0 =A0 use hundreds of mapper. Is there a solution to this problem?

Depends on the connection limits and the number of rows requested. I've found that the server suffered other problems first before connection count limitations.

Thanks,
Yaron





--000e0cd1161046cc7904c9818787--