Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of yaron.gonen@gmail.com
 designates 74.125.83.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <504FAEBC.1000907@amd.com>
References: 
 <CAKj4Onxp9LSOozrN2R71sjqxBnZ7=Obi6YDb-D65RAX7Cj3npg@mail.gmail.com>
	<504F35EF.8050702@amd.com>
	<CAKj4OnxCU9UVWsuyZPBuvWJUOhz-VknSCnrg3nL-jfWsb9Kk5Q@mail.gmail.com>
	<504FAEBC.1000907@amd.com>
Date: Wed, 12 Sep 2012 16:54:24 +0300
Message-ID: 
 <CAKj4OnzQFoUAMAKy5_XoXp6+bJ7yv3r4S=PPtxUaPAZdyo_Wdw@mail.gmail.com>
Subject: Re: Some general questions about DBInputFormat
From: Yaron Gonen <yaron.gonen@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=000e0cd1161046cc7904c9818787

--000e0cd1161046cc7904c9818787
Content-Type: text/plain; charset=ISO-8859-1

Hi again Nick,
DBInputFormat does use Connection.TRANSACTION_SERIALIZABLE, but this is a
per-connection attribute. Every mapper has its own connection, and each
connection is opened at a different time, so every connection sees a
different snapshot of the DB. This can cause, for example, two mappers to
process the same record (if an insert was committed in between).
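
To make the failure mode concrete, here is a minimal, database-free sketch.
It assumes each mapper reads its split with an ordered LIMIT/OFFSET query,
which is how I read the record reader code. The class and method names
(SnapshotSkewDemo, readSplit) and the row ids are made up for illustration
and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation; none of these names come from Hadoop.
final class SnapshotSkewDemo {

    // Emulates "SELECT id FROM t ORDER BY id LIMIT <limit> OFFSET <offset>"
    // evaluated against the snapshot that one connection sees.
    static List<Integer> readSplit(TreeSet<Integer> snapshot, int offset, int limit) {
        List<Integer> ordered = new ArrayList<>(snapshot);
        int from = Math.min(offset, ordered.size());
        int to = Math.min(offset + limit, ordered.size());
        return new ArrayList<>(ordered.subList(from, to));
    }

    public static void main(String[] args) {
        TreeSet<Integer> table = new TreeSet<>(Arrays.asList(10, 20, 30, 40));

        // Mapper 1 connects first; its snapshot is the table as of now.
        TreeSet<Integer> snapshot1 = new TreeSet<>(table);

        // A concurrent INSERT commits before mapper 2 opens its connection.
        table.add(5);
        TreeSet<Integer> snapshot2 = new TreeSet<>(table);

        // Splits were planned for 4 rows: offsets 0 and 2, two rows each.
        List<Integer> split1 = readSplit(snapshot1, 0, 2);
        List<Integer> split2 = readSplit(snapshot2, 2, 2);

        // Rows that both mappers end up processing.
        List<Integer> seenTwice = new ArrayList<>(split1);
        seenTwice.retainAll(split2);

        System.out.println("mapper 1 read: " + split1);
        System.out.println("mapper 2 read: " + split2);
        System.out.println("read twice:    " + seenTwice);
    }
}
```

In this toy run the inserted row shifts every offset in the second snapshot
by one, so row 20 is read by both mappers and row 40 by neither, even though
each read on its own is perfectly consistent.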

On Wed, Sep 12, 2012 at 12:35 AM, Nick Jones <nick.jones@amd.com> wrote:

>  Hi Yaron,
>
> I haven't looked at/used it in a while, but I seem to remember that each
> mapper's SQL request was wrapped in a transaction to prevent the number of
> rows changing.  DBInputFormat uses Connection.TRANSACTION_SERIALIZABLE from
> java.sql.Connection to prevent changes in the number of rows selected by
> a where clause.
>
> The locking behavior I observed may have also been related to how MySQL
> was set up at the time.
>
>
> On 09/11/2012 09:25 AM, Yaron Gonen wrote:
>
> Thanks for the fast response.
> Nick, regarding locking a table: as far as I understood from the code,
> each mapper opens its own connection to the DB. I didn't see any code such
> that the job creates a transaction and passes it to the mapper. Did I
> miss something?
> Again, thanks!
>
>
> On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <nick.jones@amd.com> wrote:
>
>> Hi Yaron
>>
>> Replies inline below.
>>
>>
>> On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>>
>>>  Hi,
>>> After reviewing the class's (not very complicated) code, I have some
>>> questions I hope someone can answer:
>>>
>>>   * (more general question) Are there many use cases for using
>>>     DBInputFormat? Do most Hadoop jobs take their input from files or
>>>     DBs?
>>>
>> Bejoy's right: most jobs utilize data across HDFS or some other
>> distributed architecture to feed M/R at a sufficient rate. DBInputFormat
>> could be helpful in pulling pointers to other sources of data (e.g. file
>> paths for filers where actual binary content is stored).
>>
>>>
>>>   * What happens when the database is updated during the mappers' data
>>>     retrieval phase? Is there a way to lock the database before the
>>>     data retrieval phase and release it afterwards?
>>>
>> The whole job creates a transaction against the RDBMS that ensures
>> consistent state throughout the job.  Depending on the source and settings,
>> this might entirely lock a table or lock the rows selected by the query.
>>
>>>
>>>   * Since all mappers open a connection to the same DBS, one cannot
>>>     use hundreds of mappers. Is there a solution to this problem?
>>>
>> Depends on the connection limits and the number of rows requested.
>> I've found that the server suffered other problems before it hit
>> connection count limitations.
>>
>>>
>>> Thanks,
>>> Yaron
>>>
>>
>>
>>
>
>

--000e0cd1161046cc7904c9818787--