crunch-user mailing list archives

From Matthias Friedrich <m...@mafr.de>
Subject Re: JDBC parallel
Date Mon, 18 Mar 2013 17:10:35 GMT
Hi,

IIRC, the code in Crunch is inherently sequential and meant for
small(ish) amounts of data. After all, distributed read with Hadoop
from an RDBMS is often considered a DDoS attack :)
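
For context, the query splitting mentioned in the original question can be pictured roughly as follows. This is only a minimal sketch of range-based splitting, not the actual Hadoop or Crunch code: the class `RangeSplitSketch`, the method `splitClauses`, the `orders` table, and the `id` column are all hypothetical names chosen for illustration. Each generated clause would correspond to one split; whether the splits then run on separate connections depends on how the job schedules them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of range-based query splitting.
// It does not use or reproduce any Hadoop/Crunch API.
public class RangeSplitSketch {

    /** Builds one WHERE clause per split by dividing [min, max] into contiguous ranges. */
    static List<String> splitClauses(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("need numSplits >= 1 and max >= min");
        }
        List<String> clauses = new ArrayList<>();
        long total = max - min + 1; // number of key values to cover
        long lo = min;
        for (int i = 0; i < numSplits && lo <= max; i++) {
            // Spread the remainder over the first splits so sizes differ by at most one.
            long size = total / numSplits + (i < total % numSplits ? 1 : 0);
            if (size == 0) {
                continue;
            }
            long hi = lo + size - 1;
            clauses.add(column + " >= " + lo + " AND " + column + " <= " + hi);
            lo = hi + 1;
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Assumed key range 1..1000 on a hypothetical "orders" table, split four ways.
        for (String clause : splitClauses("id", 1, 1000, 4)) {
            System.out.println("SELECT * FROM orders WHERE " + clause);
        }
    }
}
```

If every split is read by the same task, the queries still run one after another over a single connection, which would match the behaviour described in the question.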

Regards,
  Matthias

On Monday, 2013-03-18, Josh Wills wrote:
> Hey Martijn,
> 
> I don't have any intuition on this one-- is this code that you could post
> as a gist or something so I could play with it and see if I see anything
> amiss? The trick will be figuring out if the problem is in Crunch, the
> underlying DB library, or the config.
> 
> J
> 
> 
> On Mon, Mar 18, 2013 at 6:50 AM, Martijn Lenderink
> <martijnrules@gmail.com> wrote:
> 
> > Hello,
> >
> > I have a working JDBC-connection to get data from an MSSQL source.
> > It all works great, except that my cluster only opens one connection to the
> > MSSQL server.
> >
> > I have multiple nodes running but the data gets pulled only from one node
> > and then the data gets sent to the other nodes for processing.
> >
> > I am using code similar to the following:
> >
> > https://github.com/apache/incubator-crunch/blob/master/crunch-contrib/src/it/java/org/apache/crunch/contrib/io/jdbc/DataBaseSourceIT.java
> >
> > The only difference is that I am using the DataDrivenDBInputFormat.
> >
> > When I debug the source code, the query gets split into multiple queries,
> > but they only get executed on one machine.
> > Why isn't this executed in parallel with multiple connections to the MSSQL
> > server?
> >
> > Greetings,
> > Martijn Lenderink
> >
> >
> 
> 
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
