flink-dev mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: HBase 0.98 addon for Flink 0.8
Date Thu, 13 Nov 2014 17:36:12 GMT
Any help with this? :(

On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <pompermaier@okkam.it>
wrote:

> We have confirmed that instantiating HTable and Scan in the configure()
> method of TableInputFormat causes problems in a distributed environment!
> If you look at my implementation at
> https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
> you can see that Scan and HTable were made transient and are recreated within
> configure(), but this causes HBaseConfiguration.create() to fail when searching
> for classpath files... Could you help us understand why?
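
To illustrate why a transient field has to be rebuilt in configure(), here is a minimal sketch. It assumes the format object is serialized on the client and deserialized on the worker. SketchInputFormat and its members are hypothetical stand-ins, not the actual TableInputFormat, and no HBase classes are involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for an input format; NOT the real TableInputFormat.
class SketchInputFormat implements Serializable {
    private static final long serialVersionUID = 1L;

    // Serialized and shipped together with the format object.
    private final String tableName;
    // Stand-in for a non-serializable HTable; Java serialization skips it.
    private transient Object tableHandle;

    SketchInputFormat(String tableName) {
        this.tableName = tableName;
    }

    // Meant to run on the worker: rebuilds the transient state.
    void configure() {
        this.tableHandle = "handle for " + tableName;
    }

    boolean hasHandle() {
        return tableHandle != null;
    }

    // Simulates shipping the object to a worker via Java serialization.
    static SketchInputFormat roundTrip(SketchInputFormat format) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(format);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchInputFormat) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchInputFormat local = new SketchInputFormat("myTable");
        local.configure();
        SketchInputFormat shipped = roundTrip(local);
        System.out.println("handle after shipping: " + shipped.hasHandle());
        shipped.configure();
        System.out.println("handle after configure(): " + shipped.hasHandle());
    }
}
```

The point of the sketch is only that whatever configure() creates is built on the worker, so anything it looks up (such as configuration files) is resolved against the worker's classpath and not the client's.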
>
> On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>
>> Usually, when I run a mapreduce job on either Spark or Hadoop, I just put
>> the *-site.xml files into the war I submit to the cluster and that's it. I
>> think the problem appeared when I made the HTable a private transient field
>> and the table instantiation was moved into the configure method.
>> Could that be a valid reason? We still have to debug more deeply, but I'm
>> trying to figure out where to investigate.
>> On Nov 12, 2014 8:03 PM, "Robert Metzger" <rmetzger@apache.org> wrote:
>>
>>> Hi,
>>> Maybe it's an issue with the classpath? As far as I know, Hadoop reads
>>> the configuration files from the classpath. Maybe the hbase-site.xml
>>> file is not accessible through the classpath when running on the cluster?
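
One way to check this suspicion is to ask the class loader directly whether it can see the file. The sketch below uses only the JDK; ClasspathProbe and locate are hypothetical names, and it assumes the lookup goes through the thread's context class loader. Calling it from inside configure() on a worker would show what is visible there.

```java
import java.net.URL;

// Hypothetical helper; uses only the JDK, no Hadoop or HBase classes.
class ClasspathProbe {

    // Returns the location of the resource on the classpath, or null if it is not visible.
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader when no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        URL url = locate("hbase-site.xml");
        System.out.println(url == null
                ? "hbase-site.xml is NOT visible on the classpath"
                : "hbase-site.xml found at " + url);
    }
}
```

If the probe prints a location on the local executor but not on the cluster, the file is most likely missing from the classpath of the worker JVMs.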
>>>
>>> On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>
>>> > Today we tried to execute a job on the cluster instead of on the local
>>> > executor, and we faced the problem that hbase-site.xml was basically
>>> > ignored. Is there a reason why the TableInputFormat works correctly in a
>>> > local environment but not on a cluster?
>>> > On Nov 10, 2014 10:56 AM, "Fabian Hueske" <fhueske@apache.org> wrote:
>>> >
>>> > > I don't think we need to bundle the HBase input and output format in a
>>> > > single PR.
>>> > > So, I think we can proceed with the IF only and target the OF later.
>>> > > However, the fix for Kryo should be in the master before merging the PR.
>>> > > Till is currently working on that and said he expects this to be done by
>>> > > end of the week.
>>> > >
>>> > > Cheers, Fabian
>>> > >
>>> > >
>>> > > 2014-11-07 12:49 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>> > >
>>> > > > I also fixed the profile for Cloudera CDH5.1.3. You can build it with the
>>> > > > command:
>>> > > >       mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3
>>> > > >
>>> > > > However, it would be good to generate the specific jar when
>>> > > > releasing (e.g.
>>> > > > flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).
>>> > > >
>>> > > > Best,
>>> > > > Flavio
>>> > > >
>>> > > > On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > >
>>> > > > > I've just updated the code on my fork (synced with the current master and
>>> > > > > applied improvements coming from comments on the related PR).
>>> > > > > I still have to understand how to write results back to an HBase
>>> > > > > Sink/OutputFormat...
>>> > > > >
>>> > > > >
>>> > > > > On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >
>>> > > > >> Thanks for the detailed answer. So if I run a job from my machine, I'll
>>> > > > >> have to download all the scanned data in a table, right?
>>> > > > >>
>>> > > > >> Still regarding the GenericTableOutputFormat, it is not clear to me how
>>> > > > >> to proceed.
>>> > > > >> I saw in the hadoop compatibility addon that it is possible to have such
>>> > > > >> compatibility using the HBaseUtils class, so the open method should become
>>> > > > >> something like:
>>> > > > >>
>>> > > > >> @Override
>>> > > > >> public void open(int taskNumber, int numTasks) throws IOException {
>>> > > > >>     if (Integer.toString(taskNumber + 1).length() > 6) {
>>> > > > >>         throw new IOException("Task id too large.");
>>> > > > >>     }
>>> > > > >>     TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
>>> > > > >>             + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>>> > > > >>             + Integer.toString(taskNumber + 1)
>>> > > > >>             + "_0");
>>> > > > >>     this.configuration.set("mapred.task.id", taskAttemptID.toString());
>>> > > > >>     this.configuration.setInt("mapred.task.partition", taskNumber + 1);
>>> > > > >>     // for hadoop 2.2
>>> > > > >>     this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
>>> > > > >>     this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
>>> > > > >>     try {
>>> > > > >>         this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
>>> > > > >>     } catch (Exception e) {
>>> > > > >>         throw new RuntimeException(e);
>>> > > > >>     }
>>> > > > >>     final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
>>> > > > >>     try {
>>> > > > >>         this.writer = outFormat.getRecordWriter(this.context);
>>> > > > >>     } catch (InterruptedException iex) {
>>> > > > >>         throw new IOException("Opening the writer was interrupted.", iex);
>>> > > > >>     }
>>> > > > >> }
>>> > > > >>
>>> > > > >> But I'm not sure how to pass the JobConf to the class, whether to merge
>>> > > > >> config files, where HFileOutputFormat2 writes the data, and how to
>>> > > > >> implement the public void writeRecord(Record record) API.
>>> > > > >> Could I have a short chat off the mailing list with the implementor of
>>> > > > >> this extension?
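
The string passed to TaskAttemptID.forName in the open method above is a fixed prefix, the task number plus one padded with zeros to six digits, and a "_0" suffix. As a sketch, the padding alone can be written with a single format call. TaskAttemptIdSketch and attemptName are hypothetical names, and the snippet only builds the string; it does not call any Hadoop API.

```java
import java.util.Locale;

// Hypothetical helper that only builds the attempt name string.
class TaskAttemptIdSketch {

    // Builds "attempt__0000_r_<six-digit id>_0" for a zero-based task number.
    static String attemptName(int taskNumber) {
        int id = taskNumber + 1;
        if (id < 0 || id > 999_999) {
            throw new IllegalArgumentException("Task id too large.");
        }
        // %06d pads with leading zeros; Locale.ROOT keeps the digits ASCII.
        return "attempt__0000_r_" + String.format(Locale.ROOT, "%06d", id) + "_0";
    }

    public static void main(String[] args) {
        System.out.println(attemptName(0));   // attempt__0000_r_000001_0
        System.out.println(attemptName(41));  // attempt__0000_r_000042_0
    }
}
```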
>>> > > > >>
>>> > > > >> On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <fhueske@apache.org> wrote:
>>> > > > >>
>>> > > > >>> Hi Flavio
>>> > > > >>>
>>> > > > >>> let me try to answer your last question on the user's list (to the best of
>>> > > > >>> my HBase knowledge).
>>> > > > >>> "I just wanted to know if and how region splitting is handled. Can you
>>> > > > >>> explain to me in detail how Flink and HBase work? What is not fully clear to
>>> > > > >>> me is when computation is done by region servers and when data starts to flow
>>> > > > >>> to a Flink worker (which in my test job is only my pc) and how to better
>>> > > > >>> understand the important logged info to understand if my job is performing
>>> > > > >>> well"
>>> > > > >>>
>>> > > > >>> HBase partitions its tables into so-called "regions" of keys and stores
>>> > > > >>> the regions distributed in the cluster using HDFS. I think an HBase region
>>> > > > >>> can be thought of as an HDFS block. To make reading an HBase table efficient,
>>> > > > >>> region reads should be done locally, i.e., an InputFormat should primarily
>>> > > > >>> read regions that are stored on the same machine as the IF is running on.
>>> > > > >>> Flink's InputSplits partition the HBase input by regions and add
>>> > > > >>> information about the storage location of the region. During execution,
>>> > > > >>> input splits are assigned to InputFormats that can do local reads.
>>> > > > >>>
>>> > > > >>> Best, Fabian
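
The locality idea described above can be sketched roughly as follows: each split carries the host that stores its region, and a requesting worker is handed a local split when one is available. LocalSplitAssignerSketch and RegionSplit are hypothetical illustration types, not Flink's or HBase's actual classes, and the fallback to a remote split is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of locality-aware split assignment.
class LocalSplitAssignerSketch {

    // Hypothetical split: one region's key range plus the host that stores it.
    static final class RegionSplit {
        final String startKey;
        final String endKey;
        final String host;

        RegionSplit(String startKey, String endKey, String host) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }
    }

    private final List<RegionSplit> unassigned;

    LocalSplitAssignerSketch(List<RegionSplit> splits) {
        this.unassigned = new ArrayList<>(splits);
    }

    // Prefers a split stored on the requesting host; otherwise hands out
    // any remaining split, which then implies a remote read.
    Optional<RegionSplit> nextSplit(String requestingHost) {
        for (Iterator<RegionSplit> it = unassigned.iterator(); it.hasNext();) {
            RegionSplit split = it.next();
            if (split.host.equals(requestingHost)) {
                it.remove();
                return Optional.of(split);
            }
        }
        if (unassigned.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(unassigned.remove(0));
    }
}
```

In this sketch a job submitted from a single machine that hosts no regions would only ever take the fallback branch, so every read would be remote.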
>>> > > > >>>
>>> > > > >>> 2014-11-03 11:13 GMT+01:00 Stephan Ewen <sewen@apache.org>:
>>> > > > >>>
>>> > > > >>> > Hi!
>>> > > > >>> >
>>> > > > >>> > The way of passing parameters through the configuration is very old (the
>>> > > > >>> > original HBase format dates back to that time). I would simply make the
>>> > > > >>> > HBase format take those parameters through the constructor.
>>> > > > >>> >
>>> > > > >>> > Greetings,
>>> > > > >>> > Stephan
>>> > > > >>> >
>>> > > > >>> >
>>> > > > >>> > On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >>> >
>>> > > > >>> > > The problem is that I also removed the GenericTableOutputFormat because
>>> > > > >>> > > there is an incompatibility between hadoop1 and hadoop2 for the classes
>>> > > > >>> > > TaskAttemptContext and TaskAttemptContextImpl.
>>> > > > >>> > > Also, it would be nice if the user didn't have to worry about passing the
>>> > > > >>> > > pact.hbase.jtkey and pact.job.id parameters.
>>> > > > >>> > > I think it is probably a good idea to remove hadoop1 compatibility,
>>> > > > >>> > > keep the HBase addon enabled only for hadoop2 (as before), and decide how to
>>> > > > >>> > > manage those 2 parameters.
>>> > > > >>> > >
>>> > > > >>> > > On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <sewen@apache.org> wrote:
>>> > > > >>> > >
>>> > > > >>> > > > It is fine to remove it, in my opinion.
>>> > > > >>> > > >
>>> > > > >>> > > > On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >>> > > >
>>> > > > >>> > > > > That is one class I removed because it was using the deprecated API
>>> > > > >>> > > > > GenericDataSink. I can restore it, but then it would be a good idea to
>>> > > > >>> > > > > remove those warnings (also because, from what I understood, the Record APIs
>>> > > > >>> > > > > are going to be removed).
>>> > > > >>> > > > >
>>> > > > >>> > > > > On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <fhueske@apache.org> wrote:
>>> > > > >>> > > > >
>>> > > > >>> > > > > > I'm not familiar with the HBase connector code, but are you maybe looking
>>> > > > >>> > > > > > for the GenericTableOutputFormat?
>>> > > > >>> > > > > >
>>> > > > >>> > > > > > 2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>> > > > >>> > > > > >
>>> > > > >>> > > > > > > I was trying to modify the example setting hbaseDs.output(new
>>> > > > >>> > > > > > > HBaseOutputFormat()); but I can't see any HBaseOutputFormat class. Maybe we
>>> > > > >>> > > > > > > should use another class?
>>> > > > >>> > > > > > >
>>> > > > >>> > > > > > > On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >>> > > > > > >
>>> > > > >>> > > > > > > > Maybe that's something I could add to the HBase example and that could be
>>> > > > >>> > > > > > > > better documented in the Wiki.
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > > > Since we're talking about the wiki, I was looking at the Java API (
>>> > > > >>> > > > > > > > http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html )
>>> > > > >>> > > > > > > > and the link to the KMeans example is not working (where it says "For a
>>> > > > >>> > > > > > > > complete example program, have a look at KMeans Algorithm").
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > > > Best,
>>> > > > >>> > > > > > > > Flavio
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > > > On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > > >> Ah ok, perfect! That was the reason why I removed it :)
>>> > > > >>> > > > > > > >>
>>> > > > >>> > > > > > > >> On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <sewen@apache.org> wrote:
>>> > > > >>> > > > > > > >>
>>> > > > >>> > > > > > > >>> You do not really need an HBase data sink. You can call
>>> > > > >>> > > > > > > >>> "DataSet.output(new HBaseOutputFormat())"
>>> > > > >>> > > > > > > >>>
>>> > > > >>> > > > > > > >>> Stephan
>>> > > > >>> > > > > > > >>> On 02.11.2014 at 23:05, "Flavio Pompermaier" <pompermaier@okkam.it> wrote:
>>> > > > >>> > > > > > > >>>
>>> > > > >>> > > > > > > >>> > Just one last thing: I removed the HbaseDataSink because I think it was
>>> > > > >>> > > > > > > >>> > using the old APIs. Can someone help me in updating that class?
>>> > > > >>> > > > > > > >>> >
>>> > > > >>> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>> > > > >>> > > > > > > >>> >
>>> > > > >>> > > > > > > >>> > > Indeed this time the build has been successful :)
>>> > > > >>> > > > > > > >>> > >
>>> > > > >>> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <fhueske@apache.org> wrote:
>>> > > > >>> > > > > > > >>> > >
>>> > > > >>> > > > > > > >>> > >> You can also set up Travis to build your own GitHub repositories by linking
>>> > > > >>> > > > > > > >>> > >> it to your GitHub account. That way Travis can build all your branches (and
>>> > > > >>> > > > > > > >>> > >> you can also trigger rebuilds if something fails).
>>> > > > >>> > > > > > > >>> > >> Not sure if we can manually retrigger builds on the Apache
>>> > > > >>> > > > > > > >>> > >> repository.
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very good addition :-)
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >> For the discussion about the PR itself, I would need a bit more time to
>>> > > > >>> > > > > > > >>> > >> become more familiar with HBase. I also do not have an HBase setup available
>>> > > > >>> > > > > > > >>> > >> here.
>>> > > > >>> > > > > > > >>> > >> Maybe somebody else from the community who was involved with a previous
>>> > > > >>> > > > > > > >>> > >> version of the HBase connector could comment on your question.
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >> Best, Fabian
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >>
2014-11-02 9:57 GMT+01:00 Flavio
>>> Pompermaier <
>>> > > > >>> > > > > > > pompermaier@okkam.it
>>> > > > >>> > > > > > > >>> >:
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >>
> As suggestes by Fabian I moved the
>>> > discussion
>>> > > on
>>> > > > >>> this
>>> > > > >>> > > > > mailing
>>> > > > >>> > > > > > > >>> list.
>>> > > > >>> > > > > > > >>> > >>
>
>>> > > > >>> > > > > > > >>> > >>
> I think that what is still to be
>>> discussed
>>> > is
>>> > > > >>> how  to
>>> > > > >>> > > > > > retrigger
>>> > > > >>> > > > > > > >>> the
>>> > > > >>> > > > > > > >>> > >>
build
>>> > > > >>> > > > > > > >>> > >>
> on Travis (I don't have an account) and
>>> if
>>> > the
>>> > > > PR
>>> > > > >>> can
>>> > > > >>> > be
>>> > > > >>> > > > > > > >>> integrated.
>>> > > > >>> > > > > > > >>> > >>
>
>>> > > > >>> > > > > > > >>> > >>
> Maybe what I can do is to move the HBase
>>> > > example
>>> > > > >>> in
>>> > > > >>> > the
>>> > > > >>> > > > test
>>> > > > >>> > > > > > > >>> package
>>> > > > >>> > > > > > > >>> > >>
(right
>>> > > > >>> > > > > > > >>> > >>
> now I left it in the main folder) so it
>>> will
>>> > > > force
>>> > > > >>> > > Travis
>>> > > > >>> > > > to
>>> > > > >>> > > > > > > >>> rebuild.
>>> > > > >>> > > > > > > >>> > >>
> I'll do it within a couple of hours.
>>> > > > >>> > > > > > > >>> > >>
>
>>> > > > >>> > > > > > > >>> > >>
> Another thing I forgot to say is that
>>> the
>>> > > hbase
>>> > > > >>> > > extension
>>> > > > >>> > > > is
>>> > > > >>> > > > > > now
>>> > > > >>> > > > > > > >>> > >>
compatible
>>> > > > >>> > > > > > > >>> > >>
> with both hadoop 1 and 2.
>>> > > > >>> > > > > > > >>> > >>
>
>>> > > > >>> > > > > > > >>> > >>
> Best,
>>> > > > >>> > > > > > > >>> > >>
> Flavio
>>> > > > >>> > > > > > > >>> > >>
>>> > > > >>> > > > > > > >>> > >
>>> > > > >>> > > > > > > >>> >
>>> > > > >>> > > > > > > >>>
>>> > > > >>> > > > > > > >>
>>> > > > >>> > > > > > > >
>>> > > > >>> > > > > > >
>>> > > > >>> > > > > >
>>> > > > >>> > > > >
>>> > > > >>> > > >
>>> > > > >>> > >
>>> > > > >>> >
>>> > > > >>>
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>
>
