flink-dev mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: HBase 0.98 addon for Flink 0.8
Date Mon, 03 Nov 2014 11:05:34 GMT
Thanks for the detailed answer. So if I run a job from my machine, I'll have
to download all the scanned data of the table, right?

Still regarding the GenericTableOutputFormat, it is not clear to me how to
proceed.
I saw in the hadoop compatibility addon that it is possible to get such
compatibility using the HadoopUtils class, so the open method should become
something like:

@Override
public void open(int taskNumber, int numTasks) throws IOException {
    if (Integer.toString(taskNumber + 1).length() > 6) {
        throw new IOException("Task id too large.");
    }
    // Build a synthetic Hadoop task attempt id with a zero-padded,
    // 6-digit task number.
    TaskAttemptID taskAttemptID = TaskAttemptID.forName(
            "attempt__0000_r_" + String.format("%06d", taskNumber + 1) + "_0");

    // Set the task id under both the hadoop1 (mapred.*) and the
    // hadoop2 (mapreduce.*) property names.
    this.configuration.set("mapred.task.id", taskAttemptID.toString());
    this.configuration.setInt("mapred.task.partition", taskNumber + 1);
    this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
    this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);

    try {
        this.context = HadoopUtils.instantiateTaskAttemptContext(
                this.configuration, taskAttemptID);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }

    final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
    try {
        this.writer = outFormat.getRecordWriter(this.context);
    } catch (InterruptedException iex) {
        throw new IOException("Opening the writer was interrupted.", iex);
    }
}

But I'm not sure about how to pass the JobConf to the class, whether to
merge config files, where HFileOutputFormat2 writes the data, and how to
implement the public void writeRecord(Record record) API.
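Just to make my question concrete, this is roughly what I imagine for
writeRecord (only a sketch on my side; the field layout and the FAMILY and
QUALIFIER byte[] constants are assumptions, not existing code):

@Override
public void writeRecord(Record record) throws IOException {
    // Assumed layout: field 0 = row key, field 1 = cell value, with
    // FAMILY and QUALIFIER defined as byte[] constants elsewhere.
    String rowKey = record.getField(0, StringValue.class).getValue();
    String value = record.getField(1, StringValue.class).getValue();
    KeyValue kv = new KeyValue(Bytes.toBytes(rowKey), FAMILY, QUALIFIER,
            Bytes.toBytes(value));
    try {
        this.writer.write(new ImmutableBytesWritable(kv.getRow()), kv);
    } catch (InterruptedException iex) {
        throw new IOException("Writing the record was interrupted.", iex);
    }
}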
Could I have a little chat off the mailing list with the implementor of
this extension?

On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <fhueske@apache.org> wrote:

> Hi Flavio
>
> let me try to answer your last question on the user's list (to the best of
> my HBase knowledge).
> "I just wanted to known if and how regiom splitting is handled. Can you
> explain me in detail how Flink and HBase works?what is not fully clear to
> me is when computation is done by region servers and when data start flow
> to a Flink worker (that in ky test job is only my pc) and how ro undertsand
> better the important logged info to understand if my job is performing
> well"
>
> HBase partitions its tables into so-called "regions" of keys and stores the
> regions distributed across the cluster using HDFS. I think an HBase region
> can be thought of as an HDFS block. To make reading an HBase table
> efficient, region reads should be done locally, i.e., an InputFormat should
> primarily read regions that are stored on the same machine it is running on.
> Flink's InputSplits partition the HBase input by regions and add
> information about the storage location of the region. During execution,
> input splits are assigned to InputFormats that can do local reads.
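>
> To illustrate the idea, split creation could look roughly like this (just a
> sketch, not the actual connector code; it assumes an HTable handle and a
> TableInputSplit type that carries the hostnames of its region):
>
> public TableInputSplit[] createInputSplits(int minNumSplits) throws IOException {
>     // One split per region; each split remembers the host of its region
>     // server so that Flink can assign it to a local InputFormat instance.
>     Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
>     List<TableInputSplit> splits = new ArrayList<TableInputSplit>();
>     for (int i = 0; i < keys.getFirst().length; i++) {
>         byte[] startKey = keys.getFirst()[i];
>         byte[] endKey = keys.getSecond()[i];
>         String host = table.getRegionLocation(startKey).getHostname();
>         splits.add(new TableInputSplit(i, new String[] { host },
>                 table.getTableName(), startKey, endKey));
>     }
>     return splits.toArray(new TableInputSplit[splits.size()]);
> }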
>
> Best, Fabian
>
> 2014-11-03 11:13 GMT+01:00 Stephan Ewen <sewen@apache.org>:
>
> > Hi!
> >
> > The way of passing parameters through the configuration is very old (the
> > original HBase format dates back to that time). I would simply make the
> > HBase format take those parameters through the constructor.
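> >
> > Something along these lines (a rough sketch with illustrative names, not
> > code from the repository):
> >
> > public class HBaseOutputFormat implements OutputFormat<Record> {
> >
> >     private final String tableName;
> >     private final byte[] family;
> >     private transient HTable table;
> >
> >     // All HBase settings arrive via the constructor, so the user never
> >     // has to set pact.hbase.* keys in the Configuration.
> >     public HBaseOutputFormat(String tableName, String family) {
> >         this.tableName = tableName;
> >         this.family = Bytes.toBytes(family);
> >     }
> >
> >     @Override
> >     public void configure(Configuration parameters) {
> >         // nothing to read from the Flink Configuration anymore
> >     }
> >
> >     @Override
> >     public void open(int taskNumber, int numTasks) throws IOException {
> >         this.table = new HTable(HBaseConfiguration.create(), tableName);
> >     }
> >
> >     @Override
> >     public void writeRecord(Record record) throws IOException {
> >         // build a Put from the record and call table.put(...) here
> >     }
> >
> >     @Override
> >     public void close() throws IOException {
> >         this.table.close();
> >     }
> > }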
> >
> > Greetings,
> > Stephan
> >
> >
> > On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> >
> > > The problem is that I also removed the GenericTableOutputFormat, because
> > > there is an incompatibility between hadoop1 and hadoop2 for the classes
> > > TaskAttemptContext and TaskAttemptContextImpl..
> > > It would also be nice if the user didn't have to worry about passing the
> > > pact.hbase.jtkey and pact.job.id parameters..
> > > I think it is probably a good idea to drop hadoop1 compatibility, keep
> > > the HBase addon enabled only for hadoop2 (as before), and decide how to
> > > manage those 2 parameters..
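> > >
> > > For reference, a version-agnostic way to instantiate the context
> > > (similar in spirit to what HadoopUtils does, but only a sketch) could
> > > use reflection:
> > >
> > > public static TaskAttemptContext instantiateTaskAttemptContext(
> > >         Configuration conf, TaskAttemptID id) throws Exception {
> > >     Class<?> clazz;
> > >     try {
> > >         // hadoop2: TaskAttemptContext is an interface, implemented
> > >         // by TaskAttemptContextImpl
> > >         clazz = Class.forName(
> > >                 "org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
> > >     } catch (ClassNotFoundException e) {
> > >         // hadoop1: TaskAttemptContext is a concrete class
> > >         clazz = Class.forName(
> > >                 "org.apache.hadoop.mapreduce.TaskAttemptContext");
> > >     }
> > >     return (TaskAttemptContext) clazz
> > >             .getConstructor(Configuration.class, TaskAttemptID.class)
> > >             .newInstance(conf, id);
> > > }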
> > >
> > > On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <sewen@apache.org> wrote:
> > >
> > > > It is fine to remove it, in my opinion.
> > > >
> > > > On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > > >
> > > > > That is one class I removed because it was using the deprecated
> > > > > GenericDataSink API. I can restore it, but then it would be a good
> > > > > idea to remove those warnings (also because, from what I understood,
> > > > > the Record APIs are going to be removed).
> > > > >
> > > > > On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <fhueske@apache.org> wrote:
> > > > >
> > > > > > I'm not familiar with the HBase connector code, but are you maybe
> > > > > > looking for the GenericTableOutputFormat?
> > > > > >
> > > > > > 2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
> > > > > >
> > > > > > > I was trying to modify the example by setting hbaseDs.output(new
> > > > > > > HBaseOutputFormat()); but I can't see any HBaseOutputFormat
> > > > > > > class..maybe we should use another class?
> > > > > > >
> > > > > > > On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > > > > > >
> > > > > > > > Maybe that's something I could add to the HBase example and
> > > > > > > > that could be better documented in the Wiki.
> > > > > > > >
> > > > > > > > Since we're talking about the wiki..I was looking at the Java API
> > > > > > > > (http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html)
> > > > > > > > and the link to the KMeans example is not working (where it says "For a
> > > > > > > > complete example program, have a look at KMeans Algorithm").
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Flavio
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > > > > > > >
> > > > > > > >> Ah ok, perfect! That was the reason why I removed it :)
> > > > > > > >>
> > > > > > > >> On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <sewen@apache.org> wrote:
> > > > > > > >>
> > > > > > > >>> You do not really need an HBase data sink. You can call
> > > > > > > >>> "DataSet.output(new HBaseOutputFormat())".
> > > > > > > >>>
> > > > > > > >>> Stephan
> > > > > > > >>> On 02.11.2014 23:05, "Flavio Pompermaier" <pompermaier@okkam.it> wrote:
> > > > > > > >>>
> > > > > > > >>> > Just one last thing..I removed the HbaseDataSink because I
> > > > > > > >>> > think it was using the old APIs..can someone help me in
> > > > > > > >>> > updating that class?
> > > > > > > >>> >
> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Indeed this time the build has been successful :)
> > > > > > > >>> > >
> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <fhueske@apache.org> wrote:
> > > > > > > >>> > >
> > > > > > > >>> > >> You can also set up Travis to build your own GitHub
> > > > > > > >>> > >> repositories by linking it to your GitHub account. That
> > > > > > > >>> > >> way Travis can build all your branches (and you can also
> > > > > > > >>> > >> trigger rebuilds if something fails). Not sure if we can
> > > > > > > >>> > >> manually retrigger builds on the Apache repository.
> > > > > > > >>> > >>
> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very good addition :-)
> > > > > > > >>> > >>
> > > > > > > >>> > >> For the discussion about the PR itself, I would need a
> > > > > > > >>> > >> bit more time to become more familiar with HBase. I also
> > > > > > > >>> > >> do not have an HBase setup available here.
> > > > > > > >>> > >> Maybe somebody else from the community who was involved
> > > > > > > >>> > >> with a previous version of the HBase connector could
> > > > > > > >>> > >> comment on your question.
> > > > > > > >>> > >>
> > > > > > > >>> > >> Best, Fabian
> > > > > > > >>> > >>
> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
> > > > > > > >>> > >>
> > > > > > > >>> > >> > As suggested by Fabian, I moved the discussion to this
> > > > > > > >>> > >> > mailing list.
> > > > > > > >>> > >> >
> > > > > > > >>> > >> > I think that what is still to be discussed is how to
> > > > > > > >>> > >> > retrigger the build on Travis (I don't have an account)
> > > > > > > >>> > >> > and whether the PR can be integrated.
> > > > > > > >>> > >> >
> > > > > > > >>> > >> > Maybe what I can do is to move the HBase example into
> > > > > > > >>> > >> > the test package (right now I left it in the main
> > > > > > > >>> > >> > folder) so it will force Travis to rebuild. I'll do it
> > > > > > > >>> > >> > within a couple of hours.
> > > > > > > >>> > >> >
> > > > > > > >>> > >> > Another thing I forgot to say is that the hbase
> > > > > > > >>> > >> > extension is now compatible with both hadoop 1 and 2.
> > > > > > > >>> > >> >
> > > > > > > >>> > >> > Best,
> > > > > > > >>> > >> > Flavio
> > > > > > > >>> > >>
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
