flink-dev mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: HBase 0.98 addon for Flink 0.8
Date Fri, 07 Nov 2014 11:44:38 GMT
I've just updated the code on my fork (synced with the current master and
applied the improvements coming from the comments on the related PR).
I still have to understand how to write results back to an HBase
Sink/OutputFormat...
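
For illustration, here is a minimal sketch of what such an OutputFormat could look like, writing each record as an HBase Put through the client API rather than through HFileOutputFormat2. The class name, table name, column family/qualifier, and tuple layout are assumptions made for the sketch, not the addon's actual code:

import java.io.IOException;

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch: writes (rowKey, value) tuples to HBase as Puts.
public class HBasePutOutputFormat implements OutputFormat<Tuple2<String, String>> {

    private transient HTable table;

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure up front; the connection is opened per parallel task
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        // each parallel task opens its own connection to the region servers;
        // "test-table" is a placeholder name
        table = new HTable(HBaseConfiguration.create(), "test-table");
        table.setAutoFlush(false); // buffer puts client-side for throughput
    }

    @Override
    public void writeRecord(Tuple2<String, String> record) throws IOException {
        Put put = new Put(Bytes.toBytes(record.f0)); // f0 = row key
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(record.f1));
        table.put(put);
    }

    @Override
    public void close() throws IOException {
        table.flushCommits(); // push any buffered puts before shutting down
        table.close();
    }
}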

On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <pompermaier@okkam.it>
wrote:

> Thanks for the detailed answer. So if I run a job from my machine, I'll
> have to download all the scanned data of a table, right?
>
> Still regarding the GenericTableOutputFormat, it is not clear to me how to
> proceed.
> I saw in the Hadoop compatibility addon that it is possible to achieve such
> compatibility using the HadoopUtils class, so the open method should become
> something like:
>
> @Override
> public void open(int taskNumber, int numTasks) throws IOException {
>     // the attempt id embeds the 1-based task number, zero-padded to six digits
>     if (Integer.toString(taskNumber + 1).length() > 6) {
>         throw new IOException("Task id too large.");
>     }
>     TaskAttemptID taskAttemptID = TaskAttemptID.forName(
>             String.format("attempt__0000_r_%06d_0", taskNumber + 1));
>
>     this.configuration.set("mapred.task.id", taskAttemptID.toString());
>     this.configuration.setInt("mapred.task.partition", taskNumber + 1);
>     // for hadoop 2.2
>     this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
>     this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
>
>     try {
>         this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
>     } catch (Exception e) {
>         throw new RuntimeException(e);
>     }
>
>     final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
>     try {
>         this.writer = outFormat.getRecordWriter(this.context);
>     } catch (InterruptedException iex) {
>         throw new IOException("Opening the writer was interrupted.", iex);
>     }
> }
>
> But I'm not sure how to pass the JobConf to the class, whether to merge
> config files, where HFileOutputFormat2 writes the data, and how to
> implement the public void writeRecord(Record record) API.
> Could I have a little chat off the mailing list with the implementer of this
> extension?
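
For the writeRecord question, a minimal sketch, under the assumption that this.writer is the RecordWriter obtained from HFileOutputFormat2 above and that the Record carries the row key in field 0 and the value in field 1 (the field positions, column family, and qualifier are invented):

// Hypothetical sketch; assumes imports of org.apache.flink.types.Record,
// org.apache.flink.types.StringValue, org.apache.hadoop.hbase.KeyValue,
// org.apache.hadoop.hbase.io.ImmutableBytesWritable and
// org.apache.hadoop.hbase.util.Bytes.
@Override
public void writeRecord(Record record) throws IOException {
    // extract row key and value from the Record (field positions assumed)
    byte[] rowKey = Bytes.toBytes(record.getField(0, StringValue.class).getValue());
    byte[] value = Bytes.toBytes(record.getField(1, StringValue.class).getValue());

    // HFileOutputFormat2 consumes (ImmutableBytesWritable, Cell) pairs,
    // and a KeyValue is a Cell ("cf"/"col" are placeholder family/qualifier)
    KeyValue kv = new KeyValue(rowKey, Bytes.toBytes("cf"), Bytes.toBytes("col"), value);
    try {
        this.writer.write(new ImmutableBytesWritable(rowKey), kv);
    } catch (InterruptedException iex) {
        throw new IOException("Writing the record was interrupted.", iex);
    }
}

One caveat: HFileOutputFormat2 expects records in sorted row-key order, so the DataSet would have to be sorted on the row key before reaching this format.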
>
> On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <fhueske@apache.org> wrote:
>
>> Hi Flavio,
>>
>> let me try to answer your last question from the user's list (to the best of
>> my HBase knowledge):
>> "I just wanted to know if and how region splitting is handled. Can you
>> explain to me in detail how Flink and HBase work together? What is not fully
>> clear to me is when computation is done by the region servers and when data
>> starts flowing to a Flink worker (which in my test job is only my PC), and
>> how to read the important logged info to understand if my job is performing
>> well."
>>
>> HBase partitions its tables into so-called "regions" of keys and stores the
>> regions distributed across the cluster using HDFS. I think an HBase region
>> can be thought of as an HDFS block. To make reading an HBase table efficient,
>> regions should be read locally, i.e., an InputFormat should primarily read
>> regions that are stored on the same machine the InputFormat is running on.
>> Flink's InputSplits partition the HBase input by regions and add information
>> about the storage location of each region. During execution, input splits
>> are assigned to InputFormats that can do local reads.
>>
>> Best, Fabian
>>
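
To illustrate the locality part of Fabian's answer, here is a sketch of what an input split carrying region locality could look like. The class name and fields are illustrative, not necessarily the addon's actual split class:

import org.apache.flink.core.io.LocatableInputSplit;

// Illustrative only: a split that remembers which hosts serve the HBase
// region, so the scheduler can assign the reading task to one of them.
public class RegionInputSplit extends LocatableInputSplit {

    private final byte[] startRow; // first row key of the region
    private final byte[] endRow;   // row key where the region ends (exclusive)

    public RegionInputSplit(int splitNumber, String[] regionHosts,
                            byte[] startRow, byte[] endRow) {
        super(splitNumber, regionHosts); // the hostnames drive local assignment
        this.startRow = startRow;
        this.endRow = endRow;
    }

    public byte[] getStartRow() {
        return startRow;
    }

    public byte[] getEndRow() {
        return endRow;
    }
}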
>> 2014-11-03 11:13 GMT+01:00 Stephan Ewen <sewen@apache.org>:
>>
>> > Hi!
>> >
>> > The way of passing parameters through the configuration is very old (the
>> > original HBase format dates back to that time). I would simply make the
>> > HBase format take those parameters through the constructor.
>> >
>> > Greetings,
>> > Stephan
>> >
>> >
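
A minimal sketch of Stephan's suggestion; all names are illustrative and the addon's base class is only hinted at:

// Illustrative only: the scan parameters travel with the (serializable)
// input format itself instead of being read from Configuration keys.
public class HBaseScanInputFormat /* extends the addon's table input format */ {

    private final String tableName;
    private final byte[] startRow;
    private final byte[] endRow;

    public HBaseScanInputFormat(String tableName, byte[] startRow, byte[] endRow) {
        this.tableName = tableName;
        this.startRow = startRow;
        this.endRow = endRow;
    }

    // configure()/open() can then read these fields directly, and the user
    // writes env.createInput(new HBaseScanInputFormat("mytable", from, to))
    // without touching any pact.hbase.* configuration entries.
}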
>> > On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> >
>> > > The problem is that I also removed the GenericTableOutputFormat because
>> > > there is an incompatibility between hadoop1 and hadoop2 for the classes
>> > > TaskAttemptContext and TaskAttemptContextImpl.
>> > > Then it would be nice if the user didn't have to worry about passing the
>> > > pact.hbase.jtkey and pact.job.id parameters.
>> > > I think it is probably a good idea to drop hadoop1 compatibility, keep the
>> > > HBase addon enabled only for hadoop2 (as before), and decide how to manage
>> > > those 2 parameters.
>> > >
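
On the hadoop1/hadoop2 incompatibility mentioned above: a version-agnostic helper can bridge it with reflection, since TaskAttemptContext is a concrete class on Hadoop 1 but an interface (implemented by TaskAttemptContextImpl) on Hadoop 2. A sketch of the idea, not necessarily how HadoopUtils actually does it:

import java.lang.reflect.Constructor;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

public static TaskAttemptContext instantiateTaskAttemptContext(
        Configuration conf, TaskAttemptID id) throws Exception {
    Class<?> clazz;
    try {
        // Hadoop 2: the instantiable implementation lives in its own class
        clazz = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
    } catch (ClassNotFoundException e) {
        // Hadoop 1: TaskAttemptContext itself is a concrete class
        clazz = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
    }
    Constructor<?> ctor = clazz.getConstructor(Configuration.class, TaskAttemptID.class);
    return (TaskAttemptContext) ctor.newInstance(conf, id);
}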
>> > On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <sewen@apache.org> wrote:
>> > >
>> > > > It is fine to remove it, in my opinion.
>> > > >
>> > > On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> > > >
>> > > > > That is one class I removed because it was using the deprecated
>> > > > > GenericDataSink API. I can restore it, but then it would be a good idea
>> > > > > to remove those warnings (also because, from what I understood, the
>> > > > > Record APIs are going to be removed).
>> > > > >
>> > > > > On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <fhueske@apache.org> wrote:
>> > > > >
>> > > > > > I'm not familiar with the HBase connector code, but are you maybe
>> > > > > > looking for the GenericTableOutputFormat?
>> > > > > >
>> > > > > > 2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>> > > > > >
>> > > > > > > I was trying to modify the example setting hbaseDs.output(new
>> > > > > > > HBaseOutputFormat()); but I can't see any HBaseOutputFormat class...
>> > > > > > > maybe we should use another class?
>> > > > > > >
>> > > > > > > On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> > > > > > >
>> > > > > > > > Maybe that's something I could add to the HBase example and that
>> > > > > > > > could be better documented in the wiki.
>> > > > > > > >
>> > > > > > > > Since we're talking about the wiki... I was looking at the Java API (
>> > > > > > > > http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html)
>> > > > > > > > and the link to the KMeans example is not working (where it says "For
>> > > > > > > > a complete example program, have a look at KMeans Algorithm").
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Flavio
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> > > > > > > >
>> > > > > > > >> Ah ok, perfect! That was the reason why I removed it :)
>> > > > > > > >>
>> > > > > > > >> On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <sewen@apache.org> wrote:
>> > > > > > > >>
>> > > > > > > >>> You do not really need an HBase data sink. You can call
>> > > > > > > >>> "DataSet.output(new HBaseOutputFormat())".
>> > > > > > > >>>
>> > > > > > > >>> Stephan
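
As a usage sketch of Stephan's point, assuming an output format along the lines of the HBasePutOutputFormat sketched near the top of this page (the HBaseOutputFormat class did not exist at the time of this thread):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

// inside a main(String[] args) throws Exception
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// any DataSet of (rowKey, value) pairs can be written without a dedicated sink class
env.fromElements(new Tuple2<String, String>("row1", "value1"),
                 new Tuple2<String, String>("row2", "value2"))
   .output(new HBasePutOutputFormat());

env.execute("Write to HBase");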
>> > > > > > > >>> On 02.11.2014 23:05, "Flavio Pompermaier" <pompermaier@okkam.it> wrote:
>> > > > > > > >>>
>> > > > > > > >>> > Just one last thing... I removed the HbaseDataSink because I
>> > > > > > > >>> > think it was using the old APIs. Can someone help me update
>> > > > > > > >>> > that class?
>> > > > > > > >>> >
>> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> > > > > > > >>> >
>> > > > > > > >>> > > Indeed this time the build has been successful :)
>> > > > > > > >>> > >
>> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <fhueske@apache.org> wrote:
>> > > > > > > >>> > >
>> > > > > > > >>> > >> You can also set up Travis to build your own GitHub repositories by
>> > > > > > > >>> > >> linking it to your GitHub account. That way Travis can build all your
>> > > > > > > >>> > >> branches (and you can also trigger rebuilds if something fails).
>> > > > > > > >>> > >> Not sure if we can manually retrigger builds on the Apache repository.
>> > > > > > > >>> > >>
>> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very good addition :-)
>> > > > > > > >>> > >>
>> > > > > > > >>> > >> For the discussion about the PR itself, I would need a bit more time
>> > > > > > > >>> > >> to become more familiar with HBase. I also do not have an HBase setup
>> > > > > > > >>> > >> available here. Maybe somebody else from the community who was
>> > > > > > > >>> > >> involved with a previous version of the HBase connector could comment
>> > > > > > > >>> > >> on your question.
>> > > > > > > >>> > >>
>> > > > > > > >>> > >> Best, Fabian
>> > > > > > > >>> > >>
>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>> > > > > > > >>> > >>
>> > > > > > > >>> > >> > As suggested by Fabian, I moved the discussion to this mailing list.
>> > > > > > > >>> > >> >
>> > > > > > > >>> > >> > I think that what is still to be discussed is how to retrigger the
>> > > > > > > >>> > >> > build on Travis (I don't have an account) and whether the PR can be
>> > > > > > > >>> > >> > integrated.
>> > > > > > > >>> > >> >
>> > > > > > > >>> > >> > Maybe what I can do is move the HBase example into the test package
>> > > > > > > >>> > >> > (right now I left it in the main folder) so it will force Travis to
>> > > > > > > >>> > >> > rebuild. I'll do it within a couple of hours.
>> > > > > > > >>> > >> >
>> > > > > > > >>> > >> > Another thing I forgot to say is that the HBase extension is now
>> > > > > > > >>> > >> > compatible with both Hadoop 1 and 2.
>> > > > > > > >>> > >> >
>> > > > > > > >>> > >> > Best,
>> > > > > > > >>> > >> > Flavio
