From: Chackravarthy Esakkimuthu <chaku.mitcs@gmail.com>
Date: Wed, 13 Sep 2017 14:56:23 +0530
Subject: Re: How to capture list of files touched during each gobblin run
To: user@gobblin.incubator.apache.org

Sure, thanks. I was able to write the metadata by using a custom publisher.

On Tue, Sep 12, 2017 at 10:18 PM, Eric Ogren <eogren@linkedin.com> wrote:

> Hi there –
>
> If you are running a fairly new version of Gobblin, the FsWriterMetrics
> class might have some of the info you are looking for. The writer will
> collect the output filenames (and some other info, such as the number of
> records) and place that info in state. As Abhishek mentioned, you would
> probably need to create a custom publisher to place these metrics in your
> target DB.
>
> Eric
>
> From: Chackravarthy Esakkimuthu <chaku.mitcs@gmail.com>
> Reply-To: "user@gobblin.incubator.apache.org" <user@gobblin.incubator.apache.org>
> Date: Tuesday, September 12, 2017 at 3:04 AM
> To: "user@gobblin.incubator.apache.org" <user@gobblin.incubator.apache.org>
> Subject: Re: How to capture list of files touched during each gobblin run
>
> Thanks Abhishek,
>
> Yes, with a custom publisher I was able to get the list of published
> output dirs, and I can add my logic there.
>
> One more clarification:
>
> I see "data.publisher.output.dirs" holding the list of output dirs (it
> gets persisted in the State object) in close() of BaseDataPublisher.
> Is this state already stored somewhere as part of the job output?
>
> On Tue, Sep 12, 2017 at 2:51 PM, Abhishek Tiwari <abti@apache.org> wrote:
>
> There isn't a pre-defined way of doing this. However, you can do either
> of the following:
>
> - Extend the publisher to create a custom publisher and perform the
>   extra step of writing out this metadata
> - Emit events from the writer that contain the file names, and either
>   write a new EventReporter for your DB or use the KafkaEventReporter and
>   run another pipeline from Kafka to the DB
>
> Regards,
> Abhishek
>
> On Tue, Sep 12, 2017 at 1:44 AM, Chackravarthy Esakkimuthu
> <chaku.mitcs@gmail.com> wrote:
>
> Hi,
>
> We are using Gobblin to ingest data from Kafka to HDFS.
>
> As part of each Gobblin run, we want to capture the list of files touched
> during that particular run and store those file names (metadata) in a DB,
> so that the next downstream job (pre-processing) can use it. How do I
> achieve this?
>
> Do I need to provide a custom builder.class, or is this supported by
> default? Can someone help?
> Sample job conf file used:
>
> ########
> job.name=GobblinKafkaHDFSJob
> job.group=GobblinKafka
> job.description=Gobblin quick start job for Kafka
> job.lock.enabled=false
> kafka.brokers=localhost:9092
> job.schedule=0 0/2 * * * ?
> topic.blacklist=__consumer_offsets
> source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
> extract.namespace=gobblin.extract.kafka
> writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
> writer.partitioner.class=com.sample.gobblin.partitioner.TimeBasedPartitioner
> writer.file.path.type=tablename
> writer.destination.type=HDFS
> writer.output.format=json
> simple.writer.delimiter=\n
> writer.partition.level=hourly
> writer.partition.pattern=YYYY/MM/dd/HH
> writer.partition.timezone=Asia/Kolkata
> data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
> mr.job.max.mappers=1
> metrics.reporting.file.enabled=true
> metrics.log.dir=/data/gobblin/gobblin-kafka/metrics
> metrics.reporting.file.suffix=txt
> bootstrap.with.offset=earliest
> fs.uri=hdfs://localhost:8020
> writer.fs.uri=hdfs://localhost:8020
> state.store.fs.uri=hdfs://localhost:8020
> mr.job.root.dir=/data/gobblin/gobblin-kafka/working
> state.store.dir=/data/gobblin/gobblin-kafka/state-store
> task.data.root.dir=/data/gobblin/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
> data.publisher.final.dir=/data/ingestion
> ############
>
> Thanks,
> Chackra
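For reference, below is a minimal sketch of the custom-publisher approach
suggested in this thread. It uses the pre-Apache "gobblin.*" packages that the
config above uses. The class name FileListRecordingPublisher and the
recordInDb() hook are illustrative, not part of Gobblin, and whether the
"data.publisher.output.dirs" property mentioned above ends up on the per-task
WorkUnitState or only on the job-level state should be verified against the
Gobblin version you run.

package com.sample.gobblin.publisher;

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import gobblin.configuration.State;
import gobblin.configuration.WorkUnitState;
import gobblin.publisher.TimePartitionedDataPublisher;

/**
 * Illustrative publisher: publishes data exactly like the parent class and
 * then records the output directories touched in this run so a downstream
 * (pre-processing) job can read them.
 */
public class FileListRecordingPublisher extends TimePartitionedDataPublisher {

  public FileListRecordingPublisher(State state) throws IOException {
    super(state);
  }

  @Override
  public void publishData(Collection<? extends WorkUnitState> states) throws IOException {
    // Let the parent publisher move task output into the final
    // (time-partitioned) directories first.
    super.publishData(states);

    // Collect whatever output locations each task recorded in its state.
    // Verify against your Gobblin version where this property is populated.
    Set<String> outputDirs = new HashSet<String>();
    for (WorkUnitState wus : states) {
      String dirs = wus.getProp("data.publisher.output.dirs");
      if (dirs != null) {
        outputDirs.add(dirs);
      }
    }

    recordInDb(outputDirs);
  }

  // Hypothetical hook: replace with a JDBC/REST call to your metadata store,
  // e.g. insert one row per run with the serialized list of dirs/files.
  private void recordInDb(Set<String> outputDirs) {
  }
}

If something along these lines is used, the only change to the job config
above would be pointing data.publisher.type at the custom class, e.g.:

data.publisher.type=com.sample.gobblin.publisher.FileListRecordingPublisher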