gobblin-user mailing list archives

From Chackravarthy Esakkimuthu <chaku.mi...@gmail.com>
Subject Re: How to capture list of files touched during each gobblin run
Date Wed, 13 Sep 2017 09:26:23 GMT
Sure thanks,

I was able to write the metadata by using a custom publisher.
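
For anyone following the thread, here is a minimal sketch of the parsing step such a publisher might perform. It assumes the "data.publisher.output.dirs" value mentioned below is stored as a comma-separated string, and it uses java.util.Properties as a stand-in for Gobblin's State object so that it runs without Gobblin on the classpath. The class name PublishedDirs, the method fromState, and the sample paths are hypothetical, not part of Gobblin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper; not part of Gobblin.
final class PublishedDirs {
    // Property name quoted in this thread.
    static final String OUTPUT_DIRS_KEY = "data.publisher.output.dirs";

    // Assumption: the value is a comma-separated list of directory paths.
    // Properties stands in for Gobblin's State here.
    static List<String> fromState(Properties state) {
        List<String> dirs = new ArrayList<>();
        String raw = state.getProperty(OUTPUT_DIRS_KEY, "");
        for (String part : raw.split(",")) {
            String dir = part.trim();
            if (!dir.isEmpty()) {
                dirs.add(dir); // skip blanks from empty values or stray commas
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        Properties state = new Properties();
        // Made-up sample paths for illustration only.
        state.setProperty(OUTPUT_DIRS_KEY,
                "/data/ingestion/topicA/2017/09/13/09, /data/ingestion/topicB/2017/09/13/09");
        for (String dir : fromState(state)) {
            System.out.println(dir); // a real publisher would write this to the target DB
        }
    }
}
```

In a real custom publisher the same list would be read from the job or task state after publishing, and each entry written to the metadata store instead of printed.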

On Tue, Sep 12, 2017 at 10:18 PM, Eric Ogren <eogren@linkedin.com> wrote:

> Hi there –
>
>
>
> If you are running a fairly new version of Gobblin, the FsWriterMetrics
> class might have some of the info you are looking for. The writer will
> collect the output filenames (and some other info, like the number of
> records) and place it in state. As Abhishek mentioned, you would probably
> need to create a custom publisher to place these metrics in your target DB.
>
>
>
> Eric
>
>
>
> *From: *Chackravarthy Esakkimuthu <chaku.mitcs@gmail.com>
> *Reply-To: *"user@gobblin.incubator.apache.org" <user@gobblin.incubator.apache.org>
> *Date: *Tuesday, September 12, 2017 at 3:04 AM
> *To: *"user@gobblin.incubator.apache.org" <user@gobblin.incubator.apache.org>
> *Subject: *Re: How to capture list of files touched during each gobblin
> run
>
>
>
> Thanks Abhishek,
>
>
>
> Yes, with a custom publisher I was able to get the list of published output
> dirs, and I can put my logic there.
>
>
>
> One more clarification,
>
>
>
> I see that "data.publisher.output.dirs" holds the list of output dirs (it is
> persisted in the State object in close() of BaseDataPublisher).
>
> Is this state already stored somewhere as part of the job output?
>
>
>
> On Tue, Sep 12, 2017 at 2:51 PM, Abhishek Tiwari <abti@apache.org> wrote:
>
> There isn't a pre-defined way of doing this. However, you can do either of
> the following:
>
>
>
> - Extend the publisher to create a custom publisher that performs the extra
> step of writing out this metadata
>
> - Emit events from the writer that contain the file names, and write / use a
> new EventReporter for the DB, or use KafkaEventReporter and run another
> pipeline from Kafka to the DB
>
>
>
> Regards,
>
> Abhishek
>
>
>
> On Tue, Sep 12, 2017 at 1:44 AM, Chackravarthy Esakkimuthu <
> chaku.mitcs@gmail.com> wrote:
>
> Hi,
>
>
>
> We are using gobblin to ingest data from Kafka to HDFS.
>
>
>
> As part of each gobblin run, we want to capture the list of files it touched
> during that particular run and store those file names (metadata) in some DB,
> so that our subsequent job (pre-processing) can use them. How do I achieve
> this?
>
>
>
> Do I need a custom class for builder.class, or is this supported by
> default? Can someone help?
>
>
>
> Sample job conf file used:
>
>
>
> ########
>
> job.name=GobblinKafkaHDFSJob
>
> job.group=GobblinKafka
>
> job.description=Gobblin quick start job for Kafka
>
> job.lock.enabled=false
>
> kafka.brokers=localhost:9092
>
> job.schedule=0 0/2 * * * ?
>
> topic.blacklist=__consumer_offsets
>
> source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
>
> extract.namespace=gobblin.extract.kafka
>
> writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
>
> writer.partitioner.class=com.sample.gobblin.partitioner.TimeBasedPartitioner
>
> writer.file.path.type=tablename
>
> writer.destination.type=HDFS
>
> writer.output.format=json
>
> simple.writer.delimiter=\n
>
> writer.partition.level=hourly
>
> writer.partition.pattern=YYYY/MM/dd/HH
>
> writer.partition.timezone=Asia/Kolkata
>
> data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
>
> mr.job.max.mappers=1
>
> metrics.reporting.file.enabled=true
>
> metrics.log.dir=/data/gobblin/gobblin-kafka/metrics
>
> metrics.reporting.file.suffix=txt
>
> bootstrap.with.offset=earliest
>
> fs.uri=hdfs://localhost:8020
>
> writer.fs.uri=hdfs://localhost:8020
>
> state.store.fs.uri=hdfs://localhost:8020
>
> mr.job.root.dir=/data/gobblin/gobblin-kafka/working
>
> state.store.dir=/data/gobblin/gobblin-kafka/state-store
>
> task.data.root.dir=/data/gobblin/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
>
> data.publisher.final.dir=/data/ingestion
>
> ############
>
>
>
> Thanks,
>
> Chackra
>
>
>
>
>
