gobblin-user mailing list archives

From Abhishek Tiwari <a...@apache.org>
Subject Re: How to capture list of files touched during each gobblin run
Date Tue, 12 Sep 2017 09:21:15 GMT
There isn't a pre-defined way of doing this. However, you can do either of
the following:

- Extend the publisher to create a custom publisher that performs the extra
step of writing out this metadata
- Emit events from the writer that contain the file names, and either write
/ use a new EventReporter for the DB, or use KafkaEventReporter and run
another pipeline from Kafka to the DB
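
For the first option, the extra step of recording the touched files could be
sketched like this. This is a minimal, self-contained sketch only: it writes a
per-run manifest file to a shared filesystem instead of a DB, the `RunManifest`
class and all paths are hypothetical, and the wiring into a custom Gobblin
publisher (where you would collect the published paths) is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical helper: append the file names touched in one run to a
// per-run manifest, which the downstream pre-processing job can read.
public class RunManifest {

    // Writes (or appends to) <manifestDir>/<runId>.manifest, one file
    // name per line, and returns the manifest's path.
    public static Path write(Path manifestDir, String runId,
            List<String> files) throws IOException {
        Files.createDirectories(manifestDir);
        Path manifest = manifestDir.resolve(runId + ".manifest");
        Files.write(manifest, files,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("manifests");
        Path m = write(dir, "run-001",
                List.of("/data/ingestion/topicA/2017/09/12/09/part-0.json"));
        System.out.println(Files.readAllLines(m));
    }
}
```

A custom publisher would call something like this after publishing, passing
the run/job id and the list of published paths; the next job then reads the
manifest for the run it is interested in instead of querying a DB.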

Regards,
Abhishek

On Tue, Sep 12, 2017 at 1:44 AM, Chackravarthy Esakkimuthu <
chaku.mitcs@gmail.com> wrote:

> Hi,
>
> We are using gobblin to ingest data from Kafka to HDFS.
>
> As part of each gobblin run, we want to capture the list of files it touched
> during that particular run, store those file names (metadata) in some DB, and
> then have our next subsequent job (pre-processing) use it. How do I achieve
> this?
>
> Do I need a custom writer.builder.class, or is this supported by
> default? Can someone help?
>
> Sample job conf file used :
>
> ########
>
> job.name=GobblinKafkaHDFSJob
>
> job.group=GobblinKafka
>
> job.description=Gobblin quick start job for Kafka
>
> job.lock.enabled=false
>
> kafka.brokers=localhost:9092
>
> job.schedule=0 0/2 * * * ?
>
> topic.blacklist=__consumer_offsets
>
> source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
>
> extract.namespace=gobblin.extract.kafka
>
> writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
>
> writer.partitioner.class=com.sample.gobblin.partitioner.TimeBasedPartitioner
>
> writer.file.path.type=tablename
>
> writer.destination.type=HDFS
>
> writer.output.format=json
>
> simple.writer.delimiter=\n
>
> writer.partition.level=hourly
>
> writer.partition.pattern=YYYY/MM/dd/HH
>
> writer.partition.timezone=Asia/Kolkata
>
> data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
>
> mr.job.max.mappers=1
>
> metrics.reporting.file.enabled=true
>
> metrics.log.dir=/data/gobblin/gobblin-kafka/metrics
>
> metrics.reporting.file.suffix=txt
>
> bootstrap.with.offset=earliest
>
> fs.uri=hdfs://localhost:8020
>
> writer.fs.uri=hdfs://localhost:8020
>
> state.store.fs.uri=hdfs://localhost:8020
>
> mr.job.root.dir=/data/gobblin/gobblin-kafka/working
>
> state.store.dir=/data/gobblin/gobblin-kafka/state-store
>
> task.data.root.dir=/data/gobblin/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
>
> data.publisher.final.dir=/data/ingestion
>
> ############
>
>
> Thanks,
>
> Chackra
>
