gobblin-user mailing list archives

From Dominique De Vito <ddv36...@gmail.com>
Subject Re: few short questions to better understand Gobblin scope
Date Wed, 24 Jan 2018 23:54:50 GMT
Thanks Abhishek


>Slight correction:

>Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>


>i.e. Workunits are independent of each other and are a division of the overall work.
>They are not steps of the process.

>Each workunit executes the following steps:

>extractor => converter => quality checker => fork operator => writer

OK, so I understand

Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>

as the following:

Gobblin = <source> => <WorkUnit_1> => <target>
          in parallel with <source> => <WorkUnit_2> => <target>
          in parallel with ....

===> Please correct me if I am wrong.
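
To make that reading concrete, here is a minimal sketch in Python. All names in it are hypothetical stand-ins, not Gobblin's API, and the thread pool only illustrates that work units are independent of each other, not how Gobblin actually schedules them:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for illustration only; none of this is Gobblin code.


def make_work_units(records, size):
    """Source: divide the overall work into independent work units."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_work_unit(work_unit):
    """One work unit runs every step, in order:
    extractor => converter => quality checker => fork operator => writer.
    """
    written = []
    for record in work_unit:                     # extractor: read the records
        converted = record.strip().upper()       # converter: transform a record
        if not converted:                        # quality checker: drop empty records
            continue
        for branch in ("branch_0",):             # fork operator: a single branch here
            written.append((branch, converted))  # writer: collect the output
    return written


def run_job(records, size=2, workers=4):
    """Run all work units in parallel and gather what each one wrote."""
    work_units = make_work_units(records, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_work_unit, work_units))
    return [item for unit_output in results for item in unit_output]
```

Each call to run_work_unit goes through the whole chain of steps for its own slice of the data, which is how I read "<source> => <WorkUnit_i> => <target>".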

IMHO, a short ingestion path (like "<source> => <WorkUnit> => <target>")
makes more sense.

The sooner the data is available in (let's say) HDFS, the sooner the next
processing step, itself parallel, can start (for example in Spark).

That looks like a nice fit for Gobblin (AFAIU).

Thanks.

Dominique



2018-01-24 23:48 GMT+01:00 Abhishek Tiwari <abti@apache.org>:

> Hi Dominique,
>
> Please find my answers inline.
>
> On Wed, Jan 24, 2018 at 7:35 AM, Dominique De Vito <ddv36a78@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have dug a bit into the Gobblin web site. Below are a few
>> questions, based on the way I understand Gobblin so far:
>>
>>
>> 1) Gobblin seems to use only configuration files => no GUI so far ?
>>
> We have a UI (gobblin-admin), but it does not let you configure jobs;
> it only shows running jobs and their status / history.
>
>>
>> 2) Gobblin seems to support the workflow concept as the following:
>>
>> -- Gobblin = <source> => <WorkUnit_1> => ..... => <WorkUnit_N> => <target>
>>
>>
>> One picture (in the "Architecture" page) shows this WorkUnit line (inside
>> a task) :
>>
>>                 extractor => converter => quality checker => fork operator
>>
>> Is there an order to respect (one kind of WorkUnit, say "converter"
>> __strictly__ after another kind of WorkUnit, say "quality checker") ?
>>
>> Or was this order only crafted for presentation (in the
>> "Architecture" page), with no strict order between WorkUnit kinds
>> (for example, one may imagine a converter after a quality checker, and
>> not only before it, as shown just above) ?
>>
>>
> Slight correction:
> Gobblin = <source> => <WorkUnit_1>, <WorkUnit_2>, .... => <target>
>
> i.e. Workunits are independent of each other and are a division of the overall work.
> They are not steps of the process.
> Each workunit executes the following steps:
> extractor => converter => quality checker => fork operator => writer
>
>
>
>> Thanks
>>
>> Regards,
>> Dominique
>>
>>
>>
>>
>
