incubator-chukwa-user mailing list archives

From hdev ml <hde...@gmail.com>
Subject Re: Seeking a little advice
Date Tue, 24 Aug 2010 20:56:01 GMT
Thanks, Jerome, for the quick answer.

1&2. We are not sure yet whether we need more than one machine. But the data
size is large, so my guess is that we might need more in the future.

My personal thought is that if we have a Hadoop platform in the company, it
may also be helpful for other large batch processing. For example, we also
want to do data mining, and the Apache Mahout project leverages Hadoop's
capabilities to do that.

The raw data is in text format, but it may or may not be loaded into a
database before my module kicks in to process it. The data arrives at
approximately 40-50 GB per day and is archived for a month or so, so the
total for a month would be around 1.2-1.5 TB.
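
As a quick check of the monthly figure above, here is a minimal sketch,
assuming a 30-day retention window and decimal units (1 TB = 1000 GB). The
function name and parameters are illustrative only and are not part of
Hadoop or Chukwa:

```python
def monthly_volume_tb(daily_gb, retention_days=30, gb_per_tb=1000):
    """Estimate archived volume in TB for a given daily intake in GB.

    retention_days and gb_per_tb are assumptions, not figures from any tool.
    """
    return daily_gb * retention_days / gb_per_tb


# Daily intake of 40-50 GB, kept for roughly one month.
low = monthly_volume_tb(40)
high = monthly_volume_tb(50)
print(f"{low:.1f} - {high:.1f} TB")  # prints "1.2 - 1.5 TB"
```

A longer retention period or a higher daily intake scales the estimate
linearly, which is worth keeping in mind when deciding whether one machine
is enough.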

Again, thanks for your time and effort.

- Harshad


On Tue, Aug 24, 2010 at 1:00 PM, Jerome Boulon <jboulon@netflix.com> wrote:

> If the data is on one machine, then there’s probably no need to move the
> data.
> So the question is more:
>
>    - Do you need more than one machine to do your ETL?
>    - Would you ever need more than one machine?
>
>
> So if you need more than one machine, then Chukwa could be the right answer.
> I have a tool that I could publish to transform any input file into a Chukwa
> compressed dataSink file. This could be a first step.
> Also, Hadoop has a JDBC InputReader/Writer, so you may want to take a look.
>
> Could you give more info on your data (size and ETL)?
>
> /Jerome.
>
>
> On 8/24/10 12:39 PM, "hdev ml" <hdevml@gmail.com> wrote:
>
> Hi all,
>
> This question is related partly to Hadoop and partly to Chukwa.
>
> We have a huge amount of logged information sitting on one machine. I am
> not sure whether the storage is in multiple files or in a database.
>
> What we want to do is get that log information, transform it, and store
> it in some database for data mining / data warehousing / reporting
> purposes.
>
> 1. Since it is on one machine, is Chukwa the right kind of framework to do
> this ETL process?
>
> 2. I understand that Hadoop generally works on large files. But assuming
> that the data sits in a database, what if we somehow partition the data for
> Hadoop/Chukwa? Is that the right strategy?
>
> Any help will be appreciated.
>
> Thanks,
>
> Harshad
>
>
