hadoop-mapreduce-user mailing list archives

From Shashidhar Rao <raoshashidhar...@gmail.com>
Subject Re: XML files in Hadoop
Date Sat, 03 Jan 2015 16:33:17 GMT
Sorry, I meant XML files, not Hive files: would converting the XML files to
a format such as Avro and storing them in Hive be fast?
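
To make the conversion step concrete, here is a minimal sketch of flattening XML into one flat record per element, which is the shape a row-oriented format such as Avro would need. The element and attribute names (`sales`, `sale`, `year`, `product`, `manufacturer`, `store`) are assumptions for illustration only, since the real XML layout is not shown in this thread, and the sketch uses only the Python standard library rather than an actual Avro writer.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the real XML layout is not given in the thread.
SAMPLE_XML = """
<sales>
  <sale year="2014"><product>widget</product><manufacturer>Acme</manufacturer><store>S1</store></sale>
  <sale year="2014"><product>gadget</product><manufacturer>Bolt</manufacturer><store>S2</store></sale>
</sales>
"""

# Child elements copied into each flat record (assumed names).
FIELDS = ("product", "manufacturer", "store")


def flatten_sales(xml_text):
    """Return one flat dict per <sale> element found in the document."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for sale in root.iter("sale"):
        # The year is assumed to be an attribute on <sale>.
        record = {"year": int(sale.get("year"))}
        for field in FIELDS:
            # findtext returns None when the child element is missing.
            record[field] = sale.findtext(field)
        records.append(record)
    return records


records = flatten_sales(SAMPLE_XML)
```

Each dict in `records` corresponds to one row; writing those rows out in Avro and exposing them as a Hive table would be a separate step that this sketch does not cover.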

On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:

> Hi,
>
> The exact number of files is not known, but it will run into millions,
> depending on the requests of the client, who collects terabytes of XML data every day.
> Storing is just one part; the main part will be how to query
> this data (aggregations, counts) and run some analytics over it.
> Fast retrieval is required, e.g. for a particular year, what are the
> top 10 products, top 10 manufacturers, top 10 stores, etc.
>
> Will Hive be a better choice? And will converting these Hive files to
> some format work out?
>
> Thanks
> Shashi
>
> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <wilm.schumacher@gmail.com> wrote:
>
>> Hi,
>>
>> How many XML files are you planning to store? Perhaps it is possible to
>> store them directly on HDFS and save the metadata in HBase. This sounds
>> more reasonable to me.
>>
>> If the number of XML files is too large (millions or billions), then you
>> can use Hadoop MapFiles to pack files together, e.g. grouped by year or
>> month.
>>
>> Regards,
>>
>> Wilm
>>
>> On 03.01.2015 at 17:06, Shashidhar Rao wrote:
>> > Hi,
>> >
>> > Can someone help me by suggesting the best way to solve this use case?
>> >
>> > 1. XML files keep flowing in from an external system and need to be stored
>> > in HDFS.
>> > 2. These files can be stored directly in a NoSQL database, e.g. any
>> > NoSQL database with XML support, or
>> > 3. These files need to be processed and stored in one of the databases
>> > (HBase, Hive, etc.).
>> > 4. There won't be any updates, only reads; data has to be retrieved based
>> > on some queries, and a dashboard with a bit of analytics has to be created.
>> >
>> > The XML files are huge, and the expected number of nodes is roughly
>> > 12.
>> > I am stuck on the storage part: say I convert XML to JSON and store
>> > it in HBase, the processing cost of converting XML to JSON will be huge.
>> >
>> > It will be read-only, with no updates.
>> >
>> > Please suggest how to store these XML files.
>> >
>> > Thanks
>> > Shashi
>>
>>
>
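
The kind of query asked about in the quoted thread (for a given year, the top 10 products, manufacturers, or stores) can be sketched in plain Python as a filter followed by a count and a sort. The record fields (`year`, `product`, `manufacturer`, `store`) and the sample rows are assumptions for illustration; in Hive the same idea would typically be written as a GROUP BY with ORDER BY and LIMIT over whatever table the data ends up in.

```python
from collections import Counter


def top_n(records, year, field, n=10):
    """Return the n most frequent values of `field` among records for `year`."""
    counts = Counter(r[field] for r in records if r["year"] == year)
    # most_common sorts by count, highest first.
    return counts.most_common(n)


# Hypothetical flat records, one per sale.
sample_records = [
    {"year": 2014, "product": "widget", "manufacturer": "Acme", "store": "S1"},
    {"year": 2014, "product": "widget", "manufacturer": "Acme", "store": "S2"},
    {"year": 2014, "product": "gadget", "manufacturer": "Bolt", "store": "S1"},
    {"year": 2013, "product": "gadget", "manufacturer": "Bolt", "store": "S3"},
]

top_products_2014 = top_n(sample_records, 2014, "product")
```

At the data volumes mentioned in the thread this would run as a distributed job rather than in a single process; the sketch only shows the shape of the aggregation.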
