hadoop-common-user mailing list archives

From Shashidhar Rao <raoshashidhar...@gmail.com>
Subject Re: XML files in Hadoop
Date Sat, 03 Jan 2015 17:27:39 GMT
Hi Peyman,

Sure, I will try using two Hive tables for the conversion.
It was awesome discussing this with you. Thanks a lot.


Shashi

On Sat, Jan 3, 2015 at 10:53 PM, Peyman Mohajerian <mohajeri@gmail.com>
wrote:

> I would recommend, as the first step, not using Flume, but rather landing
> the data in hdfs in the source format, XML, and using Hive to convert it
> from XML to Parquet. That is much simpler than using Flume. Flume only
> makes sense if you don't care about keeping the original file format and
> want to ingest the data fast, to meet some SLA.
> Flume has a good user guide page if you google it.
> In Hive you need two tables: one that reads the XML data using an XML
> SerDe (an external table), and a second one stored in Parquet format. You
> insert into the second table from the source table, and that will easily
> do the format conversion; see the sketch below.
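>
> To make that concrete, here is a rough, untested sketch. The SerDe and
> input format classes come from the third-party hivexmlserde project, and
> all table, column and XPath names are just placeholders:
>
>   -- External table over the raw XML landed in HDFS; each <product>
>   -- element becomes one row, with fields extracted by XPath.
>   CREATE EXTERNAL TABLE products_xml (
>     name STRING,
>     manufacturer STRING,
>     price DOUBLE
>   )
>   ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>   WITH SERDEPROPERTIES (
>     "column.xpath.name" = "/product/name/text()",
>     "column.xpath.manufacturer" = "/product/manufacturer/text()",
>     "column.xpath.price" = "/product/price/text()"
>   )
>   STORED AS
>     INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>   LOCATION '/data/raw/xml/'
>   TBLPROPERTIES (
>     "xmlinput.start" = "<product",
>     "xmlinput.end" = "</product>"
>   );
>
>   -- Target table in Parquet; the insert-select does the conversion.
>   CREATE TABLE products_parquet (
>     name STRING,
>     manufacturer STRING,
>     price DOUBLE
>   )
>   STORED AS PARQUET;
>
>   INSERT INTO TABLE products_parquet SELECT * FROM products_xml;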
>
> On Sat, Jan 3, 2015 at 9:16 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi Peyman,
>>
>> Really appreciate your suggestion.
>> And if Tableau has to be used to generate reports, that fits well,
>> since Tableau works great with Hive.
>>
>> Just one more question: can flume be used to convert xml data to
>> parquet? I would store the result in Hive as parquet and generate
>> reports using Tableau.
>>
>> If flume can convert xml to parquet, do I need external tools? Can you
>> please provide me some links on how to convert xml to parquet using
>> flume? Predictive analytics may be used on the Hive data in the end
>> phase of the project.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <mohajeri@gmail.com>
>> wrote:
>>
>>> Hi Shashi,
>>> Sure, you can use json instead of Parquet. I was thinking in terms of
>>> using Hive for processing the data, but if you'd like to use Drill
>>> (which I heard is a good choice), then just convert the data to json.
>>> You don't have to deal with parquet or Hive in that case: just use
>>> Flume to convert XML to json (there are many other choices for doing
>>> that within the cluster too) and then use Drill to read and process
>>> the data, as in the example below.
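>>>
>>> For instance, once the json files are on HDFS, Drill can query them in
>>> place through its dfs storage plugin, with no table definition needed
>>> up front (the path and field names here are only illustrative):
>>>
>>>   SELECT t.product.manufacturer AS manufacturer,
>>>          t.product.price AS price
>>>   FROM dfs.`/data/json/2015/` t
>>>   LIMIT 10;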
>>>
>>> Thanks,
>>> Peyman
>>>
>>>
>>> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi Peyman,
>>>>
>>>> Thanks a lot for your suggestions; really appreciate them, and they
>>>> gave me some ideas. Here's how I want to proceed:
>>>> 1.  Use Flume to convert xml to JSON/Parquet before it reaches HDFS.
>>>> 2.  Store the parquet-converted files in Hive.
>>>> 3.  Query using Apache Drill's SQL dialect.
>>>>
>>>> But one thing, can you please help me with: instead of converting to
>>>> parquet, if I convert to json and then store it in Hive as Parquet
>>>> format, is this a feasible option?
>>>> The reason I want to convert to json is that Apache Drill works very
>>>> well with the JSON format.
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mohajeri@gmail.com>
>>>> wrote:
>>>>
>>>>> You can land the data in HDFS as XML files and use a 'hive xml serde'
>>>>> to read the data and write it back in a more optimal format, e.g. ORC
>>>>> or parquet (depending somewhat on your choice of Hadoop distro).
>>>>> Querying XML data directly via Hive is also doable but slow.
>>>>> Converting to Avro is also doable, but in my experience not as fast
>>>>> as ORC or Parquet. Columnar formats give you better performance, but
>>>>> Avro has its own strengths, e.g. managing schema changes better.
>>>>> You can also convert the format before you land the data in HDFS,
>>>>> e.g. using Flume or some other tool that changes the format in flight.
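>>>>>
>>>>> As an illustration of the direct-query route: if each XML record is
>>>>> loaded as a single STRING column, Hive's built-in xpath UDFs can pull
>>>>> fields out at query time (slow, as noted; the table, column and paths
>>>>> below are made up):
>>>>>
>>>>>   SELECT xpath_string(xml, '/product/name') AS name,
>>>>>          xpath_double(xml, '/product/price') AS price
>>>>>   FROM raw_xml_records;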
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <
>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>
>>>>>> Sorry, not Hive files but xml files: will converting the xml files
>>>>>> to some Avro format and storing these in Hive be fast?
>>>>>>
>>>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The exact number of files is not known, but it will run into
>>>>>>> millions, as the client collects terabytes of xml data every day.
>>>>>>> Storing is just one part; the main part will be how to query the
>>>>>>> data, e.g. aggregations, counts and some analytics over it. Fast
>>>>>>> retrieval is required: say, for a particular year, what are the
>>>>>>> top 10 products, top ten manufacturers and top ten stores etc.
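>>>>>>>
>>>>>>> For example, this is the kind of query I have in mind once the
>>>>>>> data is queryable through Hive (the table and column names are
>>>>>>> only illustrative):
>>>>>>>
>>>>>>>   SELECT product, COUNT(*) AS cnt
>>>>>>>   FROM sales
>>>>>>>   WHERE sale_year = 2014
>>>>>>>   GROUP BY product
>>>>>>>   ORDER BY cnt DESC
>>>>>>>   LIMIT 10;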
>>>>>>>
>>>>>>> Will Hive be a better choice? And will converting these Hive files
>>>>>>> to some format work out?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Shashi
>>>>>>>
>>>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>>>>>> wilm.schumacher@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> how many xml files are you planning to store? Perhaps it is
>>>>>>>> possible to store them directly on hdfs and save the meta data in
>>>>>>>> hbase. This sounds more reasonable to me.
>>>>>>>>
>>>>>>>> If the number of xml files is too large (millions and billions),
>>>>>>>> then you can use hadoop map files to put files together, e.g.
>>>>>>>> grouped by year or month.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Wilm
>>>>>>>>
>>>>>>>> On 03.01.2015 at 17:06, Shashidhar Rao wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Can someone help me by suggesting the best way to solve this
>>>>>>>> > use case?
>>>>>>>> >
>>>>>>>> > 1. XML files keep flowing in from an external system and need
>>>>>>>> > to be stored into HDFS.
>>>>>>>> > 2. These files can be stored directly using a NoSql database,
>>>>>>>> > e.g. any NoSql with xml support, or
>>>>>>>> > 3. These files need to be processed and stored in one of the
>>>>>>>> > databases, HBase, Hive etc.
>>>>>>>> > 4. There won't be any updates, only reads, and the data has to
>>>>>>>> > be retrieved based on some queries; a dashboard has to be
>>>>>>>> > created, with bits of analytics.
>>>>>>>> >
>>>>>>>> > The xml files are huge and the expected number of nodes is
>>>>>>>> > roughly around 12.
>>>>>>>> > I am stuck on the storage part: say I convert xml to json and
>>>>>>>> > store it into HBase; the processing from xml to json will be
>>>>>>>> > huge.
>>>>>>>> >
>>>>>>>> > It will be only reading and no updates.
>>>>>>>> >
>>>>>>>> > Please suggest how to store these xml files.
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> > Shashi
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
