hbase-user mailing list archives

From Eric <eric.x...@gmail.com>
Subject Re: storing logs in hbase
Date Mon, 06 Feb 2012 15:46:05 GMT
It sounds to me like you would be better off using Hive. HBase is suited to
real-time access to specific records. If, as you said yourself, you want to
do batch processing (MapReduce) on your data, then Hive removes all the
HBase overhead and gives you a powerful query language for searching
through your data. You can also use Pig with, e.g., Wonderdog to index your
data in ElasticSearch.
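For the kind of batch report discussed below, the work is essentially a grouped aggregation over parsed log lines. A minimal pure-Python sketch of that shape of job (the log format and field names here are made up for illustration; Hive would express the same thing as a GROUP BY query compiled to MapReduce):

```python
from collections import Counter

# Hypothetical raw log lines of the form "date path status" -- the real
# layout depends on your web server; these fields are illustrative only.
raw_logs = [
    "2012-02-01 /index.html 200",
    "2012-02-01 /missing 404",
    "2012-02-02 /index.html 200",
]

def hits_per_day(lines):
    """Group-by-and-count: the shape of a typical Hive/MR report."""
    counts = Counter()
    for line in lines:
        date, path, status = line.split()
        counts[date] += 1
    return dict(counts)

print(hits_per_day(raw_logs))  # {'2012-02-01': 2, '2012-02-02': 1}
```

In Hive this would be roughly `SELECT dt, COUNT(*) FROM logs GROUP BY dt` over an external table pointed at the raw files in HDFS, with no HBase involved at all.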

2012/2/5 Doug Meil <doug.meil@explorysmedical.com>

> ... but it depends on what you want to do.  If you want full-text
> searching, then yes, you probably want to look at Lucene.  If you want
> activity analysis, summaries are probably better.
> On 2/5/12 1:54 PM, "Doug Meil" <doug.meil@explorysmedical.com> wrote:
> >
> >Hi there-
> >
> >You probably want to check out these chapters of the HBase ref guide:
> >
> >http://hbase.apache.org/book.html#datamodel
> >http://hbase.apache.org/book.html#schema
> >http://hbase.apache.org/book.html#mapreduce
> >
> >... and with respect to the "40 minutes per report", a common pattern is
> >to create summary tables/files as appropriate.
> >
> >On 2/5/12 3:37 AM, "mete" <efkarr@gmail.com> wrote:
> >
> >>Hello,
> >>
> >>I am thinking about using HBase for storing web log data. I like the
> >>idea of having HDFS underneath, so that I won't have to worry much about
> >>failure cases and can benefit from all the cool HBase features.
> >>
> >>The thing I could not figure out is how to effectively store and query
> >>the data. I am planning to split each kind of log record into 10-20
> >>columns and then use MR jobs to query the table with full scans.
> >>(I guess I could use Hive or Pig for this as well, but I am not familiar
> >>with those yet.)
> >>I find this approach simple and easy to implement, but on the other
> >>hand it is an offline process: it could take a lot of time to get a
> >>single report. And of course a business user would be very disappointed
> >>to see that he/she has to wait another 40 minutes for the results of a
> >>query.
> >>
> >>So what I am trying to achieve is to keep this query time as small as
> >>possible. For this I can sacrifice write speed as well; I don't really
> >>have to integrate new logs on the fly, and a job that runs overnight is
> >>also fine.
> >>
> >>So for this kind of situation, do you find HBase useful?
> >>
> >>I read about star-schema design for more effective queries, but that
> >>makes the developer's job a lot harder, because I would need to design
> >>different schemas for different log types; adding a new log type would
> >>require time to gather requirements, develop, etc.
> >>
> >>I thought about creating a very simple HBase schema, just a key and the
> >>content for each record, and then indexing this content with Lucene.
> >>But then it sounded like I did not need HBase in the first place,
> >>because I would not really be benefiting from it except for storage. I
> >>also could not be sure how big my Lucene indexes would get, and whether
> >>I could cope with big data on Lucene. What do you think about Lucene
> >>indexes on HBase?
> >>
> >>I read about how Rackspace does things; as far as I understood, they
> >>generate Lucene indexes while parsing the logs in Hadoop and then merge
> >>each index into a system that serves the previous indexes:
> >>http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
> >>
> >>Does anyone use a similar approach, or have any ideas about this?
> >>
> >>Do you think any of these are suitable? Or if not, should I try a
> >>different way?
> >>
> >>Thanks in advance
> >>Mete
> >
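If you do stay with HBase, the schema chapters Doug points to come down largely to row-key design: rows are stored sorted by key, so a key like log type plus a reversed timestamp keeps the newest records of each type in one contiguous, scannable range. A minimal pure-Python sketch of such a key (the layout is a hypothetical example for illustration, not a recommendation from this thread):

```python
# Sketch of a time-ordered HBase row key: log type, then a reversed
# timestamp so that newer records sort first within each type. The
# exact layout here is assumed for illustration only.
LONG_MAX = 2**63 - 1

def row_key(log_type: str, epoch_millis: int) -> str:
    reversed_ts = LONG_MAX - epoch_millis  # newer -> smaller -> sorts first
    return f"{log_type}/{reversed_ts:019d}"  # zero-pad so strings sort numerically

k_old = row_key("apache_access", 1328538365000)
k_new = row_key("apache_access", 1328538366000)
assert k_new < k_old  # the newest record sorts first within its log type
```

A prefix scan over `apache_access/` would then return the most recent records of that log type first, which is the access pattern where HBase, unlike plain HDFS files, pays off.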
