hadoop-common-user mailing list archives

From "Omer Trajman" <o...@vertica.com>
Subject RE: indexing log files for adhoc queries - suggestions?
Date Sat, 03 Oct 2009 12:43:13 GMT
You might consider loading the logs into a parallel database for the ad hoc queries (full
disclosure: I work for a database company).

For repeated ad hoc queries, a distributed database will give you the scalability of HDFS
and also structure the data to handle fast predicates and relational aggregates.

-Omer
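
To make "fast predicates and relational aggregates" concrete, here is a minimal sketch of the two query shapes involved: a user-plus-time-range predicate and a per-user aggregate. It uses Python's built-in sqlite3 purely as a single-node stand-in, not a parallel database, and the table name, column names, and helper functions are all assumptions made for illustration.

```python
import sqlite3


def build_log_table(rows):
    """Load (user_id, ts, message) rows into an in-memory table.

    The schema and the composite index are hypothetical; they only
    illustrate structuring logs for user + time-range lookups.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (user_id TEXT, ts INTEGER, message TEXT)")
    # Composite index so the predicate below does not need a full scan.
    conn.execute("CREATE INDEX logs_user_ts ON logs (user_id, ts)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def logs_for_user(conn, user_id, start_ts, end_ts):
    """Predicate query: one user's logs in the half-open range [start_ts, end_ts)."""
    cur = conn.execute(
        "SELECT ts, message FROM logs "
        "WHERE user_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (user_id, start_ts, end_ts),
    )
    return cur.fetchall()


def events_per_user(conn):
    """Relational aggregate: number of log lines per user."""
    cur = conn.execute(
        "SELECT user_id, COUNT(*) FROM logs GROUP BY user_id ORDER BY user_id"
    )
    return cur.fetchall()
```

Calling logs_for_user(conn, "alice", 100, 300) returns only alice's rows with 100 <= ts < 300, which is the "user and time range" lookup from the original question.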


-----Original Message-----
From: Amandeep Khurana <amansk@gmail.com>
Sent: Saturday, October 03, 2009 04:07
To: common-user@hadoop.apache.org <common-user@hadoop.apache.org>
Subject: Re: indexing log files for adhoc queries - suggestions?

HBase is built on HDFS, but you don't need MapReduce just to read
records from it, so it's possible to access it in real time. The 0.20
release is comparable to MySQL as far as random reads go...

I haven't heard of Hive talking to HBase yet. But that'll be a good
feature to have for sure.

On 10/2/09, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> My understanding is that *no* tools built on top of MapReduce (Hive, Pig,
> Cascading, CloudBase...) can be real-time where real-time is something that
> processes the data and produces output in under 5 seconds or so.
>
> I believe Hive can read HBase now, too.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Amandeep Khurana <amansk@gmail.com>
>> To: common-user@hadoop.apache.org
>> Sent: Saturday, October 3, 2009 1:18:57 AM
>> Subject: Re: indexing log files for adhoc queries - suggestions?
>>
>> There's another option - Cascading.
>>
>> With Pig and Cascading you can use HBase as a backend. So that might
>> be something you can explore too... The choice will depend on what
>> kind of querying you want to do - real time or batch processed.
>>
>> On 10/2/09, Otis Gospodnetic wrote:
>> > Use Pig or Hive.  Lots of overlap, some differences, but it looks like
>> > both projects' future plans mean even more overlap, though I didn't
>> > hear any mentions of convergence and merging.
>> >
>> > Otis
>> > --
>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Amandeep Khurana
>> >> To: common-user@hadoop.apache.org
>> >> Sent: Friday, October 2, 2009 6:28:51 PM
>> >> Subject: Re: indexing log files for adhoc queries - suggestions?
>> >>
>> >> Hive is an SQL-like abstraction over MapReduce. It lets you execute
>> >> SQL-like queries over data without having to write the MR job
>> >> yourself; behind the scenes, it converts the query into a job.
>> >>
>> >> HBase might be what you are looking for. You can put your logs into
>> >> HBase and query them as well as run MR jobs over them...
>> >>
>> >> On 10/1/09, Mayuran Yogarajah wrote:
>> >> > ishwar ramani wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I have a setup where logs are periodically bundled up and dumped
>> >> >> into Hadoop DFS as a large sequence file.
>> >> >>
>> >> >> It works fine for all my map reduce jobs.
>> >> >>
>> >> >> Now I need to handle ad hoc queries for pulling out logs based on
>> >> >> user and time range.
>> >> >>
>> >> >> I really don't need a full indexer (like Lucene) for this purpose.
>> >> >>
>> >> >> My first thought is to run a periodic MapReduce to generate a large
>> >> >> text file sorted by user id.
>> >> >>
>> >> >> The text file will have (sequence file name, offset) to retrieve
>> >> >> the logs
>> >> >> ....
>> >> >>
>> >> >>
>> >> >> I am guessing many of you ran into similar requirements... Any
>> >> >> suggestions on doing this better?
>> >> >>
>> >> >> ishwar
>> >> >>
>> >> > Have you looked into Hive? It's perfect for ad hoc queries.
>> >> >
>> >> > M
>> >> >
>> >>
>> >>
>> >> --
>> >>
>> >>
>> >> Amandeep Khurana
>> >> Computer Science Graduate Student
>> >> University of California, Santa Cruz
>> >
>> >
>>
>>
>
>


-- 


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
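
For reference, the index file ishwar describes in the original question (a text file sorted by user id whose entries point at a sequence file name and offset) can be sketched roughly as follows. This is plain Python standing in for the output of the periodic MapReduce job, not Hadoop code. The tab-separated line layout, the zero-padded timestamp column (added so that time ranges can be searched too), and the file names in the usage note are assumptions for illustration only.

```python
import bisect


def build_index(records):
    """Build sorted index lines from (user_id, ts, seq_file, offset) tuples.

    Each line is "user<TAB>timestamp<TAB>file<TAB>offset". The timestamp
    is zero-padded so that plain string sorting orders entries by user
    and then by time. Assumes user ids contain no tab characters.
    """
    return sorted(
        f"{user}\t{ts:012d}\t{seq_file}\t{offset}"
        for user, ts, seq_file, offset in records
    )


def lookup(index_lines, user_id, start_ts, end_ts):
    """Return (seq_file, offset) pairs for one user in [start_ts, end_ts).

    Binary search over the sorted lines finds the first and last matching
    entries, so the whole index is never scanned.
    """
    lo = bisect.bisect_left(index_lines, f"{user_id}\t{start_ts:012d}")
    hi = bisect.bisect_left(index_lines, f"{user_id}\t{end_ts:012d}")
    results = []
    for line in index_lines[lo:hi]:
        _user, _ts, seq_file, offset = line.split("\t")
        results.append((seq_file, int(offset)))
    return results
```

With an index built from entries such as ("alice", 100, "logs-0001.seq", 0), lookup(index, "alice", 100, 300) yields the file names and offsets to seek to; reading the actual records from the sequence files is left out of this sketch.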
