lucene-solr-user mailing list archives

From "Noble Paul നോബിള്‍ नोब्ळ्" <noble.p...@gmail.com>
Subject Re: Solr for large volume data processing with minimal full-text search
Date Fri, 07 Nov 2008 17:27:25 GMT
If you need anything close to real time (responses within a few seconds),
Hadoop and its ilk are not an option. Solr is fine, but be prepared to
dedicate a lot of hardware to it.

On Fri, Nov 7, 2008 at 10:53 PM, souravm <SOURAVM@infosys.com> wrote:
> Hi Shalin,
>
> Thanks for your input.
>
> Yes I agree that my application is not much about full text search.
>
> Hive/Chukwa/Pig (a combination) running on Hadoop can be a good bet. But where they fall
> short is in online querying of the huge data.
>
> I am specifically talking about Pig in this case, which has benchmark figures on the
> order of 3-10 minutes with 11 nodes for around 4 GB of data (200 M records), whereas for
> Solr I can see that processing time is under a second on 1 node (but with more memory)
> for around 1 GB of data (0.5 M records).
>
> Since online query performance is one of the key requirements for my application (I think
> that, irrespective of the type of application, no user would like to wait at the screen
> for more than a minute), I'm in a dilemma.
>
> Regards,
> Sourav
>
>
>
> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com]
> Sent: Friday, November 07, 2008 7:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Multicore ...
>
> From what I can understand, you have little full-text search involved here.
> You should probably look at Hadoop and its contrib and sub-projects such as
> Pig, Hive and Chukwa.
>
> http://wiki.apache.org/hadoop/
> http://wiki.apache.org/hadoop/Hive
> http://wiki.apache.org/hadoop/Chukwa
> http://incubator.apache.org/pig/
>
> On Fri, Nov 7, 2008 at 9:03 PM, souravm <SOURAVM@infosys.com> wrote:
>
>> Hi Guys,
>>
>> I'm struggling to decide whether Solr would be a fitting solution
>> for me. I would highly appreciate your views.
>>
>> The key requirements can be summarized as below -
>>
>> 1. Need to process a very high volume of data online from log files of
>> various applications - hundreds of millions of records, with total size
>> varying within a range of 30-40 GB.
>>
>> 2. Flexibility - Log file formats from different applications would be
>> different. Also for the same application log file formats can vary. However,
>> the log files would be in xml and if a new type has to be supported then the
>> schema for the same would be known before hand.
>>
>> 3. The type of queries to be supported -
>> a) Mostly aggregation type statistics (min, max, average, sd, count etc.)
>> of response times, sales numbers etc.
>> b) Ability to support adhoc queries relating multiple fields in a given
>> logfile, joining similar fields in multiple logfiles
>>
>> 4. Expected performance would be around 10 to 20 sec for the majority of
>> queries. For the rest it may be somewhat higher.
>>
>> I'm planning to use Solr with the multicore and distributed search features.
>> However, I'm also considering Hadoop with HBase, as that looks to be a natural
>> solution for supporting multiple file formats and handling ad hoc queries.
>>
>> I would like to have your viewpoints in this regard - whether, given
>> the key requirements above, Solr is the right choice or Hadoop+HBase would be
>> better (or any other open source product).
>>
>> Thanks in advance.
>>
>> Regards,
>> Sourav
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul
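
As an illustration of the aggregation statistics named in requirement 3(a) of
the original question (min, max, average, sd, count), here is a minimal Python
sketch that computes all of them in one pass over a stream of values, using
Welford's update for the variance. It is not taken from the thread and is not
tied to Solr, Pig or HBase; the names RunningStats and summarize are
hypothetical, and the input is assumed to be numeric values (for example
response times) already parsed out of the log files.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass count/min/max/mean/sd accumulator (hypothetical helper)."""
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        # Welford's update: avoids storing the values and avoids the
        # numerically fragile sum-of-squares formula.
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def sd(self) -> float:
        # Population standard deviation; 0.0 when no values were added.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def summarize(values):
    """Feed every value through a RunningStats and return it."""
    stats = RunningStats()
    for value in values:
        stats.add(value)
    return stats
```

For example, summarize([2, 4, 4, 4, 5, 5, 7, 9]) yields a count of 8, a minimum
of 2, a maximum of 9, a mean of 5 and a population standard deviation of about
2. Because each value is seen only once and nothing is buffered, the same
accumulator can be run per log file and per field; whether that meets the
10 to 20 second target at the volumes discussed depends on the hardware and is
not something this sketch measures.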
