lucene-solr-user mailing list archives

From Charlie Hull <>
Subject Re: Indexing Twitter - Hypothetical
Date Fri, 04 Mar 2016 09:19:37 GMT
On 03/03/2016 19:25, Toke Eskildsen wrote:
> Joseph Obernberger <> wrote:
>> Hi All - would it be reasonable to index the Twitter 'firehose'
>> with Solr Cloud - roughly 500-600 million docs per day indexing
>> each of the fields (about 180)?
> Possible, yes. Reasonable? It is not going to be cheap.
> Twitter index the tweets themselves and have been quite open about
> how they do it. I would suggest looking for their presentations;
> slides or recordings. They have presented at Berlin Buzzwords and
> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> they have done a lot of work and custom coding to handle it.

As I recall they're not using Solr, but rather an in-house layer built 
on a customised version of Lucene. They're indexing around half a 
trillion tweets.

If the idea is to provide a searchable archive of all tweets, my first
question would be 'why'. If the idea is instead to monitor new tweets for
particular patterns, there are better ways to do this (Luwak, for example).
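
To illustrate what monitoring means here: rather than indexing every tweet and searching the archive, you store the queries and run each incoming tweet past them. The sketch below is only a toy version of that idea in Python. It is not Luwak's API; the class names, the "all terms must be present" matching rule and the whitespace tokenisation are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StoredQuery:
    """A registered query: matches when every required term is present."""
    query_id: str
    required_terms: frozenset


@dataclass
class Monitor:
    """Holds stored queries and matches each incoming document against them."""
    queries: list = field(default_factory=list)

    def register(self, query_id, terms):
        # Lower-case the terms so matching is case-insensitive.
        normalised = frozenset(t.lower() for t in terms)
        self.queries.append(StoredQuery(query_id, normalised))

    def match(self, text):
        # Naive whitespace tokenisation (an assumption for this sketch);
        # a real system would use a proper analyser.
        tokens = set(text.lower().split())
        return [q.query_id for q in self.queries if q.required_terms <= tokens]


monitor = Monitor()
monitor.register("solr-outage", ["solr", "down"])
monitor.register("lucene-release", ["lucene", "released"])

# Each new tweet is checked once as it arrives; nothing is kept afterwards.
matches = monitor.match("Our Solr cluster is down again")
```

The point is that the work scales with the number of stored queries and the arrival rate, not with the size of a multi-year archive. A real monitoring library would also avoid running every stored query against every document, which this loop does not attempt.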

>> If I were to guess at a sharded setup to handle such data, and keep
>> 2 years' worth, I would guess about 2500 shards. Is that
>> reasonable?
> I think you need to think well beyond standard SolrCloud setups. Even
> if you manage to get 2500 shards running, you will want to do a lot
> of tweaking on the way to issue queries so that each request does not
> require all 2500 shards to be searched. Prioritizing newer material
> and only querying the older shards if there are not enough recent results
> is an example.
> I highly doubt that a single SolrCloud is the best answer here. Maybe
> one cloud for each month and a lot of external logic?
> - Toke Eskildsen
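
The newest-first approach Toke describes could be sketched roughly as below. This is only an outline of the external logic, written in Python with in-memory stand-ins. The `tiered_search` and `make_tier` helpers are hypothetical, and in practice each tier would be a call to a separate per-month collection or cloud rather than a list of strings.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a function that searches one time-partitioned collection
# (for example one month of tweets) and returns up to `limit` hits.
SearchFn = Callable[[str, int], List[str]]


def tiered_search(query: str, tiers: Sequence[SearchFn], wanted: int) -> List[str]:
    """Query tiers newest-first, stopping once `wanted` hits are collected."""
    hits: List[str] = []
    for search in tiers:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # Enough recent results; older tiers are never touched.
        hits.extend(search(query, remaining)[:remaining])
    return hits


def make_tier(docs: List[str]) -> SearchFn:
    """Build an in-memory stand-in for a real per-month collection."""
    def search(query: str, limit: int) -> List[str]:
        return [d for d in docs if query in d][:limit]
    return search


# Tiers are ordered newest first.
tiers = [
    make_tier(["solr tip (Mar 2016)"]),
    make_tier(["solr news (Feb 2016)", "lucene news (Feb 2016)"]),
    make_tier(["solr talk (Jan 2016)"]),
]

results = tiered_search("solr", tiers, wanted=2)
```

Here the January tier is never queried because the two newer tiers already supply enough hits. The trade-off is that results are ordered by recency of tier rather than by relevance across the whole archive, so this only suits use cases where newer material really should win.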

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
