lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Indexing Twitter - Hypothetical
Date Thu, 03 Mar 2016 23:34:59 GMT
I think some of Twitter's need to index in a particular way comes
from their real-time requirements. So that's part of the decision for
the original poster: how responsive the data needs to be.

As to the rest, I think the company that shows Twitter messages on TV
does something similar with Solr. They were presenting at Revolution
2014 (one before last) I think.  I forgot their name (they changed it
once or twice...)

Regards,
    Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 4 March 2016 at 06:25, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> Joseph Obernberger <joseph.obernberger@gmail.com> wrote:
>> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
>> Cloud - roughly 500-600 million docs per day indexing each of the fields
>> (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how
> they do it. I would suggest looking for their presentations; slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done
> a lot of work and custom coding to handle it.
>
>> If I were to guess at a sharded setup to handle such data, and keep 2 years
>> worth, I would guess about 2500 shards.  Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even
> if you manage to get 2500 shards running, you will want to do a lot of
> tweaking of the way queries are issued, so that each request does not
> require all 2500 shards to be searched. Prioritizing newer material and
> only querying the older shards if there are not enough recent results
> is an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe
> one cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
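
To make Toke's "newer material first" idea concrete, here is a minimal
sketch of the kind of external logic he describes. It is not Solr API
code: the collection names, the `search` callable and its
(collection, query, rows) signature are all assumptions made up for
illustration. In a real setup `search` would wrap whatever client you
use to query one month's collection or cloud.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: search(collection, query, rows) -> list of docs.
SearchFn = Callable[[str, str, int], List[Dict]]


def tiered_search(
    query: str,
    collections_newest_first: Sequence[str],
    search: SearchFn,
    wanted: int = 10,
) -> List[Dict]:
    """Query the newest collection first and fall back to older ones
    only while fewer than `wanted` results have been gathered."""
    hits: List[Dict] = []
    for name in collections_newest_first:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # enough recent results, older collections are skipped
        hits.extend(search(name, query, remaining)[:remaining])
    return hits


# Stand-in backend for demonstration only; one fake collection per month.
FAKE_DATA = {
    "tweets_2016_03": [{"id": 1}],
    "tweets_2016_02": [{"id": 2}, {"id": 3}],
    "tweets_2016_01": [{"id": 4}],
}
queried: List[str] = []  # records which collections were actually hit


def fake_search(collection: str, query: str, rows: int) -> List[Dict]:
    queried.append(collection)
    return FAKE_DATA.get(collection, [])[:rows]


results = tiered_search(
    "solr",
    ["tweets_2016_03", "tweets_2016_02", "tweets_2016_01"],
    fake_search,
    wanted=2,
)
print(results)
print(queried)
```

With two results wanted, the oldest month is never queried in this
example. The sketch ignores relevance ranking across collections, which
a real deployment would have to decide how to handle.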
