lucene-solr-user mailing list archives

From Jack Krupansky <>
Subject Re: Indexing Twitter - Hypothetical
Date Thu, 03 Mar 2016 20:51:47 GMT
As always, the first question has to be how you wish to query
the data - the queries will drive the data model. I don't want to put words in
your mouth as to your query requirements, so... clue us in on your query
requirements.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildsen <> wrote:

> Joseph Obernberger <> wrote:
> > Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> > Cloud - roughly 500-600 million docs per day indexing each of the fields
> > (about 180)?
> Possible, yes. Reasonable? It is not going to be cheap.
> Twitter indexes the tweets itself and has been quite open about how
> it does so. I would suggest looking for its presentations, either slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done a
> lot of work and custom coding to handle it.
> > If I were to guess at a sharded setup to handle such data, and keep 2 years
> > worth, I would guess about 2500 shards. Is that reasonable?
> I think you need to think well beyond standard SolrCloud setups. Even if
> you manage to get 2500 shards running, you will want to do a lot of
> tweaking of how queries are issued so that each request does not require
> all 2500 shards to be searched. Prioritizing newer material and querying
> the older shards only if there are not enough recent results is one example.
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
> - Toke Eskildsen
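
The "newest material first, older shards only when needed" idea above, combined with one collection per month, could be sketched roughly as follows. This is only an illustration under assumptions: the monthly collection names, the `search` callable and the in-memory fake index are all hypothetical stand-ins, not Solr APIs, and a real setup would also need cross-collection ranking and sorting. The `docs_per_shard` helper just redoes the arithmetic from the question (docs per day, two years of retention, 2500 shards).

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a real search call: given a collection name and a
# row limit, return up to that many matching documents from that collection.
SearchFn = Callable[[str, int], List[dict]]


def search_newest_first(
    collections: Sequence[str], search: SearchFn, wanted: int
) -> List[dict]:
    """Query collections newest to oldest, stopping once enough hits exist."""
    results: List[dict] = []
    for name in collections:  # assumed to be ordered newest first
        if len(results) >= wanted:
            break  # older collections are never touched
        results.extend(search(name, wanted - len(results)))
    return results[:wanted]


def docs_per_shard(docs_per_day: int, days: int, shards: int) -> float:
    """Average number of documents each shard would hold."""
    return docs_per_day * days / shards


# Toy data: one hypothetical collection per month, newest first.
fake_index: Dict[str, List[dict]] = {
    "tweets_2016_03": [{"id": 1}, {"id": 2}],
    "tweets_2016_02": [{"id": 3}],
    "tweets_2016_01": [{"id": 4}, {"id": 5}],
}
queried: List[str] = []  # records which collections were actually searched


def fake_search(name: str, rows: int) -> List[dict]:
    queried.append(name)
    return fake_index[name][:rows]


hits = search_newest_first(list(fake_index), fake_search, wanted=3)
print([d["id"] for d in hits])  # ids gathered from the two newest months
print(queried)                  # the oldest month is not queried
print(docs_per_shard(500_000_000, 730, 2500))  # low end of the 500-600M range
```

With these numbers, 2500 shards would each hold on the order of 146 to 175 million documents after two years, which is why limiting how many shards a typical request touches matters so much.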
