lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Indexing Twitter - Hypothetical
Date Thu, 03 Mar 2016 20:51:47 GMT
As always, the first question needs to be how you wish to query the data:
the queries will drive the data model. I don't want to put words in your
mouth as to your query requirements, so... clue us in on your query
requirements.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildsen <te@statsbiblioteket.dk>
wrote:

> Joseph Obernberger <joseph.obernberger@gmail.com> wrote:
> > Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> > Cloud - roughly 500-600 million docs per day indexing each of the fields
> > (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter indexes the tweets itself and has been quite open about how
> they do it. I would suggest looking for their presentations; slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done a
> lot of work and custom coding to handle it.
>
> > If I were to guess at a sharded setup to handle such data, and keep 2
> > years worth, I would guess about 2500 shards.  Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even if
> you manage to get 2500 shards running, you will want to do a lot of
> tweaking of how queries are issued, so that each request does not require
> all 2500 shards to be searched. One example is prioritizing newer material
> and querying the older shards only if there are not enough recent results.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
>
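
To make the numbers and the "newest first" idea in the quoted message concrete, here is a minimal Python sketch. It is not Solr or SolrCloud API code. The function names, the monthly collection names (tweets_2016_03 and so on) and the in-memory FAKE_INDEX are all hypothetical stand-ins; a real setup would replace fake_search with an actual request to each collection. It also assumes the documents are spread evenly across shards and that 2 years is 730 days.

```python
from typing import Callable, Dict, List, Sequence


def docs_per_shard(docs_per_day: int, days: int, shards: int) -> float:
    """Average documents per shard, assuming an even spread (an assumption)."""
    return docs_per_day * days / shards


def tiered_search(
    collections_newest_first: Sequence[str],
    search_one: Callable[[str, int], List[dict]],
    wanted: int,
) -> List[dict]:
    """Query collections from newest to oldest, stopping once enough hits exist.

    search_one(name, rows) is a caller-supplied function that returns at most
    `rows` hits from one collection. Older collections are only touched when
    the newer ones did not yield `wanted` results.
    """
    results: List[dict] = []
    for name in collections_newest_first:
        if len(results) >= wanted:
            break
        results.extend(search_one(name, wanted - len(results)))
    return results[:wanted]


# Hypothetical in-memory stand-in for three monthly collections.
FAKE_INDEX: Dict[str, List[dict]] = {
    "tweets_2016_03": [{"id": "a"}, {"id": "b"}],
    "tweets_2016_02": [{"id": "c"}, {"id": "d"}, {"id": "e"}],
    "tweets_2016_01": [{"id": "f"}],
}
queried: List[str] = []  # records which collections were actually searched


def fake_search(name: str, rows: int) -> List[dict]:
    """Pretend search: return the first `rows` documents of one collection."""
    queried.append(name)
    return FAKE_INDEX[name][:rows]


if __name__ == "__main__":
    # 500-600 million docs/day, kept for 730 days, over 2500 shards.
    print(docs_per_shard(500_000_000, 730, 2500))
    print(docs_per_shard(600_000_000, 730, 2500))
    order = ["tweets_2016_03", "tweets_2016_02", "tweets_2016_01"]
    print(tiered_search(order, fake_search, 3), queried)
```

Under these assumptions the sizing works out to roughly 146 to 175 million documents per shard. In the toy search, asking for three results touches only the two newest collections, which is the effect Toke describes: most requests never reach the older shards.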
