lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Solr limitations
Date Mon, 08 Jul 2013 14:28:55 GMT
Other than the per-node/per-collection limit of 2 billion documents per 
Lucene index, most of the limits of Solr are performance-based limits - Solr 
can handle it, but the performance may not be acceptable. Dynamic fields are 
a great example. Nothing prevents you from creating a document with, say, 
50,000 dynamic fields, but you are likely to find the performance less than 
acceptable. Or facets. Sure, Solr will let you have 5,000 faceted fields, 
but the performance is likely to be... you get the picture.

What is acceptable performance? That's for you to decide.

What will the performance of 5,000 dynamic fields or 500 faceted fields or 
500 million documents on a node be? It all depends on your data, especially 
the cardinality (unique values) of each individual field.
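
As a rough illustration of what "cardinality" means here, the sketch below counts the unique values per field in a sample of documents. The function name and the sample data are made up for this example; a real measurement would run against your own representative data.

```python
from collections import defaultdict

def field_cardinality(docs):
    """Count the distinct values seen in each field across a sample of docs."""
    unique = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            # Treat a multivalued field (a list) as several single values.
            values = value if isinstance(value, list) else [value]
            unique[field].update(values)
    return {field: len(vals) for field, vals in unique.items()}

# Hypothetical sample documents, for illustration only.
sample = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "red"},
    {"id": 3, "color": "blue"},
]
print(field_cardinality(sample))
```

A field with very high cardinality is the kind that deserves early attention in a Proof of Concept.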

How can you determine the performance? Only one way: Proof of concept. You 
need to do your own proof of concept implementation, with your own 
representative data, with your own representative data model, with your own 
representative hardware, with your own representative client software, with 
your own representative user query load. That testing will give you all the 
answers you need.

There are no magic answers. Don't believe any magic spreadsheet or magic 
wizard. Flip a coin whether they will work for your situation.

Some simple, common sense limits:

1. No more than 50 to 100 million documents per node.
2. No more than 250 fields per document.
3. No more than 250K characters per document.
4. No more than 25 faceted fields.
5. No more than 32 nodes in your SolrCloud cluster.
6. Don't return more than 250 results on a query.

None of those is a hard limit, but don't go beyond them unless your Proof of 
Concept testing proves that performance is acceptable for your situation.
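
One way to keep these guidelines in view is a small check of a planned design against them. This is only a sketch: the key names and the sample plan are invented for this example, and the numbers are the rules of thumb listed above (using the upper end of the 50 to 100 million range), not hard limits.

```python
# Rule-of-thumb guidelines from the list above; none of these is a hard limit.
GUIDELINES = {
    "docs_per_node": 100_000_000,
    "fields_per_doc": 250,
    "chars_per_doc": 250_000,
    "faceted_fields": 25,
    "cluster_nodes": 32,
    "rows_per_query": 250,
}

def exceeded_guidelines(plan):
    """Return the names of the guidelines that the plan exceeds, sorted."""
    return sorted(
        name for name, limit in GUIDELINES.items()
        if plan.get(name, 0) > limit
    )

# Hypothetical plan, for illustration only.
plan = {"docs_per_node": 40_000_000, "fields_per_doc": 900, "faceted_fields": 30}
print(exceeded_guidelines(plan))
```

Anything the check reports is a candidate for Proof of Concept testing, not an automatic failure.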

Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests 
and then scale as needed.
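
For reference, a 2-shard, 2-replica collection can be requested through the SolrCloud Collections API. The sketch below only builds the request URL and does not contact a server; the base URL and the collection name "poc" are assumptions for this example.

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, num_shards=2, replication_factor=2):
    """Build a Collections API CREATE request URL (nothing is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

# Hypothetical host and collection name.
url = create_collection_url("http://localhost:8983/solr", "poc")
print(url)
```

With 2 shards and 2 copies of each shard, that is 4 cores in total, one per node on a 4-node cluster.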

Dynamic and multivalued fields? Try to stay away from them - except for the 
simplest cases, they are usually an indicator of a weak data model. Sure, 
it's fine to store a relatively small number of values in a multivalued 
field (say, dozens of values), but be aware that you can't directly access 
individual values, you can't tell which was matched on a query, and you 
can't coordinate values between multiple multivalued fields. Except for very 
simple cases, multivalued fields should be flattened into multiple documents 
with a parent ID.
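
A minimal sketch of that flattening, done on the client side before indexing. The field names ("authors", "author", "parent_id") and the child ID scheme are assumptions for this example, not anything Solr requires.

```python
def flatten(doc, multi_field, child_field):
    """Split one document into a parent plus one child document per value."""
    # The parent keeps every field except the multivalued one.
    parent = {k: v for k, v in doc.items() if k != multi_field}
    # Each value becomes its own document that points back at the parent.
    children = [
        {
            "id": f"{doc['id']}_{i}",
            "parent_id": doc["id"],
            child_field: value,
        }
        for i, value in enumerate(doc.get(multi_field, []))
    ]
    return [parent] + children

# Hypothetical document, for illustration only.
doc = {"id": "book1", "title": "Solr notes", "authors": ["Ann", "Bob"]}
for d in flatten(doc, "authors", "author"):
    print(d)
```

Each child document can then be matched and returned individually, which is exactly what a multivalued field cannot give you.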

Since you brought up the topic of dynamic fields, I am curious how you got 
the impression that they were a good technique to use as a starting point. 
They're fine for prototyping and hacking, and fine when used in moderation, 
but not when used to excess. The whole point of Solr is searching and 
searching is optimized within fields, not across fields, so having lots of 
dynamic fields is counter to the primary strengths of Lucene and Solr. 
And... schemas with lots of dynamic fields tend to be difficult to 
maintain. For example, if you wanted to ask a support question here, one of 
the first things we want to know is what your schema looks like, but with 
lots of dynamic fields it is not possible to have a simple discussion of 
what your schema looks like.

Sure, there is something called "schemaless design" (and Solr supports that 
in 4.4), but that's very different from heavy reliance on dynamic fields in 
the traditional sense. Schemaless design is A-OK, but using dynamic fields 
for "arrays" of data in a single document is a poor match for the search 
features of Solr (e.g., Edismax searching across multiple fields).

One other tidbit: Although Solr does not enforce naming conventions for 
field names, and you can put special characters in them, there are plenty of 
features in Solr, such as the common "fl" parameter, where field names are 
expected to adhere to Java naming rules. When people start "going wild" with 
dynamic fields, it is common that they start "going wild" with their names 
as well, using spaces, colons, slashes, etc. that cannot be parsed in the 
"fl" and "qf" parameters, for example. Please don't go there!

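A simple guard against "wild" names is to check them before they reach the index. The sketch below uses a conservative ASCII pattern (a letter or underscore, then letters, digits, or underscores); that pattern is an assumption of this example and is stricter than the full Java identifier rules, which also allow characters such as "$" and non-ASCII letters.

```python
import re

# Conservative ASCII subset of Java identifier syntax (an assumption here).
SAFE_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_safe_field_name(name):
    """Return True if the whole name matches the conservative pattern."""
    return SAFE_FIELD_NAME.fullmatch(name) is not None

# Hypothetical field names, for illustration only.
for name in ["price_usd", "attr_color_s", "my field", "a:b", "2fast"]:
    print(name, is_safe_field_name(name))
```
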
In short, put up a small cluster and start doing a Proof of Concept. 
Stay within my suggested guidelines and you should do okay.

-- Jack Krupansky

-----Original Message----- 
From: Marcelo Elias Del Valle
Sent: Monday, July 08, 2013 9:46 AM
To: solr-user@lucene.apache.org
Subject: Solr limitations

Hello everyone,

    I am trying to search information about possible solr limitations I
should consider in my architecture. Things like max number of dynamic
fields, max number of documents in SolrCloud, etc.
    Does anyone know where I can find this info?

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr 

