Guidelines for Deploying Cloudera Search
CDH initially deploys Solr with a JVM size of 1 GB. In the context of Search, 1 GB is a small value. Starting with this small value simplifies JVM deployment, but the value is insufficient for most actual use cases. Some of the factors to consider when determining an acceptable value for production usage are:
- The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable data requires more memory than 1 TB of searchable data.
- What is indexed within the searchable material. Indexing all fields in a collection of logs, emails, or Wikipedia entries requires more memory than indexing only the Date Created field.
- What level of performance is required. If the system must be stable and respond quickly, more memory may help. If slow responses are acceptable, you may be able to use less memory.
The only way to ensure an appropriate amount of memory is to consider your requirements and experiment in your environment. In general:
- 4 GB may be acceptable for smaller loads or for evaluation.
- 12 GB is sufficient for some production environments.
- 48 GB is sufficient for most situations.
This topic guides you towards solutions, rather than providing a list of absolute requirements. Having a sample application to benchmark different use cases and data types and sizes can help you identify the most important performance factors.
To determine how best to deploy Search in your environment, it is important to define use cases. The same Solr index can have very different hardware requirements depending on the queries that are performed. The most common variation in hardware requirements is memory. For example, the memory requirements for faceting vary with the number of unique terms in the faceted field. Suppose you want to use faceting on a field that has 10 unique values. In that case, only 10 logical containers are required for counting, so no matter how many documents are in the index, the memory overhead is almost non-existent. Conversely, the same index could have a unique timestamp for every entry, and faceting directly on that field would require a logical container for every unique value. With many documents (500 million, for example), faceting across 10 such fields would increase the RAM requirements significantly.
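As a rough illustration of why low-cardinality faceting is cheap, the following sketch (field values and document counts are invented for illustration) counts facets over one million simulated documents; the number of counting containers matches the number of unique values, not the number of documents:

```python
import random
from collections import Counter

# Hypothetical low-cardinality field: 10 possible status values.
STATUS_VALUES = ["new", "open", "pending", "resolved", "closed",
                 "duplicate", "invalid", "wontfix", "reopened", "deferred"]

# One million simulated "documents", each with one status value.
docs = [random.choice(STATUS_VALUES) for _ in range(1_000_000)]

# Faceting needs one logical container (counter) per unique value,
# regardless of how many documents are counted into them.
facet_counts = Counter(docs)
assert len(facet_counts) == len(STATUS_VALUES)      # 10 containers
assert sum(facet_counts.values()) == len(docs)      # all docs counted
```

A timestamp field with a unique value per document would instead need one container per document, which is where the memory cost explodes.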
For this reason, you must define use cases and characterize your data before you can estimate hardware requirements. The important parameters to consider are:
- Number of documents. For Cloudera Search, sharding is almost always required.
- Approximate word count for each potential field.
- What information is stored in the Solr index and what is only for searching. Information stored in the index is returned with the search results.
- Foreign language support.
- How many different languages appear in your data?
- What percentage of documents are in each language?
- Is language-specific searching to be supported? This determines whether accent folding and storing the text in a single field is sufficient.
- What language families are going to be searched? For example, you could combine all Western European languages into a single field, but it is not practical to combine English and Chinese into a single field. Even within more similar sets of languages, using a single field for different languages may be problematic. For instance, accents sometimes alter the meaning of a word, and in such cases accent folding loses important distinctions.
- Faceting requirements
- Be wary of faceting on fields that have many unique terms. For example, faceting on timestamps or free-text fields typically has a high cost. Faceting on a field with more than 10,000 unique values is typically not useful. Ensure that any such faceting requirement is necessary.
- Types of facets. You can facet on queries as well as field values. Faceting on queries is often useful for dates. For example, “in the last day” or “in the last week” can be valuable. Using Solr Date Math to facet on a bare “NOW” is almost always inefficient. Facet-by-query is not memory-intensive since the number of logical containers is limited by the number of queries specified, no matter how many unique values are in the underlying field. This can enable faceting on things such as dates or times, while avoiding the problem described for faceting on fields with unique terms.
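The date-range style of facet-by-query described above can be sketched as the following Solr request parameters. The field name `timestamp_dt` and the specific ranges are assumptions for illustration; `facet.query` and Solr Date Math rounding such as `NOW/DAY` are standard Solr features:

```python
from urllib.parse import urlencode

# Hypothetical facet-by-query parameters for "in the last day" and
# "in the last week" buckets on an assumed timestamp_dt field.
params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    # Rounding with NOW/DAY keeps the query text stable across requests
    # so it can be cached; a bare NOW changes every millisecond.
    ("facet.query", "timestamp_dt:[NOW/DAY-1DAY TO NOW/DAY]"),
    ("facet.query", "timestamp_dt:[NOW/DAY-7DAYS TO NOW/DAY]"),
]
query_string = urlencode(params)
# Only two logical containers are needed, one per facet.query clause,
# no matter how many unique timestamps exist in the index.
```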
- Sorting requirements
- Sorting requires one integer for each document (maxDoc) and takes up significant memory. Additionally, sorting on strings requires storing each unique string value.
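A back-of-envelope sketch of the sorting cost above, assuming a 4-byte ordinal per document plus the raw bytes of each unique string value (the real Lucene memory layout differs in detail, so treat this only as a rough lower bound):

```python
# Rough estimate of sort-field memory: one 4-byte int per document
# (maxDoc) plus storage for each unique string value. Assumption: this
# ignores per-object and index overhead in the real implementation.
def sort_field_memory_bytes(max_doc, unique_values):
    ordinals = 4 * max_doc
    strings = sum(len(v.encode("utf-8")) for v in unique_values)
    return ordinals + strings

# 500 million documents sorted on a field with 1,000 unique 16-byte codes:
est = sort_field_memory_bytes(500_000_000, ["x" * 16 for _ in range(1000)])
print(est / 2**30, "GiB")  # about 1.86 GiB, dominated by the ordinals
```

Note that the per-document ordinals dominate: even with few unique strings, sorting cost grows linearly with maxDoc.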
- Will there be an “advanced” search capability? If so, how will it be implemented? You must make significant design decisions based on user expectations:
- Can users be expected to learn about the system? “Advanced” screens may intimidate e-commerce users, but such screens may be most effective if users can be expected to learn them.
- How long will your users wait for results? Data mining use cases often result in longer wait times. You do not want users to wait any longer than necessary, but the response times your users will accept influence other design decisions.
- How many simultaneous users must your system accommodate?
- Update requirements. An update in Solr refers both to adding new documents and to changing existing documents.
- Loading new documents.
- Bulk. Are there cases where the index has to be rebuilt from scratch or will there only be an initial load?
- Incremental. At what rate will new documents enter the system?
- Updating documents. Can you characterize the expected number of modifications to existing documents?
- How much latency is acceptable between when a document is added to Solr and when it is available in Search results?
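One standard way to trade the latency mentioned above against commit cost is Solr's `commitWithin` update parameter, which bounds how long (in milliseconds) a newly added document can remain invisible to searches while still letting Solr batch commits. The host and collection names below are assumptions for illustration:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical update endpoint; "solr-host" and the "logs" collection
# are assumed names. commitWithin=10000 asks Solr to make the added
# documents searchable within 10 seconds, without an explicit commit
# after every document.
base = "http://solr-host:8983/solr/logs/"
update_url = urljoin(base, "update") + "?" + urlencode({"commitWithin": 10000})
# update_url -> http://solr-host:8983/solr/logs/update?commitWithin=10000
```

Larger `commitWithin` values reduce commit overhead during bulk loads; smaller values reduce the add-to-searchable latency.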
- Security requirements. Solr has no built-in security options, although Cloudera Search supports authorization using Sentry and authentication using Kerberos. In Solr, document-level security is usually best accomplished by indexing authorization tokens along with the document. The number of authorization tokens applied to a document is largely irrelevant; thousands are reasonable, though such large numbers can be difficult to administer. The number of authorization tokens associated with a particular user should be much smaller; 100 is often a good upper limit. This is because security at this level is usually enforced by appending an “fq” clause to the query, and putting thousands of tokens in an “fq” clause is expensive.
- There is a post filter, also known as a no-cache filter, that can help with access schemes that cannot use the “fq” approach. Post filters are not cached and are applied only after all less-expensive filters have been applied.
- If grouping and faceting are not required to accurately reflect true document counts, some shortcuts can be taken. For example, ACL filtering is notoriously expensive in some systems, sometimes requiring database access. If completely accurate facet counts are required, you must fully process the result list to produce accurate facets.
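A minimal sketch of the token-based scheme above, assuming a hypothetical `acl_tokens` field indexed with each document and a small set of tokens per requesting user:

```python
from urllib.parse import urlencode

# Hypothetical ACL filter: build an fq clause matching any of the user's
# authorization tokens. The field name "acl_tokens" and the token values
# are assumptions; keep the per-user token set small (on the order of
# 100 or fewer), since an fq clause with thousands of tokens is expensive.
def acl_filter(user_tokens):
    return " OR ".join(f"acl_tokens:{t}" for t in sorted(user_tokens))

params = urlencode([
    ("q", "error AND disk"),
    ("fq", acl_filter({"grp_ops", "grp_search", "user_jdoe"})),
])
# The fq clause restricts results to documents the user may see, and
# Solr can cache the filter independently of the main query.
```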
- Required query rate, usually measured in queries per second (QPS).
- At a minimum, deploy machines with sufficient hardware resources to provide a reasonable response rate for a single user. It is possible to create queries that place such demands on the system that even a few users do not receive acceptable performance. In such a case, re-sharding is necessary.
- If QPS is only somewhat slower than required and you do not want to reshard, you can get some performance improvement by adding replicas to each shard.
- As the number of shards in your deployment increases, so too does the likelihood that one of the shards will be unusually slow. In such a case, the overall QPS rate falls, though very slowly. This typically occurs as the number of shards reaches the scale of hundreds.
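The slow-shard effect above can be illustrated with a toy simulation (latency figures are invented, not benchmarks): a distributed query must wait for its slowest shard, so the chance of hitting at least one slow shard grows with the shard count:

```python
import random

random.seed(7)

# Toy model: each shard is fast (20 ms) 99% of the time and slow
# (500 ms) 1% of the time; the query latency is the max across shards.
def query_latency(num_shards, slow_prob=0.01, fast_ms=20, slow_ms=500):
    return max(slow_ms if random.random() < slow_prob else fast_ms
               for _ in range(num_shards))

def mean_latency(num_shards, trials=2000):
    return sum(query_latency(num_shards) for _ in range(trials)) / trials

# Mean latency climbs as shards are added, even though each shard is
# individually fast 99% of the time.
for n in (5, 50, 500):
    print(n, round(mean_latency(n), 1))
```

With hundreds of shards, nearly every query encounters at least one slow shard, which is why overall throughput degrades at that scale.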