

Top 15 Solr vs. Elasticsearch Differences

Solr vs. Elasticsearch. Elasticsearch vs. Solr.  Which one is better? How are they different? Which one should you use?

Before we start, check out two useful Cheat Sheets to guide you through both Solr and Elasticsearch, boost your productivity, and save you time when you’re working with either of these two open-source search engines.

These two are the leading, competing open-source search engines known to anyone who has ever looked into (open-source) search.  They are both built around the same core underlying search library – Lucene – but they are different.  Like everything, each has its own set of strengths and weaknesses, and each may be a better or worse fit depending on your needs and expectations. In the past, we’ve covered Solr and Elasticsearch differences in Solr Elasticsearch Comparison and in various conference talks, such as Side by Side with Elasticsearch and Solr: Performance and Scalability given at Berlin Buzzwords. Both Solr and Elasticsearch are evolving rapidly so, without further ado, here is up-to-date information about their top differences.

| Feature | Solr/SolrCloud | Elasticsearch |
|---------|----------------|---------------|
| Community & Developers | Apache Software Foundation and community support | Single commercial entity and its employees |
| Node Discovery | Apache ZooKeeper, mature and battle-tested in a large number of projects | Zen, built into Elasticsearch itself; requires dedicated master nodes to be split-brain proof |
| Shard Placement | Static in nature; requires manual work to migrate shards | Dynamic; shards can be moved on demand depending on the cluster state |
| Caches | Global, invalidated with each segment change | Per segment, better for dynamically changing data |
| Analytics Engine | Facets and powerful streaming aggregations | Sophisticated and highly flexible aggregations |
| Optimized Query Execution | Currently none | Faster range queries, depending on the context |
| Search Speed | Best for static data, because of caches and the uninverted reader | Very good for rapidly changing data, because of per-segment caches |
| Analysis Engine Performance | Great for static data, with exact calculations | Exactness of the results depends on data placement |
| Full Text Search Features | Language analysis based on Lucene; multiple suggesters, spell checkers, rich highlighting support | Language analysis based on Lucene; single suggest API implementation, highlighting rescoring |
| DevOps Friendliness | Not fully there yet, but coming | Very good APIs |
| Non-flat Data Handling | Nested documents and parent-child support | Natural support with nested and object types allowing for virtually endless nesting, plus parent-child support |
| Query DSL | JSON (limited), XML (limited) or URL parameters | JSON |
| Index/Collection Leader Control | Leader placement control and leader rebalancing to even out the load on the nodes | Not possible |
| Machine Learning | Built-in, on top of streaming aggregations; focused on logistic regression and a learning-to-rank contrib module | Commercial feature, focused on anomalies, outliers and time-series data |
| Ecosystem | Modest – Banana, Zeppelin with community support | Rich – Kibana, Grafana, with large-entity support and a big user base |

Now that we know the top 15 differences, let’s discuss each of them in greater detail.

Community and Developers

The first major difference between Solr and Elasticsearch is how they are developed, maintained and supported. Solr, being a project of the Apache Software Foundation, is developed with the ASF philosophy in mind – community over code. Solr code is not always beautiful, but once a feature is there it usually stays and is not removed from the code base. Also, the committers come from different companies and there is no single company controlling the code base. You can become a committer if you show your interest in and continued support for the project. On the other hand, we have Elasticsearch, backed by a single entity – the Elastic company. The code is available under the Apache 2.0 software license and is open and available on GitHub, so you can take part in the development by submitting pull requests, but the community is not the one that decides what will get into the code base and what will not.  Also, to become a committer, you have to be part of the Elastic company itself.

Node Discovery

Another major difference between these two great products is node discovery. When the cluster is initially formed, when a new node joins, or when something bad happens to a node, something in the cluster has to decide, based on the given criteria, what should be done. This is one of the responsibilities of so-called node discovery. Elasticsearch uses its own discovery implementation called Zen that, for full fault tolerance (i.e., not being affected by network splits), requires three dedicated master nodes. Solr uses Apache ZooKeeper for discovery and leader election. This requires an external ZooKeeper ensemble, which for a fault-tolerant and fully available SolrCloud cluster means at least three ZooKeeper instances.
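As a rough sketch of what this looks like in practice (node names are made up, and the settings are from the Zen-discovery era of Elasticsearch), the split-brain protection mentioned above is configured directly in `elasticsearch.yml`:

```yaml
# elasticsearch.yml – Zen discovery with three dedicated master-eligible nodes
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
# quorum of master-eligible nodes, to stay split-brain proof
discovery.zen.minimum_master_nodes: 2
```

SolrCloud, by contrast, is simply pointed at the external ensemble, e.g. by setting `ZK_HOST=zk1:2181,zk2:2181,zk3:2181` in `solr.in.sh`.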

Shard Placement

Generally speaking, Elasticsearch is very dynamic as far as the placement of indices and the shards they are built of is concerned. It can move shards around the cluster when a certain action happens – for example, when a new node joins or a node is removed from the cluster. We can control where a shard should and shouldn’t be placed by using awareness tags, and we can tell Elasticsearch to move shards around on demand using an API call. Solr, on the other hand, is a bit more static. When a Solr node joins or leaves the cluster, Solr doesn’t do anything on its own; it is up to us to rebalance the data. Of course, we can move shards, but it involves several steps – we need to create a replica, wait for it to synchronize the data, and then remove the one that we no longer need. There is one thing that lets us automate this a bit – removing or replacing a node in SolrCloud using the Collections API, which is a quick way of removing all shards from a node or replicating them to another node. This still requires a manual API call, though; it is not something that happens automatically.
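To make the “on demand” part concrete, here is a hedged sketch of an Elasticsearch shard-move request (index and node names are made up), sent as the body of `POST /_cluster/reroute`:

```json
{
  "commands": [
    {
      "move": {
        "index": "logs-2017.06.26",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}
```

On the Solr side, the closest single call is the Collections API action mentioned above, along the lines of `/admin/collections?action=REPLACENODE&sourceNode=node1:8983_solr&targetNode=node2:8983_solr`.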

Caches

Yet another big difference is the architecture of the two discussed search engines.  Without getting deep into how the caches work in both products, we will point out just the major difference between them. Let’s start with what a segment is. A segment is a piece of the Lucene index that is built of various files, is mostly immutable, and contains data. When you index data, Lucene produces segments and can also merge multiple smaller, already existing ones into larger ones during a process called segment merging. The caches in Solr are global – a single cache instance of a given type per shard, covering all its segments. When a single segment changes, the whole cache needs to be invalidated and refreshed, which takes time and consumes hardware resources. In Elasticsearch caches are per segment, which means that if only a single segment changed, then only a small portion of the cached data needs to be invalidated and refreshed. We will get to the pros and cons of each approach soon.

Analytics Engine

Solr is large and has a lot of data analysis capabilities. We can start with the good, old facets – the first implementation that allowed users to slice and dice through the data to understand and get to know it. Then came the JSON facets, with similar features but faster and less memory-demanding, and finally the stream-based expressions called streaming expressions, which can combine data from multiple sources (like SQL, Solr, facets) and decorate them using various expressions (sort, extract, count significant terms, etc.). Elasticsearch provides a powerful aggregations engine that not only can do one-level data analysis like most of the legacy Solr facets, but can also nest data analysis (e.g., calculate the average price for each product category in each shop division) and supports analysis on top of aggregation results, which leads to functionality like moving-average calculation. Finally, though marked as experimental, Elasticsearch provides support for matrix aggregations, which can compute statistics over a set of fields.
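The nested analysis example above (average price per product category in each shop division) translates into an aggregations sketch like this (field names are made up), posted as the body of a `_search` request:

```json
{
  "size": 0,
  "aggs": {
    "by_division": {
      "terms": { "field": "division" },
      "aggs": {
        "by_category": {
          "terms": { "field": "category" },
          "aggs": {
            "avg_price": { "avg": { "field": "price" } }
          }
        }
      }
    }
  }
}
```

Each `terms` bucket carries its own `avg_price`, the kind of multi-level drill-down that legacy Solr facets could not express in a single request.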

Optimized Query Execution

When dealing with time-based data, range queries are very common and can become a bottleneck because of the amount of data they need to process to match the given search results. With the recent releases of Elasticsearch, for fields that have doc values enabled (like numeric fields), Elasticsearch is able to choose whether to iterate over all the documents or only match a particular set of documents. With this logic inside the search engine, Elasticsearch can provide very efficient range queries without any data modifications. Hopefully we will see similar functionality in Solr as well.
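For illustration, the kind of query that benefits is a plain range query over a doc-values field (the field name is made up):

```json
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1h",
        "lt": "now"
      }
    }
  }
}
```

Depending on how selective the rest of the query is, Elasticsearch can resolve such a clause either through the index structures or through doc values, whichever is cheaper for that particular search.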

Search Speed

Some time ago we did a few comparisons of Solr and Elasticsearch and the results were pretty clear. Solr is awesome when it comes to static data – for example, e-commerce – because of its caches and its ability to use an uninverted reader for faceting and sorting. Elasticsearch is great in rapidly changing environments, like log analysis use cases. If you want to learn more, check out the video of two of our engineers, Radu and Rafał, giving the Side by Side with Elasticsearch & Solr Part 2 – Performance and Scalability talk at Berlin Buzzwords 2015.

Analysis Engine Performance & Precision

When you have mostly static data and need full precision for data analysis along with blazingly fast performance, you should look at Solr. In the tests we did for some conference talks (like the aforementioned Side by Side with Elasticsearch & Solr Part 2 – Performance and Scalability talk at Berlin Buzzwords 2015), we saw that on static data Solr was awesome. What’s more, facets in Solr are precise and do not lose precision, which is not always true of Elasticsearch. In certain edge cases, you may find Elasticsearch aggregation results to be imprecise, because of how the data is placed across shards.

Full Text Search Features

The richness of full text search features, and of features close to full text searching, is enormous when you look into the Solr code base. Our Solr training classes are chock-full of this stuff!  It ranges from a wide selection of request parsers, through various suggester implementations, to the ability to correct user spelling mistakes using spell checkers, and extensive, highly configurable highlighting support. In Elasticsearch we have a dedicated suggesters API, which hides the implementation details from the user and gives us an easier way of implementing suggestions at the cost of reduced flexibility, and of course highlighting, which is less configurable than highlighting in Solr (though both are based on Lucene highlighting functionality).
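As a small, hedged example of the Elasticsearch suggest API (the suggester name and field are made up), a term suggester that corrects a misspelled word is just a section of the `_search` request body:

```json
{
  "suggest": {
    "text": "elasticsaerch",
    "spelling": {
      "term": { "field": "title" }
    }
  }
}
```

In Solr the equivalent functionality lives in dedicated components – e.g. the SpellCheckComponent, enabled with parameters like `spellcheck=true&spellcheck.q=elasticsaerch` – each exposing many more knobs to turn.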

DevOps Friendliness

If you were to ask a devops person what (s)he loves about Elasticsearch, the answer would be the API, manageability and ease of installation. When it comes to troubleshooting, it is just easy to get information about Elasticsearch’s state – from disk usage information, through memory and garbage collection statistics, to internals like cache, buffer and thread pool utilization. Solr is not there yet – you can get some of this information via JMX MBeans and the new Solr Metrics API, but this means there are a few places one must look, and not everything is available there, though it’s getting better.
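A few of the Elasticsearch endpoints a devops person would reach for (a non-exhaustive sketch):

```
GET /_cluster/health            # overall cluster status and shard counts
GET /_nodes/stats               # per-node JVM, cache and buffer statistics
GET /_cat/thread_pool?v         # human-readable thread pool utilization
```

On the Solr side, the Metrics API mentioned above is served from `/solr/admin/metrics` and returns similar JVM- and core-level statistics.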

Non-flat Data Handling

You have non-flat data, with nested objects inside nested objects inside yet more nested objects, and you don’t want to flatten the data, but just index your beautiful MongoDB JSON objects and have them ready for full text searching? Elasticsearch will be a perfect tool for that, with its support for objects, nested documents and parent-child relationships. Solr may not be the best fit here, but remember that it also supports parent-child and nested documents, both when indexing XML documents and when indexing JSON. Also, there is one more very important thing – Solr supports query-time joins within and across different collections, so you are not limited to index-time parent-child handling.
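As a hedged sketch of the Elasticsearch side (type, field and structure are made up, in the pre-7.x mapping format that still includes a type name), a nested field is declared in the index mapping:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "text" },
        "comments": {
          "type": "nested",
          "properties": {
            "author": { "type": "keyword" },
            "body":   { "type": "text" }
          }
        }
      }
    }
  }
}
```

Solr’s query-time join mentioned above uses a local-params syntax instead, along the lines of `q={!join from=parent_id to=id}body:search` (field names again made up).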

Query DSL

Let’s say it out loud – the query language of Elasticsearch is really great. If you love JSON. It lets you structure the query using JSON, so it will be well structured and give you control over the whole logic. You can mix different kinds of queries to write very sophisticated matching logic. Of course, full text search is not everything – you can include aggregations, result collapsing and so on; basically everything you need from your data can be expressed in the query language. Solr, on the other hand, is still using URI search, at least in its most widely used API (there is also a limited JSON API and an XML query parser available). All the parameters go into the URI, which can lead to long and complicated queries. Both approaches have their pros and cons, and novice users tend to need help with queries on both search engines.
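To see the difference, here is the same hedged example – a phrase match plus a filter and a facet/aggregation (collection, field names and values are all made up) – in both dialects. Solr, via URL parameters:

```
/solr/products/select?q=title:"search engine"&fq=category:books&facet=true&facet.field=tags&rows=10
```

And Elasticsearch, as a JSON body for `_search`:

```json
{
  "size": 10,
  "query": {
    "bool": {
      "must":   { "match_phrase": { "title": "search engine" } },
      "filter": { "term": { "category": "books" } }
    }
  },
  "aggs": {
    "tags": { "terms": { "field": "tags" } }
  }
}
```

The JSON form is more verbose but self-describing; the URI form is terse but quickly becomes hard to read as clauses pile up.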

Index/Collection Leader Control

While Elasticsearch is dynamic in nature when it comes to shard placement around the cluster, it doesn’t give us much control over which shards will take the role of primaries and which will be replicas – that is beyond our control. In Solr you have that control, which is a very good thing when you consider that during indexing the leaders do more work, because they forward the data to all their replicas. With the ability to rebalance the leaders, or to explicitly say where they should be put, we can balance the load across the cluster by providing exact information about where the leader shards should be placed.
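In Solr this control is exposed through the Collections API – roughly like this (collection, shard and replica names are made up): first mark a replica as the preferred leader, then ask Solr to rebalance leadership accordingly:

```
# mark a replica as the preferred leader for its shard
/admin/collections?action=ADDREPLICAPROP&collection=products&shard=shard1&replica=core_node3&property=preferredLeader&property.value=true

# move leadership onto the preferred replicas
/admin/collections?action=REBALANCELEADERS&collection=products
```

Nothing comparable exists in the Elasticsearch API – primary election is handled entirely internally.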

Machine Learning

Machine learning is a trending topic about which you will hear even more in the coming months and years. In Solr it comes for free, in the form of a contrib module and on top of the streaming aggregations framework. With the additional libraries in the contrib module you can use machine-learned ranking models and feature extraction on top of Solr, while the streaming-aggregations-based machine learning is focused on text classification using logistic regression. On the other hand, we have Elasticsearch and its commercial X-Pack plugin, which comes with a Kibana plugin supporting machine learning algorithms focused on anomaly and outlier detection in time-series data. It’s a nice package of tools bundled with professional services, but quite pricey. Thus, we unpacked the X-Pack and listed available X-Pack alternatives: open-source tools, commercial alternatives, and cloud services.

Ecosystem

When it comes to the ecosystem, the tools that come with Solr are nice, but they feel modest. We have the Kibana port called Banana, which went its own way, and tools like the Apache Zeppelin integration that allows running SQL on top of Apache Solr. Of course there are other tools which can read data from Solr, send data to Solr, or use Solr as a data source – like Flume, for example. Most of these tools are developed and supported by a wide variety of enthusiasts. The ecosystem around Elasticsearch, by contrast, is very modern and well sorted. You have a new version of Kibana with new features popping up every month. If you don’t like Kibana, there is Grafana, which is now a product of its own providing a wide variety of features, plus a long list of data shippers and tools that can use Elasticsearch as a data source. Finally, those products are backed not only by enthusiasts but also by large commercial entities.

This is obviously not an exhaustive list of Solr and Elasticsearch differences. We could go on for several blog posts and make a book out of it, but hopefully the above list gave you an idea of what to expect from one and the other.

Want to learn more about Solr or Elasticsearch?

Don’t forget to download the Cheat Sheet you need. Here they are:

Then, subscribe to our blog or follow @sematext. If you need any help with Solr / SolrCloud or Elasticsearch – don’t forget that we provide Solr & Elasticsearch Consulting, Solr & Elasticsearch Production Support, and offer both Solr & Elasticsearch Training.


