Leveraging the power of SolrCloud and Spark with OpenShift

Apache Spark is one of the most commonly used big data processing frameworks. It handles large datasets by parallelizing the work across a cluster. Solr is a search platform based on Lucene; in SolrCloud mode it can be distributed across a cluster, using ZooKeeper for configuration management. Combined, the two make it possible to build performant big data applications.

But what if you want to scale out horizontally and add a node? In a manual setup, you’d have to install and configure each new node by hand. Cluster orchestrators like OpenShift claim to solve this problem. This talk shows how to put Spark, Solr and ZooKeeper into containers, which can then be scaled individually inside a cluster using OpenShift. We will cover OpenShift details like DeploymentConfigs, StatefulSets, Services, Routes and Persistent Volumes, and install a complete, failsafe and horizontally scalable SolrCloud / Spark / ZooKeeper cluster in seconds. You will also learn about the drawbacks and pitfalls of running big data applications inside an OpenShift cluster.