Introduction to Apache Spark

Pratik
2 min read · Mar 31, 2021

Apache Spark is a unified engine designed for large-scale distributed data processing on premises in data centers or in the cloud.

Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. Spark incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing of real-time data (Structured Streaming), and graph processing (GraphX).
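To get a feel for how these APIs compose, here is a minimal PySpark sketch; the application name, the toy data, and the query are mine for illustration, not from the post:

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame and SQL APIs.
spark = SparkSession.builder.appName("intro").getOrCreate()

# A small in-memory DataFrame stands in for real data here.
df = spark.createDataFrame(
    [("click", 3), ("view", 5), ("click", 7)], ["event_type", "value"]
)

# The same engine serves both the DataFrame API and SQL.
df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```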

Spark’s design philosophy centers around four key characteristics:

  • Speed
  • Ease of use
  • Modularity
  • Extensibility

Speed:

First, Spark’s internal implementation benefits immensely from the hardware industry’s recent huge strides in price and performance.

Second, Spark builds its query computations as a directed acyclic graph (DAG); its DAG scheduler and query optimizer construct an efficient computational graph that can usually be decomposed into tasks that are executed in parallel across workers on the cluster.

And third, its physical execution engine, Tungsten, uses whole-stage code generation to generate compact code for execution.
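You can watch both of these at work by asking Spark to print a query's physical plan. A small sketch, with a made-up query; operators fused by whole-stage code generation appear prefixed with an asterisk, e.g. *(1):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations are only recorded into a logical plan (a DAG) here;
# nothing executes until an action is called.
df = (spark.range(1_000_000)
           .filter(col("id") % 2 == 0)
           .selectExpr("id * 2 AS doubled"))

# explain() prints the optimized physical plan chosen by the query
# optimizer, including the whole-stage-codegen markers.
df.explain()
```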

Ease of Use:

Spark achieves simplicity by providing a fundamental abstraction of a simple logical data structure called a Resilient Distributed Dataset (RDD), upon which all other higher-level structured data abstractions, such as DataFrames and Datasets, are constructed. By providing a set of transformations and actions as operations, Spark offers a simple programming model that you can use to build big data applications in familiar languages.
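A short sketch of that programming model on RDDs, assuming a local session; the numbers are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (filter, map) are lazy: they only describe the computation.
rdd = sc.parallelize(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions (collect, count) trigger the actual distributed execution.
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```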

Modularity:

You can write a single Spark application that can do it all — no need for distinct engines for disparate workloads, no need to learn separate APIs. With Spark you get a unified processing engine for your workloads.
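As a rough illustration of that unification, here is one sketch that moves from the DataFrame API straight into MLlib within a single application; the toy data and column names are mine, not the post's:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("unified").getOrCreate()

# Data prepared with the DataFrame API...
df = spark.createDataFrame([(1.0, 2.1), (2.0, 4.0), (3.0, 6.2)], ["x", "y"])
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# ...feeds MLlib in the same application; no separate engine or API family.
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients)
```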

Extensibility:

Spark focuses on its fast, parallel computation engine rather than on storage. Unlike Apache Hadoop, which includes both storage and compute, Spark decouples the two. That means you can use Spark to read data stored in myriad sources and process it all in memory.
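As a sketch of that decoupling, the same read API targets many storage systems; every path below is a placeholder, not a real location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources").getOrCreate()

# One read API, many backends; all paths here are hypothetical examples.
parquet_df = spark.read.parquet("s3a://bucket/warehouse/table/")         # object store
csv_df = spark.read.option("header", True).csv("hdfs:///data/raw.csv")   # HDFS
json_df = spark.read.json("file:///tmp/events.json")                     # local file

# Once loaded, each source is processed in memory by the same engine.
json_df.printSchema()
```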
