Apache Spark Concepts – Everything you need to know

If you are a beginner, understanding the basics of Apache Spark will help you build a strong foundation before you move on to more complex concepts. These concepts are often intertwined with new terminology. Associating the terms with each other in a hierarchical manner builds relationships and helps you assimilate the information effectively, and in a shorter time frame.

The diagram below is a mind map of the key Apache Spark concepts. A mind map is a technique used to organize information visually, dramatically increasing the brain's ability to retain information. For more information on mind mapping, read this

The central idea in the mind map below is Apache Spark, with Resilient Distributed Datasets (RDD) and Spark SQL as the main branches. Each of those main branches has secondary and tertiary branches radiating from it, denoting the relationships between the Apache Spark concepts. Print a copy of the diagram below and refer to it until these Spark concepts become second nature.

[Mind map: Apache Spark concepts]

Resilient Distributed Dataset (RDD)

An RDD is a collection of elements partitioned and distributed across multiple nodes in a cluster, which makes it resilient by nature.

1. External Datasets

In Apache Spark, an RDD can be created in one of two ways. Creating an RDD from an external dataset is one such method, in which a data source external to Apache Spark (the local file system, HDFS, or a NoSQL database, for example) is used to create the RDD.
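As a quick sketch (assuming a spark-shell session, where the SparkContext is pre-created as sc, and a hypothetical local text file), an external dataset can be loaded like this:

```scala
// Assumes spark-shell, where the SparkContext is pre-created as `sc`.
// "data/sample.txt" is a hypothetical file used purely for illustration.
val lines = sc.textFile("data/sample.txt")  // one RDD element per line of the file
println(lines.count())                      // action: counts the lines
```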

2. Parallelized Collections

This is an alternative to the external datasets method. With parallelized collections, sample data can be typed in or copy-pasted to create an RDD on the fly.
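For example, a small local collection can be parallelized directly in the shell (again assuming spark-shell provides sc):

```scala
// Assumes spark-shell with SparkContext `sc`; the numbers are made up for illustration.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))  // distribute a local collection as an RDD
println(numbers.reduce(_ + _))                    // 15
```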

3. Actions

An action is an operation performed on an RDD that returns a value. For example, the count() action returns the number of elements in the RDD. For a list of Apache Spark actions, click here.
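A minimal sketch of a few common actions, assuming spark-shell with sc available:

```scala
// Assumes spark-shell with SparkContext `sc`.
val words = sc.parallelize(Seq("spark", "rdd", "action", "spark"))
println(words.count())          // 4 -- number of elements
println(words.first())          // "spark" -- first element
words.take(2).foreach(println)  // first two elements
```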

4. Transformations

A transformation is an operation performed on an RDD which results in the creation of a new dataset. The filter transformation, for example, takes an existing dataset and reduces it based on the filter criteria; the result is a new dataset.

map, filter, distinct, groupByKey, reduceByKey, and join are some of the most commonly used transformations on an RDD in Apache Spark.
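Here is a short sketch showing that transformations are lazy and only an action triggers computation (assuming spark-shell with sc):

```scala
// Assumes spark-shell with SparkContext `sc`.
val nums    = sc.parallelize(1 to 10)
val evens   = nums.filter(_ % 2 == 0)      // transformation: defines a new RDD, nothing runs yet
val doubled = evens.map(_ * 2)             // another lazy transformation
println(doubled.collect().mkString(", "))  // action: triggers evaluation -> 4, 8, 12, 16, 20
```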

For a detailed list of Apache Spark transformations, click here.

5. Key-Value Pairs

A Key-Value pair is a representation of a data value and its attribute as a set. The data attribute often uniquely identifies the value, hence the term Key. 

In Apache Spark, there are more than a handful of operations that work on Key-Value pairs; aggregateByKey, combineByKey, and lookup are some examples.
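As a rough example (assuming spark-shell with sc and made-up sales data), reduceByKey and lookup on a pair RDD look like this:

```scala
// Assumes spark-shell with SparkContext `sc`; the data is made up for illustration.
val sales  = sc.parallelize(Seq(("apples", 3), ("oranges", 2), ("apples", 5)))
val totals = sales.reduceByKey(_ + _)      // sum the values for each key
totals.collect().foreach(println)          // (apples,8), (oranges,2)
println(totals.lookup("apples"))           // all values for the "apples" key
```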

For a full list of Apache Spark Key-Value pair operations, click here.

Apache Spark SQL

Spark SQL is a module in Apache Spark used for processing structured data. It provides the capability to interact with data using Structured Query Language (SQL) or the Dataset application programming interface.

The main benefit of the Spark SQL module is that it brings the familiarity of SQL for interacting with data. For a comprehensive list of  Spark SQL functions, click here. 
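A minimal sketch of querying data with SQL, assuming spark-shell (where the SparkSession is pre-created as spark) and made-up sample data:

```scala
// Assumes spark-shell, where the SparkSession is pre-created as `spark`.
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")   // expose the DataFrame as a SQL view

spark.sql("SELECT name FROM people WHERE age > 30").show()
```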

1. Datasets

A Dataset in the context of Apache Spark SQL is a collection of data distributed over one or more partitions. 
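For instance, a typed Dataset can be built from a case class (the Person class and data here are purely illustrative, again assuming spark-shell with spark available):

```scala
// Assumes spark-shell with SparkSession `spark`; Person is a made-up example class.
case class Person(name: String, age: Int)
import spark.implicits._

val ds = Seq(Person("Alice", 34), Person("Bob", 29)).toDS()  // typed Dataset[Person]
ds.filter(_.age > 30).show()                                 // fields are accessed in a type-safe way
```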

2. DataFrames

A Spark SQL DataFrame is a Dataset of rows distributed over one or more partitions, arranged into named columns. It is analogous to a table in a relational database.
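A quick sketch of creating and querying a DataFrame, assuming spark-shell with spark and made-up data:

```scala
// Assumes spark-shell with SparkSession `spark`.
import spark.implicits._

val df = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")  // rows with named columns
df.printSchema()
df.select("name").where($"age" > 30).show()
```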

Data Sources

Spark SQL can operate on data in a number of different formats, including Parquet, JSON, Hive, and ORC. Spark SQL loads data in these formats into a DataFrame, which can then be queried using SQL or transformations.

1. Parquet files

Apache Parquet is a columnar storage format from the Apache Software Foundation and is heavily used in the Hadoop ecosystem. The advantage of Parquet lies in its support for nested data structures and efficient compression. For documentation and examples on Parquet usage, click here
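A rough sketch of writing and reading Parquet with Spark SQL (assuming spark-shell with spark; the path and data are hypothetical):

```scala
// Assumes spark-shell with SparkSession `spark`; the path is hypothetical.
import spark.implicits._

val df = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
df.write.parquet("people.parquet")                    // stored in columnar, compressed form
val parquetDF = spark.read.parquet("people.parquet")  // schema is preserved in the file
parquetDF.show()
```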

2. JSON Datasets

JSON, short for JavaScript Object Notation, is a format recognized for its readability by both humans and machines. The core abstractions in JSON are its name/value pair collections and ordered lists of values.

Spark SQL has built-in functionality to recognize a JSON schema and load the data into a Spark SQL Dataset.
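As a sketch, assuming spark-shell with spark and a hypothetical line-delimited JSON file:

```scala
// Assumes spark-shell with SparkSession `spark`.
// "people.json" is a hypothetical file with one JSON object per line,
// e.g. {"name":"Alice","age":34}
val jsonDF = spark.read.json("people.json")  // schema is inferred from the JSON
jsonDF.printSchema()
jsonDF.show()
```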

For documentation on JSON structure, click here

3. Hive Tables

Hive is a data warehouse application from the Apache Software Foundation. Tables in Hive can be read by Apache Spark SQL for analysis, summarization, and other related processing.
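A minimal sketch of reading a Hive table from Spark SQL; this assumes Spark was built with Hive support and can reach a Hive metastore, and the database and table names are made up:

```scala
// Assumes Hive support is available and a metastore is configured.
// The sales.orders table is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()   // enables reading and writing Hive tables
  .getOrCreate()

spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```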

For documentation on Apache Hive tables, and DDL examples, click here

4. ORC Files

ORC, short for Optimized Row Columnar, is another columnar data format supported by Apache Spark. Some of ORC's features include built-in indexes, complex types, and ACID support.
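A quick sketch of writing and reading ORC, assuming spark-shell with spark and a hypothetical path:

```scala
// Assumes spark-shell with SparkSession `spark`; the path is hypothetical.
import spark.implicits._

val df = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
df.write.orc("people.orc")           // write in the ORC columnar format
spark.read.orc("people.orc").show()  // read it back into a DataFrame
```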

For more details on ORC , click here.

Spark SQL Related Resources

Programming Guide

The official Apache Spark v2.3.2 Spark SQL Programming Guide.
