This post elaborates on Apache Spark transformation and action operations, providing a step-by-step walkthrough of Spark examples in Scala.
Before you dive into these examples, make sure you know some of the basic Apache Spark concepts. These examples are in no particular sequence and form the first part of our five-part Spark Scala examples post. They assume you have an Apache Hadoop ecosystem set up and have some sample data files created. If you do not have Apache Hadoop installed, consider downloading and installing the QuickStart VM by Cloudera; it is the simplest and quickest way to get up and running with minimal headaches.
Let us start by looking at 4 Spark examples on using spark-shell to connect and run Spark.
If you are struggling to figure out how to run a Spark Scala program, this section gets straight to the point.
The first step in writing an Apache Spark application (program) is to initialize its configuration and connect to the cluster. SparkContext is the gateway to all Spark functionality.
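For a standalone application (outside the shell), you create the SparkContext yourself. Below is a minimal sketch, assuming the classic RDD API; the application name "SimpleApp" and the local[*] master are placeholders for illustration, since the master would normally be supplied by spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // initialize the configuration: application name and master URL (local mode here only for illustration)
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
    // SparkContext is the gateway to Spark functionality
    val sc = new SparkContext(conf)

    // a trivial job to confirm the context works
    val data = sc.parallelize(1 to 100)
    println("Sum: " + data.sum())

    sc.stop()
  }
}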
For beginners, the best and simplest option is to use the Scala shell (spark-shell), which automatically creates a SparkContext for you. Below are 4 Spark examples on how to connect and run Spark.
To log in to the Scala shell, at the command line interface, type “/bin/spark-shell”.
To log in and run Spark locally without parallelism: “/bin/spark-shell --master local”.
To log in and run Spark locally in parallel mode, setting the parallelism level to the number of cores on your machine: “/bin/spark-shell --master local[*]”.
To log in and connect to YARN in client mode: “/bin/spark-shell --master yarn-client”.
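Once the shell starts, the SparkContext it created is available as the variable sc. As a quick sanity check, you might type something like the following at the shell prompt:

sc.version   // prints the Spark version the shell is running
sc.master    // shows which master the shell connected to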
Before we look at a couple of examples, it is important to understand what parallelize means in Spark.
Parallelize is a method that takes a local collection and distributes it across the cluster as an RDD, splitting it into partitions so it can be processed in parallel. The syntax is sc.parallelize(data, p), where ‘p’ is the number of partitions to split the data into.
For the most part, the number of partitions is set automatically, so you do not need to worry about it.
In the Spark Scala examples below, we look at parallelizing a sample set of numbers, a List and an Array.
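As a sketch of what those examples might look like in the shell (sc is already available; the variable names and values here are made up for illustration):

// parallelize a simple range of numbers, letting Spark choose the number of partitions
val numbersRDD = sc.parallelize(1 to 10)

// parallelize a List, explicitly requesting 2 partitions
val listRDD = sc.parallelize(List("apple", "banana", "cherry"), 2)

// parallelize an Array
val arrayRDD = sc.parallelize(Array(1.5, 2.5, 3.5))

// inspect how many partitions were created
numbersRDD.partitions.length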
Creating an RDD in Apache Spark requires data. In Spark, there are two ways to acquire this data: parallelized collections and external datasets. Data not already in an RDD is classified as an external dataset and includes flat files, binary files, sequence files, HDFS files, HBase, Cassandra, or data in almost any other format.
The 3 Spark examples listed below show you the most common ways to read data from HDFS.
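As a rough sketch of three common approaches (the HDFS paths below are placeholders, not paths from this post):

// read a text file line by line; each element of the RDD is one line
val lines = sc.textFile("hdfs://namenode:8020/user/cloudera/sample.txt")

// read a directory of small files; each element is a (filePath, fileContents) pair
val files = sc.wholeTextFiles("hdfs://namenode:8020/user/cloudera/data/")

// read a Hadoop SequenceFile of key/value pairs
val seqData = sc.sequenceFile[String, String]("hdfs://namenode:8020/user/cloudera/seq/")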
Note: The RDD returned by wholeTextFiles is in (key, value) pair format, where the key is the path of each file and the value is the entire contents of that file.
A Spark filter is a transformation operation that takes an existing dataset, applies a predicate (Boolean) function to each element, and returns a new dataset containing only the elements for which the predicate returns true.
Conceptually, this is similar to applying a column filter in an Excel spreadsheet, or a “where” clause in a SQL statement.
Listed below are a few Spark Scala examples on using a filter operation.
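A minimal sketch of the filter operation in the shell, using a hypothetical log file path and a small parallelized collection (names and paths are illustrative only):

// keep only the lines that contain the word "ERROR" (the path is a placeholder)
val logLines = sc.textFile("hdfs://namenode:8020/user/cloudera/app.log")
val errorLines = logLines.filter(line => line.contains("ERROR"))

// keep only the even numbers from a parallelized collection
val numbers = sc.parallelize(1 to 20)
val evens = numbers.filter(n => n % 2 == 0)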
Note 1: To perform a count() action on the filter output and validate the result, type the following at the command line:
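Continuing the hypothetical filter sketch above, the validation might look like this:

errorLines.count()   // number of lines that matched the filter
evens.count()        // returns 10 for the range 1 to 20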