Spark Scala examples: A step-by-step guide – Part I

This post walks through Apache Spark transformation and action operations with step-by-step Spark Scala examples. Before you dive into these examples, make sure you know some of the basic Apache Spark concepts. The examples are in no particular sequence, and this is the first part of our five-part Spark Scala examples post. These examples assume that you have an Apache Hadoop ecosystem set up and some sample data files created. If you do not have Apache Hadoop installed, consider downloading and installing the Cloudera QuickStart VM. It is the simplest and quickest way to get up and running with minimal headaches.

If you are interested in getting familiar with Spark SQL functions, click here.

Spark Context example - How to run Spark

If you are struggling to figure out how to run a Spark Scala program, this section gets straight to the point.

The first step in writing an Apache Spark application (program) is initializing it, which includes setting up the configuration variables and connecting to the cluster. SparkContext is the gateway to accessing Spark functionality.

For beginners, the best and simplest option is to use the Scala shell, which automatically creates a SparkContext. Below are 4 SparkContext examples showing how to connect to and run Spark.

Example 1:

To log in to the Scala shell, type the following at the command line: "/bin/spark-shell".

Example 2: 

To log in and run Spark locally without parallelism (a single worker thread): "/bin/spark-shell --master local".

Example 3:

To log in and run Spark locally in parallel mode, setting the parallelism level to the number of cores on your machine: "/bin/spark-shell --master local[*]".

Example 4:

To log in and connect to YARN in client mode: "/bin/spark-shell --master yarn --deploy-mode client" (on older Spark versions, "--master yarn-client").
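
Outside the shell, a standalone Spark application has to create its own SparkContext. Here is a minimal sketch of what that looks like; the object name, application name, and master setting are placeholders, so adjust them for your environment.

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // Initialize the configuration variables (app name and master are placeholders)
    val conf = new SparkConf()
      .setAppName("spark-scala-examples")
      .setMaster("local[*]") // use "yarn" when submitting to a cluster

    // SparkContext is the gateway to Spark functionality
    val sc = new SparkContext(conf)

    // Quick sanity check: build a small RDD and count it
    println(sc.parallelize(1 to 100).count())

    sc.stop()
  }
}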

Spark Scala Examples on Parallelize

Parallelize is a method used to create an RDD in Apache Spark from data that already exists in memory, such as a Scala collection. It's also a great way to generate an RDD on the fly for testing and debugging code.

The examples listed below cover some of the most frequent ways to create an RDD using the Parallelize method.

Example 1:

    • To create an RDD using the Apache Spark Parallelize method on a sample set of numbers, say 1 through 100.
      • scala > val parSeqRDD = sc.parallelize(1 to 100)

Example 2:

    • To create an RDD from a Scala List using the Parallelize method.
      • scala > val parNumArrayRDD = sc.parallelize(List("pen","laptop","pencil","mouse"))
Note: To view a sample set of data loaded in the RDD, type this at the command line: parNumArrayRDD.take(3).foreach(println)

Example 3:

    • To create an RDD from an Array using the Parallelize method.
      • scala > val parNumArrayRDD = sc.parallelize(Array(1,2,3,4,5))
Note: To count the number of elements created in the RDD, type this at the command line: parNumArrayRDD.count()
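
Note: Parallelize also accepts an optional second argument that sets the number of partitions the data is split into. A quick sketch, using 4 partitions purely as an illustration:
scala > val parPartitionedRDD = sc.parallelize(1 to 100, 4)
scala > parPartitionedRDD.getNumPartitions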

Spark read from HDFS example

Creating an RDD in Apache Spark requires data. In Spark, there are two ways to acquire this data: parallelized collections and external datasets. Data not in an RDD is classified as an external dataset and includes flat files, binary files, sequence files, HDFS files, HBase, Cassandra, or data in almost any other format.

The Spark Scala examples listed below show some additional ways for Spark to read from HDFS.

Example 1:

    • To read a text file named "recent_orders" that exists in HDFS.
      • scala > val ordersRDD = sc.textFile("recent_orders")
 

Example 2:

    • To read a text file named "recent_orders" that exists in HDFS and specify the number of partitions (3 partitions in this case).
      • scala > val ordersRDD = sc.textFile("recent_orders", 3)
 

Example 3:

    • To read all the contents of a directory named "data_files" in HDFS.
      • scala > val dataFilesRDD = sc.wholeTextFiles("data_files")
 

Note: The RDD returned is in a (key, value) pair format, where the key represents the path of each file and the value represents the entire contents of the file.
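
If you only want to see which files were picked up, you can work with the keys of that pair RDD. For example, something like this prints the first few file paths:
scala > dataFilesRDD.keys.take(5).foreach(println)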

 

Spark filter example

A filter is a transformation operation in Apache Spark, which takes an existing dataset, applies a predicate (boolean) function to each element, and returns only the elements for which the predicate returns true. Conceptually, this is similar to applying a column filter in an Excel spreadsheet, or a "where" clause in a SQL statement.

Listed below are a few Spark filter examples.

Prerequisite: Create an RDD with the sample data as shown below.
scala > val sampleColorRDD = sc.parallelize(List("red", "blue", "green", "purple", "blue", "yellow"))

Example 1:

    • To apply a filter on sampleColorRDD and only select the color “blue” from the RDD dataset.
      • scala > val filterBlueRDD = sampleColorRDD.filter(color => color == "blue")

Example 2:

    • To apply a filter on sampleColorRDD and select all colors other than the color “blue” from the RDD dataset.
      • scala > val filterNotBlueRDD = sampleColorRDD.filter(color => color != "blue")

Example 3:

    • To apply a filter on sampleColorRDD and select multiple colors, "red" and "blue", from the RDD dataset.
      • scala > val filterMultipleRDD = sampleColorRDD.filter(color => (color == "blue" || color == "red"))

Note 1: To perform a count() action on the filter output and validate it, type the following at the command line:

scala > filterMultipleRDD.count()

Note 2: Once you get familiar with the basics, you can minimize your code by combining transformation and action operations into a single line, like this:
scala > sampleColorRDD.filter(color => (color == "blue" || color == "red")).count()
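
Keep in mind the filter predicate can be any function that returns a boolean, not just a string comparison. As a quick sketch, the line below builds a small numeric RDD and keeps only the even numbers:
scala > sc.parallelize(1 to 10).filter(num => num % 2 == 0).collect().foreach(println)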

Spark Scala helpful links

Spark Scala Documentation

This is the official Apache Spark documentation for the Scala API.
