CloudFormation Intrinsic functions: These 4 are a must know!


I confess! I looked up what "Intrinsic" means in the dictionary. Merriam-Webster defines intrinsic as "belonging to the essential nature".

I finally settled on "built-in". Intrinsic means built in!

CloudFormation Intrinsic functions are simply built-in functions.

What are Intrinsic functions in AWS CloudFormation?

Intrinsic functions are built-in functions used in a CloudFormation template to assign values to properties that are not available until runtime.

However, intrinsic functions are not supported in all sections of the CloudFormation template. There are 3 places where you can use CloudFormation Intrinsic functions.

  • Conditions section: An optional section of the CloudFormation template, Conditions is used to create stack resources only when certain criteria are met.

  • Resources section: The only required section in a CloudFormation template. It contains all the AWS resources you plan on deploying in your stack.

  • Outputs section: Another optional section, used to pass values to other stack templates or to view resource and stack related information in the AWS Management Console.

List of CloudFormation Intrinsic functions

There are a total of 17 Intrinsic functions, of which 4 are conditional functions. To limit the scope of this post, we will focus on 4 of the most important ones:

    1. CloudFormation Intrinsic function Ref
    2. CloudFormation Intrinsic function If
    3. CloudFormation Intrinsic function Join
    4. CloudFormation Intrinsic function Sub

1. CloudFormation Intrinsic function Ref

Ref, short for Reference, is used extensively in the Resources and Outputs sections of the CloudFormation template.

You can use the Ref intrinsic function in 2 contexts –

  •  Parameter: If the input to the Ref function is a parameter it returns the value of the parameter.

  • Resource name: If the input to the Ref function is the logical name of an AWS resource, it returns the physical name of the resource.


    This can sound a bit confusing, so let me clarify.

    Every resource created in your stack needs to be uniquely identifiable. In your template, you assign each resource a user-friendly name, called a Logical ID.

    This is great, but the Logical ID only exists inside the template. So how do you identify the actual resource once it is created?

    To avoid this confusion, CloudFormation assigns every resource it creates a Physical ID. You then use the logical name in the Ref function to determine the physical name.

The declaration format for Ref Intrinsic function is as shown below.


JSON Format: { "Ref" : "LogicalName" }

 


YAML Format: !Ref LogicalName



Example on using CloudFormation Ref function as a Parameter

In the below example, we create an Amazon Relational Database Service (RDS) instance using a CloudFormation template.

The CloudFormation Ref function is used to pass the username and password values from the Parameters section of the template to the Resources section.

 
JSON Format
{
 "Parameters": {
    "DBUser": {
      "NoEcho": "true",
      "Description" : "The database admin account username",
      "Type": "String",
      "MinLength": "1",
      "MaxLength": "16",
      "AllowedPattern" : "[a-zA-Z][a-zA-Z0-9]*",
      "ConstraintDescription" : "must begin with a letter and contain only alphanumeric characters."
    },

    "DBPassword": {
      "NoEcho": "true",
      "Description" : "The database admin account password",
      "Type": "String",
      "MinLength": "8",
      "MaxLength": "41",
      "AllowedPattern" : "[a-zA-Z0-9]*",
      "ConstraintDescription" : "must contain only alphanumeric characters."
    }
  },

  "Resources" : {
    "myDB" : {
      "Type" : "AWS::RDS::DBInstance",
      "Properties" : {
        "AllocatedStorage" : "100",
        "DBInstanceClass" : "db.t2.small",
        "Engine" : "MySQL",
        "Iops" : "1000",
        "MasterUsername" : { "Ref" : "DBUser" },
        "MasterUserPassword" : { "Ref" : "DBPassword" }
      }
    }
  }
}
 

YAML Format
Parameters:
  DBUser:
    NoEcho: 'true'
    Description: The database admin account username
    Type: String
    MinLength: '1'
    MaxLength: '16'
    AllowedPattern: '[a-zA-Z][a-zA-Z0-9]*'
    ConstraintDescription: must begin with a letter and contain only alphanumeric characters.
  DBPassword:
    NoEcho: 'true'
    Description: The database admin account password
    Type: String
    MinLength: '8'
    MaxLength: '41'
    AllowedPattern: '[a-zA-Z0-9]*'
    ConstraintDescription: must contain only alphanumeric characters.
Resources:
  myDB:
    Type: 'AWS::RDS::DBInstance'
    Properties:
      AllocatedStorage: '100'
      DBInstanceClass: db.t2.small
      Engine: MySQL
      Iops: '1000'
      MasterUsername: !Ref DBUser
      MasterUserPassword: !Ref DBPassword
 

 

Example on using the Ref function as a Resource

This second example installs and deploys WordPress on an EC2 instance.

The CloudFormation Ref function is used as a resource to obtain the Physical ID from the Logical ID for the VPC.

 


JSON Format
{
    "Parameters": {
        "VpcId": {
            "Type": "AWS::EC2::VPC::Id",
            "Description": "VpcId of your existing Virtual Private Cloud (VPC)",
            "ConstraintDescription": "must be the VPC Id of an existing Virtual Private Cloud."
        }
    },
    "Resources": {
        "ALBTargetGroup": {
            "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
            "Properties": {
                "HealthCheckPath": "/wordpress/wp-admin/install.php",
                "HealthCheckIntervalSeconds": 10,
                "HealthCheckTimeoutSeconds": 5,
                "HealthyThresholdCount": 2,
                "Port": 80,
                "Protocol": "HTTP",
                "UnhealthyThresholdCount": 5,
                "VpcId": {
                    "Ref": "VpcId"
                },
                "TargetGroupAttributes": [
                    {
                        "Key": "stickiness.enabled",
                        "Value": "true"
                    },
                    {
                        "Key": "stickiness.type",
                        "Value": "lb_cookie"
                    },
                    {
                        "Key": "stickiness.lb_cookie.duration_seconds",
                        "Value": "30"
                    }
                ]
            }
        }
    }
}
 
 
YAML Format
Parameters:
  VpcId:
    Type: 'AWS::EC2::VPC::Id'
    Description: VpcId of your existing Virtual Private Cloud (VPC)
    ConstraintDescription: must be the VPC Id of an existing Virtual Private Cloud.
Resources:
  ALBTargetGroup:
    Type: 'AWS::ElasticLoadBalancingV2::TargetGroup'
    Properties:
      HealthCheckPath: /wordpress/wp-admin/install.php
      HealthCheckIntervalSeconds: 10
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      Port: 80
      Protocol: HTTP
      UnhealthyThresholdCount: 5
      VpcId: !Ref VpcId
      TargetGroupAttributes:
        - Key: stickiness.enabled
          Value: 'true'
        - Key: stickiness.type
          Value: lb_cookie
        - Key: stickiness.lb_cookie.duration_seconds
          Value: '30'

 

2. CloudFormation Intrinsic function If

Use the Fn::If intrinsic function to create AWS stack resources based on conditions.

This function is the same as an if…else statement. An ideal instance for using the If intrinsic function would be to create different EC2 instance types depending on the environment (dev, test, prd).

However, unlike the other conditional functions, you do not use Fn::If in the Conditions section. You use it in the Resources section and the Outputs section, as shown in the image below. 

[Image: where the CloudFormation Fn::If intrinsic function can be used]


The syntax for Fn::If is as shown below.


JSON format: "Fn::If": [condition_name, value_if_true, value_if_false]
 

YAML format: !If [condition_name, value_if_true, value_if_false]
 

 

Example on using the Fn::If intrinsic function

In this example, we create an Amazon Redshift cluster. The Fn::If intrinsic function is used in the Resources section to determine if it is a multi-node cluster based on the number of nodes.

The below is just an excerpt from the sample template to show how the function works. For the full template, click here.


JSON Format
Resources": {
    "RedshiftCluster": {
      "Type": "AWS::Redshift::Cluster",
      "Properties": {
        "ClusterType": { "Ref": "ClusterType" },
        "NumberOfNodes": { "Fn::If": [ "IsMultiNodeCluster", 
                                { "Ref": "NumberOfNodes" }, { "Ref": "AWS::NoValue" } ] },
        "NodeType": { "Ref": "NodeType" },
        "DBName": { "Ref": "DatabaseName" },
        "MasterUsername": { "Ref": "MasterUsername" },
        "MasterUserPassword": { "Ref": "MasterUserPassword" },
        "ClusterParameterGroupName": { "Ref": "RedshiftClusterParameterGroup" }
      }
  }
}

 
YAML Format
Resources:
  RedshiftCluster:
    Type: 'AWS::Redshift::Cluster'
    Properties:
      ClusterType: !Ref ClusterType
      NumberOfNodes: !If 
        - IsMultiNodeCluster
        - !Ref NumberOfNodes
        - !Ref 'AWS::NoValue'
      NodeType: !Ref NodeType
      DBName: !Ref DatabaseName
      MasterUsername: !Ref MasterUsername
      MasterUserPassword: !Ref MasterUserPassword
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup

3. CloudFormation Intrinsic function Join

The CloudFormation Fn::Join intrinsic function creates a single output value in string format by combining multiple input values.

If you want the values in your output string separated by another value (a delimiter), you can specify it. If you specify an empty string as the delimiter, the values are combined with nothing in between. Yes, not even white space!

The syntax for the Fn::Join intrinsic function is as shown below.


JSON format: {"Fn::Join" : [ "delimiter", [ comma-delimited list of values ] ] }

YAML format: !Join [ delimiter, [ comma-delimited list of values ] ]
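To make the delimiter behavior concrete, here is a small hedged illustration using made-up values (not tied to any template in this post):

YAML Format:
!Join [ "-", [ "dev", "web", "sg" ] ]          # resolves to "dev-web-sg"
!Join [ "", [ "arn:aws:s3:::", "mybucket" ] ]  # empty delimiter, resolves to "arn:aws:s3:::mybucket"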



Example on using Fn::Join

For the Fn::Join function example, we look at a sample template on creating an Amazon RDS instance.

The Fn::Join function is used in the Outputs section of the CloudFormation template to generate the JDBC Connection string. For the complete template, click here.

JSON Format

"Outputs" : {
    "EC2Platform" : {
      "Description" : "Platform in which this stack is deployed",
      "Value" : { "Fn::If" : [ "Is-EC2-VPC", "EC2-VPC", "EC2-Classic" ]}
    },

    "MasterJDBCConnectionString": {
      "Description" : "JDBC connection string for the master database",
      "Value" : { "Fn::Join": [ "", [ "jdbc:mysql://",
                 { "Fn::GetAtt": [ "MasterDB", "Endpoint.Address" ] },
                    ":",
                 { "Fn::GetAtt": [ "MasterDB", "Endpoint.Port" ] },
                     "/",
                 { "Ref": "DBName" }]]}
    }
}
 
YAML Format

Outputs:
  EC2Platform:
    Description: Platform in which this stack is deployed
    Value: !If
      - Is-EC2-VPC
      - EC2-VPC
      - EC2-Classic
  MasterJDBCConnectionString:
    Description: JDBC connection string for the master database
    Value: !Join
      - ''
      - - 'jdbc:mysql://'
        - !GetAtt
          - MasterDB
          - Endpoint.Address
        - ':'
        - !GetAtt
          - MasterDB
          - Endpoint.Port
        - /
        - !Ref DBName

4. CloudFormation Intrinsic function Sub

The CloudFormation intrinsic function Sub substitutes variables in an input string with the values you specify. The syntax for Fn::Sub is as shown below.


JSON format
{"Fn::Sub" : [ String, { Var1Name: Var1Value, Var2Name: Var2Value } ] }

YAML format
 !Sub
       - String
       - Var1Name: Var1Value
         Var2Name: Var2Value

Pay close attention to the following points.

  • You can define as many variable names as needed (Var1Name, Var2Name…….Var(n)Name).

  • The variable value (VarValue) is specified after the colon.

  • Every variable name and value combination should be specified as a key-value map.

  • The input string denoted as “string” normally contains the variable (VarName) you need to replace.

  • You have to specify the variable name in your string as “${VarName}” for both JSON and YAML formatted templates.

  • Do not define variables for template parameter names, resource logical IDs, or resource attributes. They are already defined for you and can be referenced directly inside ${}.


Example1: Intrinsic function Fn::Sub with variable map

The Sub intrinsic function can be tricky to understand. So, let’s look at a simple example first.


JSON Format
{ "Fn::Sub": [ "${Fruit1} and ${Fruit2}", { "Fruit1": "Apples", "Fruit2": "Oranges" } ] }

 


YAML Format
!Sub
  - '${Fruit1} and ${Fruit2}'
  - Fruit1: Apples
    Fruit2: Oranges


In this example, we have a variable map with 2 variables – Fruit1 and Fruit2. The respective variable values are Apples and Oranges.

The input string generates the string value "Apples and Oranges".

This example was purely for illustration. For the most part you will use the Sub intrinsic function to substitute template parameter names, resource attributes and resource IDs.



Example 2: Intrinsic function Sub without variable mapping

In this final example on Fn::Sub, we create a file named /tmp/setup.mysql which contains the DDL to create a MySQL database and a user with permissions to the database.

DBName, DBUsername and DBPassword are template parameters defined in the Parameters section of the template.


JSON Format
{ "files": { "/tmp/setup.mysql": { "content": { "Fn::Sub": "CREATE DATABASE ${DBName}; \nCREATE USER '${DBUsername}'@'localhost' IDENTIFIED BY '${DBPassword}'; \nGRANT ALL ON ${DBName}.* TO '${DBUsername}'@'localhost'; \nFLUSH PRIVILEGES;\n" }, "mode": "000644", "owner": "root", "group": "root" } } }

 

 
YAML Format
files:
  /tmp/setup.mysql:
    content: !Sub |
      CREATE DATABASE ${DBName}; 
      CREATE USER '${DBUsername}'@'localhost' IDENTIFIED BY '${DBPassword}'; 
      GRANT ALL ON ${DBName}.* TO '${DBUsername}'@'localhost'; 
      FLUSH PRIVILEGES;
    mode: '000644'
    owner: root
    group: root
 

Summing things up

Sometimes the big picture is overwhelming!

So, like I always say, start small! Focus on mastering these 4 CloudFormation Intrinsic functions. As you get better, your confidence will improve. You can then move on to mastering them all.

A few helpful hints and pointers:

  • Try playing around in AWS CloudFormation Designer. It is a developer friendly graphics-based tool to create CloudFormation templates. You can drag and drop AWS resources and modify their properties to create stacks. The best part is you can view the JSON/YAML code being generated in real time. 

  • Understand Pseudo parameters before diving into Intrinsic functions. Pseudo parameters are predefined by AWS and are heavily used in combination with CloudFormation Intrinsic functions. 

What is Apache Spark – explained using mind maps


If you are a beginner to Apache Spark, this post on the fundamentals is for you. Mastering the basics of Apache Spark will help you build a strong foundation before you get to the more complex concepts.

Often, these concepts are mixed with new terminology. Associating these terms through relationships, in a hierarchical way, helps you understand the information effectively and in a shorter time frame. That's the concept behind mind maps, which is used in this post.

A mind map is a technique used to organize information visually, dramatically increasing the brain's ability to retain information. For more information on mind mapping, read this.


What is Apache Spark?


Apache Spark is a data processing engine. You can use Spark to process large volumes of structured and unstructured data for Analytics, Data Science, Data Engineering and Machine Learning initiatives. 

However, it’s important to note that Apache Spark is not a database or a distributed file system. In other words, it is not meant to store your data. Rather, Spark provides the raw processing power to crunch data and extract meaningful information. Spark provides 4 modules for this –

  • GraphX: Used to process complex relationships between data using Graph theory concepts of Nodes, Edges and Properties.

  • MLlib: An extensive Machine Learning library and tools to build smarter apps and prediction systems.

  • Spark Streaming: For processing real-time data for analytics.

  • Spark SQL: Build interactive queries for batch processing on structured data.


Apache Spark is not limited to a single programming language. You have the flexibility to use Java, Python, Scala, R or SQL to build your programs. Some minor limitations do apply. 

If you are just starting off and do not have any programming experience, I highly recommend starting off with SQL and Scala. They are the 2 most popular and in demand programming languages for Apache Spark!

What is Apache Spark RDD?

RDD in Apache Spark stands for Resilient Distributed Dataset. It is nothing more than your data file converted into an internal Spark format. Once converted, this data is partitioned and spread across multiple computers (nodes).

The internal format stores the data as a collection of lines. This collection can be a List, Tuple, Map or Dictionary depending on the programming language you choose (Scala, Java, Python).

Resilient Distributed Dataset (RDD) does sound intimidating. So, keep things simple. Just think of it as data in Apache Spark internal format.

Moving beyond the theory, there are two main concepts you need to grasp for RDDs.

  • First, how do I convert external data into Spark format (creating an RDD)?

  • Second, now that I have created an RDD, how do I operate on it?


Let’s use the Apache Spark mind map below to visualize this process.

[Mind map: Apache Spark RDD]

Creating Spark RDDs from External Datasets

External Datasets in Apache Spark denotes the source data stored externally. This data needs to be read in to create an RDD in Apache Spark. There are 2 key aspects when it comes to external data sets: location and file format.

  • Location: This is where your data is located. Spark can get to your source data from the below sources.
    • Amazon S3
    • Cassandra
    • HBase
    • Hadoop Distributed File System (HDFS)
    • Local File System


  • File Format: Your file format defines the structure of the data in your file. Apache Spark can read source files in the below formats.
    • Avro
    • Optimized Row Columnar (ORC)
    • Parquet
    • Record Columnar File (RCFILE)
    • Sequence files 
    • Text files 
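
Putting location and file format together, here is a minimal Scala sketch of creating an RDD from an external dataset. The S3 bucket and file name are hypothetical, and an existing SparkContext named sc is assumed.

// Read a text file from Amazon S3 into an RDD[String]
val orders = sc.textFile("s3a://my-bucket/orders.txt")

// Each element of the RDD is one line of the file
println(orders.count())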

 

Benefit of RDD Persistence in Spark

If an RDD needs to be used more than once, you can choose to save it to disk or memory. This concept is called RDD Persistence.

The benefit of RDD Persistence is that you do not have to recreate distributed datasets from external datasets every time. In addition, persisting datasets in memory or disk saves processing time down the road.
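
As a rough sketch, and assuming the orders RDD from the previous example, persisting an RDD in Scala looks like this:

import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory, spilling to disk if it does not fit
orders.persist(StorageLevel.MEMORY_AND_DISK)

// Alternatively, orders.cache() is shorthand for memory-only persistence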

Reusing RDDs using Parallelized Collections

As I mentioned earlier, an RDD is any source data expressed as a collection. Persistence allows you to save this data so you can reuse it. 

So, how do you create an RDD from a collection you already have in your program?

You create a Parallelized Collection. A Parallelized Collection creates a new RDD by copying over the elements from an existing collection (such as a Scala List or Seq) in your driver program. This creates a new distributed dataset. 

In many ways, this is similar to creating a table from another table in a database using a CREATE TABLE AS (CTAS) statement.
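
A minimal Scala sketch of a parallelized collection, using a made-up list of values and the same assumed SparkContext named sc:

// A local collection in the driver program
val fruits = List("apple", "orange", "mango")

// Distribute the collection across the cluster as an RDD
val fruitsRDD = sc.parallelize(fruits)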

Performing Operations on RDDs

Now that you know how to create an RDD and reuse an RDD, let’s look at performing operations on datasets.

Operations are what you do with your collection, or the elements in your collection, to achieve an end result. The main types of RDD operations you can perform in Apache Spark are listed below.

  • Actions: An action operation returns a value based on computation performed on your RDD. A perfect example would be performing a count on the number of elements in a collection. For a list of Apache Spark actions, click here.


  • Transformations: A transformation operation on the other hand executes a function on an RDD and returns the results as a new RDD.  Union, Filter and Map are some common examples of transformation functions. For a detailed list of Apache Spark transformations, click here.


  • Key-Value Pairs: Key-Value pairs are data structures containing 2 elements. The first element is the name, and the second the value. To handle this variation, there are certain RDD operations in Scala (PairRDDFunctions) and Java (JavaPairRDD), specifically designed for Key-Value Pairs. 
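
Here is a short hedged Scala sketch tying these together, again assuming a SparkContext named sc:

val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformation: returns a new RDD containing only the even numbers
val evens = numbers.filter(_ % 2 == 0)

// Action: computes and returns a value (2)
println(evens.count())

// Key-value pairs: map each even number to a (number, square) tuple
val pairs = evens.map(n => (n, n * n))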

Your takeaway from this post

Yes, everything about Apache Spark sounds intimidating! The acronyms, the keywords, the language!

But don’t let that get to you. Approach Spark the same way you would eat an elephant- “One small bite at a time.” 

Your takeaway from this post should be – 

  • An understanding of what Apache Spark is.

  • The different features of Apache Spark.

  • The importance of RDD and how to create one.

  • The different types of Operations you can perform on an RDD (Actions, Transformations).
Spark SQL links
Programming Guide

The official Apache Spark v2.3.2 Spark SQL Programming Guide with everything you need to know in a single place.

Hive Tutorial

Learn how Apache Hive fits into the Hadoop ecosystem with this Hive Tutorial for Beginners on guru99.com.

CloudFormation Pseudo parameters – 8 parameters that pack a punch


CloudFormation Pseudo parameters are used in CloudFormation templates.

These parameters are predefined in AWS and their values are automatically set when a stack is created. You do not have to declare or initialize them. All you do is use them.

But why use Pseudo parameters in CloudFormation? What, if any, is the actual benefit?

What are Pseudo Parameters in CloudFormation?

Pseudo parameters are named variables in a CloudFormation template, whose values are automatically set by AWS. In other words, you do not have to provide an input value. It gets derived from your environmental attributes.

There are a total of 8 CloudFormation Pseudo parameters. Of them, AWS::Region and AWS::StackId are the most frequently used Pseudo Parameters.

Let’s visualize an example on Pseudo Parameters. Picture the breakfast items at a fast-food restaurant. They are preset items and limited in options. You pick from the available options. They do not change on an hourly basis or daily basis.

An important point about Pseudo parameters – they are not the same as dynamic references in CloudFormation. This can get confusing, so let me clarify.

Dynamic references are limited to parameters in the AWS Systems Manager Parameter Store (ssm, ssm-secure) and AWS Secrets Manager (secretsmanager). Yes, they are dynamically populated, but they are not the same as Pseudo parameters.
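
For comparison, a dynamic reference uses the {{resolve:...}} syntax rather than Ref. A hedged illustration with hypothetical parameter and secret names:

YAML Format:
DBPassword: '{{resolve:ssm-secure:MyDBPasswordParameter:1}}'
MasterUserPassword: '{{resolve:secretsmanager:MyRDSSecret:SecretString:password}}'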

 

 

1. AWS::AccountId

An AWS Account ID is a 12-digit number that uniquely identifies a user's AWS account. When you create a stack using CloudFormation, the Account ID is part of the Amazon Resource Name (ARN).

ARNs uniquely identify every resource deployed in AWS. You use this pseudo parameter when you have multiple AWS accounts or want to deploy a stack in a specific account.


JSON Format:
{ "Ref" : "AWS::AccountID" }


YAML Format:
!Ref "AWS::AccountID"


2. AWS::NotificationARNs

The pseudo parameter AWS::NotificationARNs returns the list of Amazon Simple Notification Service (SNS) topic ARNs to which your stack related events are published. 

Ideally, you use AWS::NotificationARNs with the Fn::Select intrinsic function.


JSON Format: { "Ref" : "AWS::NotificationARNs" }

YAML Format:
!Ref "AWS::NotificationARNs"



3. AWS::NoValue

The CloudFormation pseudo parameter AWS::NoValue is used to tell CloudFormation to "do nothing". The pseudo parameter does not return a value, thus ending the execution of that branch of logic.

AWS::NoValue is used in combination with the Fn::If condition function. Fn::If is basically the same as an if...else statement.


JSON Format:
{"Ref" : "AWS::NoValue"}

YAML Format:
!Ref "AWS::NoValue"



4. AWS::Region

An AWS Region represents data centers that are physically isolated in different geographic areas.

Currently, Amazon has 26 geographic Regions across North America, South America, Asia Pacific, China, Europe, the Middle East and South Africa.


JSON Format:
{ "Ref" : "AWS::Region" }


YAML Format:
!Ref "AWS::Region"

 

5. AWS::Partition

An AWS Partition is made up of one or more Regions. This is an important distinction to keep in mind.

You can use the AWS::Partition pseudo parameter to determine the Regions and the Services available within those Regions. AWS currently has 3 valid partitions.

  • Public Partition: Identified as "aws".
  • AWS GovCloud: Limited to US Regions, this partition is meant for secure cloud solutions. It is identified as "aws-us-gov".
  • AWS China: Identified as "aws-cn".


JSON Format:
{ "Ref" : "AWS::Partition" }


YAML Format:
!Ref "AWS::Partition"



6. AWS::StackName

A Stack is a grouping of AWS resources, treated as a single unit and deployed for a specific purpose. Creating a LAMP stack on an EC2 instance with a database or deploying SharePoint on a Microsoft Windows Server are some examples of stacks.

When you create a stack in CloudFormation, you have to assign it a name that is unique within the Region you are creating it in. Once you assign the stack a name, any future references can use the AWS::StackName pseudo parameter.


JSON Format:
{ "Ref" : "AWS::StackName" }


YAML Format:
!Ref "AWS::StackName"

 

7. AWS::StackId

A Stack ID is a unique identifier assigned to a stack. This is not the same as a Stack Name. 

A Stack Name, if you recall from above, is the name you assign to a stack on creation. Confused? Think of a Stack ID as an employee ID, and a Stack Name as an employee name.

Need a second example? A Stack ID is similar to the unique numeric code you find on an apple at a grocery store, while a Stack Name is the name of the fruit (Fuji apple).

So, when should Stack IDs be used vs Stack Names? 

You can use Stack IDs or Stack Names when you work with a running stack. However, once a stack is deleted, you have to use its Stack ID to refer to it, since stack names can be reused.

Best practice: stick with AWS::StackId.


JSON Format:
{ "Ref" : "AWS::StackId" }


YAML Format:
!Ref "AWS::StackId"

 

8. AWS::URLSuffix

The CloudFormation pseudo parameter AWS::URLSuffix returns the region-specific domain suffix for your environment. Currently, there are 2 region-specific URL suffixes.

  • amazonaws.com – for all regions excluding China and AWS GovCloud (US).
  • amazonaws.com.cn – for China.


JSON Format:
{ "Ref" : "AWS::URLSuffix" }


YAML Format:
!Ref "AWS::URLSuffix"

1. Example for CloudFormation Pseudo Parameter AWS::NoValue

The below example is on creating a basic Amazon Redshift cluster.

The AWS::NoValue pseudo parameter is used in the Resources section of the CloudFormation template. It returns NULL, or "does nothing", if the cluster type is single-node.

For the complete Redshift Cluster sample template, click here.


"Resources": {
"RedshiftCluster": {
"Type": "AWS::Redshift::Cluster",
"Properties": {
"ClusterType": { "Ref": "ClusterType" },
"NumberOfNodes": { "Fn::If": [ "IsMultiNodeCluster", { "Ref": "NumberOfNodes" },
{ "Ref": "AWS::NoValue" } ] },
"NodeType": { "Ref": "NodeType" },
"DBName": { "Ref": "DatabaseName" },
"MasterUsername": { "Ref": "MasterUsername" },
"MasterUserPassword": { "Ref": "MasterUserPassword" },
"ClusterParameterGroupName": { "Ref": "RedshiftClusterParameterGroup" }
}

2. Example for CloudFormation Pseudo Parameter AWS::Region

The second example is on creating an Amazon EC2 instance with an Elastic IP address. 

Again, the AWS::Region pseudo parameter is used in the Resources section to determine the Image ID for the EC2 instance.  Image ID, short for Amazon Machine Image (AMI) ID is a package of Operating System, Software and Configuration details used to launch your instance.

For the complete Amazon EC2 sample template, click here.


"Resources" : {
"EC2Instance" : {
"Type" : "AWS::EC2::Instance",
"Properties" : {
"UserData" : { "Fn::Base64" : { "Fn::Join" : [ "", [ "IPAddress=", {"Ref" : "IPAddress"}]]}},
"InstanceType" : { "Ref" : "InstanceType" },
"SecurityGroups" : [ { "Ref" : "InstanceSecurityGroup" } ],
"KeyName" : { "Ref" : "KeyName" },
"ImageId" : { "Fn::FindInMap" : [ "AWSRegionArch2AMI", { "Ref" : "AWS::Region" },
{ "Fn::FindInMap" : [ "AWSInstanceType2Arch", { "Ref" : "InstanceType" }, "Arch" ] } ] } }
}
}
}

3. Example using Pseudo Parameters AccountID, StackName

This 3rd example is on Creating a website hosted on Amazon S3. 

Pseudo parameters AWS::AccountId, AWS::StackName and AWS::Region are concatenated using the intrinsic function Fn::Join to create a domain name string for the Name property.

For the full sample template in JSON format, click here.


"Resources" : {
"WebsiteDNSName" : {
"Type" : "AWS::Route53::RecordSet",
"Properties" : {
"HostedZoneName" : { "Fn::Join" : [ "", [{ "Ref" : "HostedZone" }, "."]]},
"Comment" : "CNAME redirect custom name to CloudFront distribution",
"Name" : { "Fn::Join" : [ "", [{"Ref" : "AWS::StackName"}, {"Ref" : "AWS::AccountId"}, ".",
{"Ref" : "AWS::Region"}, ".", { "Ref" : "HostedZone" }]]},
"Type" : "CNAME",
"TTL" : "900",
"ResourceRecords" : [{ "Fn::Join" : [ "", ["http://", {"Fn::GetAtt" :
["WebsiteCDN", "DomainName"]} ]]}]
}
}
}

 

4. Example using CloudFormation Pseudo Parameters StackID

In this 4th and final example, we will look at creating a Virtual Private Cloud (VPC) with a single instance of EC2. The AWS::StackID pseudo parameter is used to tag the VPC.

As usual, our focus will be on the Resources section of the CloudFormation template. 

A complete copy of the sample template can be found here.

 


"Resources" : {
"VPC" : {
"Type" : "AWS::EC2::VPC",
"Properties" : {
"CidrBlock" : "10.0.0.0/16",
"Tags" : [ {"Key" : "Application", "Value" : { "Ref" : "AWS::StackId"} } ]
}
}
}

Wrapping things up

CloudFormation Pseudo Parameters are not complicated. From personal experience, I highly recommend using them instead of hard coding values. Make it a best practice! Your peers will love you for it.

Keep the below key points in mind when using Pseudo Parameters.

  • AWS Parameters are not the same as AWS Pseudo Parameters.

  • Pseudo Parameters are not the same as dynamic References.

  • AWS::Region, AWS::StackId and AWS::StackName are the most frequently used CloudFormation Pseudo Parameters.

  • 90% of the time, the Resources section in your CloudFormation template is where you will end up using pseudo parameters.

Redshift materialized views: The good, the bad and the ugly

Redshift materialized views simplify complex queries across multiple tables with large amounts of data. The result is significant performance improvement!

 

What are materialized views?

To derive information from data, we need to analyze it. We do this by writing SQL against database tables. Sometimes this might require joining multiple tables, aggregating data and using complex SQL functions. 

If this task needs to be repeated, you save the SQL script and execute it, or you may even create a SQL view. A view, by the way, is nothing more than a stored SQL query you execute as frequently as needed.

However, a view does not generate output data until it is executed. In other words, if a complex SQL query takes forever to run, a view based on the same SQL will do the same. This is where materialized views come in handy.

When a materialized view is created, the underlying SQL query gets executed right away and the output data is stored. So, when you call the materialized view, all it's doing is extracting data from the stored results.

Think of a materialized view as the best of a table (data storage) and a view (stored SQL query).

Redshift materialized views save us the most expensive resource of all: time.

Best features of Redshift materialized views

Materialized views in Redshift have some noteworthy features. Let’s take a look at a few.

  • More than just tables: Do you have files in AWS S3 you would like to reference? Or maybe you already have a materialized view and need a new one with some additional data?
    In Redshift, you can create a materialized view that refers to data in external tables (AWS S3) and even define one in terms of an existing view.

  • Materialized view on materialized view: Redshift lets you create materialized views based on materialized views you already created. This is similar to reading data from a table and helps avoid duplicating expensive table joins and aggregations.

  • Adding columns: There are several DDL (Data Definition Language) limitations around materialized views. However, one bright spot: you can add columns to the underlying base tables with zero impact on existing materialized views.

  • Automatic query rewriting: For me this is an exciting feature! Redshift automatically rewrites your SQL query to use a materialized view (if a suitable one exists) even if you do not explicitly reference it, thereby improving performance.

  • Incremental refresh: With certain limitations, Redshift lets you perform an incremental refresh (vs a full refresh) on a materialized view. This helps save time.

Redshift materialized view limitations

Redshift materialized views are not without limitations. Let’s take a look at the common ones.

  • Stale data: The data in a materialized view is a point in time snapshot. Any changes to the underlying data will not be reflected unless the materialized view is refreshed.


  • Redshift Create materialized view limitations: You cannot use or refer to the below objects or clauses when creating a materialized view
    • Auto refresh when using mutable functions or reading data from external tables.
    • Late binding or circular reference to tables.
    • Leader node-only functions such as CURRENT_SCHEMA, CURRENT_SCHEMAS, HAS_DATABASE_PRIVILEGE, HAS_SCHEMA_PRIVILEGE, HAS_TABLE_PRIVILEGE.
    • ORDER BY, LIMIT and OFFSET clauses.
    • System Tables.
    • User defined functions.
    • Views.

  • There is no CREATE or REPLACE materialized view Redshift statement. You have to drop the materialized view using DROP MATERIALIZED VIEW ddl first. Then re-create the Redshift materialized view using a CREATE MATERIALIZED VIEW statement.

  • Automatic query rewriting limitations: Query rewriting will not work if your materialized view has the below conditions/functions.
    • Aggregate functions other than SUM, COUNT, MIN, and MAX.
    • CREATE TABLE AS statements.
    • DISTINCT clause.
    • External tables.
    • HAVING clause.
    • LEFT, RIGHT and FULL outer joins.
    • Materialized views referencing other materialized views.
    • References to system tables and catalogs.
    • SELECT INTO statements.
    • Set operations (UNION, INTERSECT, and EXCEPT).
    • Subqueries.
    • Window functions.


  • Auto refresh limitations: If you recall, auto refresh has 2 modes: incremental and full. The only limitation on a full materialized view refresh is – no external tables allowed.

    Incremental refresh on the other hand has more than a few. I have them listed below.

    • Aggregate functions AVG, MEDIAN, PERCENTILE_CONT, LISTAGG, STDDEV_SAMP, STDDEV_POP, APPROXIMATE COUNT, APPROXIMATE PERCENTILE, and bitwise aggregate functions are not allowed.
    • DISTINCT clause.
    • External tables.
    • LEFT, RIGHT and FULL outer joins.
    • Mutable functions – date-time functions, RANDOM and non-STABLE user-defined functions
    • Set operations (UNION, INTERSECT, EXCEPT and MINUS).
    • Temporary tables used for query optimization.
    • Subqueries not part of the FROM clause.
    • Window functions.

6 Best practices for Redshift materialized views

Now that we have a feel for the limitations on materialized views, let’s look at 6 best practices when using them.

  1. Ensure you have SELECT privileges on the underlying tables and schema, as well as permissions to CREATE, ALTER, REFRESH and DROP materialized views.

  2. Do not perform the below actions on a materialized view. If you really need to, then drop and recreate it.

    • Renaming a materialized view.

    • Change the data type of a column.

    • Change the schema name to which your tables belong.

    • Alter the underlying SQL statement.

  3. Make sure to refresh all dependent materialized views individually prior to refreshing your main view.  Query system table STV_MV_DEPS for information on materialized view dependencies.

  4. Use STL_EXPLAIN to determine if automatic query rewriting is being used for your query.

  5. Use SVL_MV_REFRESH_STATUS to check the materialized view refresh status as below.

       
    Select * from SVL_MV_REFRESH_STATUS;


    This is an extremely helpful view, so get familiar with it. At a minimum check for the 5 listed details in the SVL_MV_REFRESH_STATUS view.

    1. Who performed the last refresh?

    2. When was the refresh kicked off?

    3. The type of refresh performed (Manual vs Auto).

    4. Status of the refresh (Successful vs Partial vs Failed vs Aborted).

    5. What changes were made during the refresh (Schema vs Table vs Column).


  6. Prefix or suffix the materialized view name with “mv_” or “_mv” based on your accepted naming convention.

Redshift Create materialized view basics

The Redshift CREATE MATERIALIZED VIEW statement creates the view based on an AS SELECT statement. This is very similar to a standard CTAS statement.

A major benefit of this SELECT statement: you can combine fields from as many Redshift tables or external tables as needed using the SQL JOIN clause.

Let’s look at how to create one. Instead of the traditional approach, I have two examples listed. The first with defaults and the second with parameters set.

It’s a lot simpler to understand this way.

In this first example we create a materialized view based on a single Redshift table. The default values for backup, distribution style and auto refresh are shown below. Note, you do not have to explicitly state the defaults. They are implied.


Example 1: Redshift create materialized view using DEFAULTS.

CREATE MATERIALIZED VIEW mv_new_address
AS SELECT * from addresses where address_updated = 'Y';

Defaults implied: BACKUP: YES, DISTRIBUTION STYLE: EVEN, AUTO REFRESH: NO


In this second example we create the same materialized view but specify the parameter values based on our needs.

The values used in this example are meant to clarify the syntax and usage of these parameters. Be sure to determine your optimal parameter values based on your application needs. 

 


Example 2: Redshift create materialized view with user defined parameter values.

CREATE MATERIALIZED VIEW mv_new_address
BACKUP NO
DISTSTYLE KEY
DISTKEY (zipcode)
SORTKEY AUTO
AUTO REFRESH YES
AS SELECT * from addresses where address_updated ='Y';

Trending Questions on Redshift materialized views

1. Are materialized views automatically refreshed in Redshift?

When you create a materialized view, you must set the AUTO REFRESH parameter to YES. If this feature is not set, your view will not be refreshed automatically.

In case you forgot or chose not to initially, use an ALTER command to turn on auto refresh at any time.
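
As a minimal sketch, assuming the mv_new_address view created earlier in this post:

ALTER MATERIALIZED VIEW mv_new_address AUTO REFRESH YES;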


2. What is incremental refresh vs full refresh in Redshift?

Both terms apply to refreshing the underlying data used in a materialized view.

In an incremental refresh, the changes to the data since the last refresh are determined and applied to the materialized view. On the other hand, in a full refresh the SELECT clause in the view is executed and the entire data set is replaced. 
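
In both cases, a manual refresh is kicked off with the same statement; Redshift performs an incremental refresh when the view qualifies for it, and a full refresh otherwise. A sketch using the view from the earlier examples:

REFRESH MATERIALIZED VIEW mv_new_address;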

Conclusion

In summary, Redshift materialized views do save development and execution time. However, it’s important to know how and when to use them. Make sure you really understand the below key areas –

  • Auto refresh vs manual refresh.

  • Automatic query rewriting and its limitations.

  • Materialized view on materialized view dependencies.

  • Don't overthink it. You may not be able to remember all the minor details. It's okay. Practice makes perfect!


Need to Create tables in Redshift?

We have a post on Creating Redshift tables with examples, 10 ways. Most developers find it helpful.

Redshift Coalesce: What you need to know to use it correctly


Coalesce means to combine!

Redshift Coalesce is a conditional expression which returns the first non-null value from multiple input values. 

The beauty of coalesce is, there is no limit to the number of values you can input. It’s a complex sounding word, but one of the most helpful expressions in Redshift.

The syntax for the coalesce function is as below.

coalesce (expression1, expression2, expression3.......)

If you pay close attention, the coalesce function is pretty much an abbreviated version of an if/else statement. The above statement would break down as 

 

if expression1 is null then set to expression2 else 
if expression2 is null then set to expression3........

 
So, when should we use the Coalesce expression?

 

  • Incomplete source data – Scenarios where the data is not always populated. In such cases you may need to add a secondary logic to make the data complete.
  • Derived field – Certain business rules might require field data to be derived based on multiple source fields.
  • Nulls not allowed in target – This is one of the most common use cases for using coalesce. 


As a matter of personal preference, if you don’t like the sound of “coalesce” or think it’s too much to remember, don’t worry. You can use the NVL expression. COALESCE and NVL do the exact same thing.
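
For example, the following two statements return the same value, 'in stock' (the literals are just for illustration):

SELECT coalesce(null, null, 'in stock');

SELECT nvl(null, null, 'in stock');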

 

For now, let’s look at 3 nuances of the coalesce function – data types, null values and expression priority. Mastering these will have you using coalesce much more effectively.

Data type impact on Redshift Coalesce

The coalesce expression is data type independent. This means you can use this expression on String, Date and Numerical values as long as all the inputs are of the same type.  The last part needs some clarification.

For example, you cannot have one input field (expression1) of type date, and another one (expression2) of type number. All source data fields in the coalesce function should have the same data type.

To keep things simple, I will be referring to CHAR/VARCHAR data types, combinedly as string(s).

Tip #1: The target field data type is the driver – When using the Redshift Coalesce function, make sure all the source data types are the same as the target data type. If they are not, then use a function to convert them to the target data type.

Now let's look at a few examples. Take a look at the table below and see if you come to the same conclusion.

[Table image: Redshift coalesce data type examples]
  1. The target data type is DATE. Expression1 and 3 are of type DATE. Expression2 has a value of null. This combination works.

     

  2. The target data type is NUMERIC. Expression1 works, and expression2 is okay since it's a NULL value. However, expression3 is a date type. This will not work.

     

  3. The target data type for #3 is a VARCHAR. All 3 expressions are in string format and are characters. So, this works.

  4. This example has a target field of type NUMERIC. Expression1 and expression2 looks great. However, expression3 is a number in string format. This combination will not work. But if you were to exclude the quotes around expression3, we have a winner! 


    Related: 10 Redshift Create Table Examples to make you an expert 

 


Be careful of NULL values when using Coalesce

The Redshift Database Developer Guide defines a null as “If a column in a row is missing, unknown, or not applicable, it is a null value or is said to contain null”.

Put simply, a null value in a database denotes the absence of data. DATE and NUMERIC data types are the cleanest when it comes to determining null/not null cases. You either have a valid value or you have a null value. The complexity arises when the data is stored in a VARCHAR/CHAR data type field. 

A space or blank in a CHAR/VARCHAR field is not the same as null value. Space might not be the value you are looking for, but that does not mean it is a null value.

Likewise, a zero in a numeric field is not the same as a null value.

An incorrect date like 30 February 2022 might be okay in a CHAR/VARCHAR field but will be a null value in a DATE data type field.

In some instances, we may even have junk/garbage characters in a string field.

How, then do we make sure a source field is actually a pure null?

A clever move would be to write a SELECT with a TRIM function and an IS NULL in sql as shown below.

Tip#2: Determine if the source field contains null value(s) using IS NULL in SQL.
     SELECT * FROM orders where TRIM(delivery_status) IS NULL;

Alternatively, if you suspect the data has junk/garbage characters, you can add the REGEXP_REPLACE function to the above IS NULL SQL statement as below. 

The function as written replaces all occurrences of the numbers '0-9', uppercase letters 'A-Z' and lowercase letters 'a-z' with an empty string. If a junk/garbage character exists in the field, the SELECT statement returns a 1.

Tip#3: Use REGEXP_REPLACE and NOT NULL to determine if a field contains junk/garbage characters.

SELECT 1 FROM orders where
REGEXP_REPLACE(delivery_status, '[A-Za-z0-9]*','') IS NOT NULL;

Field order in Redshift Coalesce function matters!

Moving on to the last and final topic on Coalesce – expressions. 

The source fields you choose to be expression1, expression2, expression3 and so on make a significant difference in the value returned by the Redshift Coalesce function. A technique to ensure you do not end up with incorrect values is to prioritize your expression selection. 

How do you do that?

  • Let the target field requirement be the driver.

  • Use fields in the order of decreasing data quality, i.e., highest to lowest data quality.


Just like you send in the best players first in football, make sure the field with the best data quality is used first. The image below depicts the order in which it is best to select fields in a Coalesce expression.

[Image: order of expression selection in Coalesce, from highest to lowest data quality]

Redshift Coalesce examples

Let's look at a few Redshift Coalesce examples using date, string and numeric data. If you would like to see more variations on coalesce examples, send us an email at info@obstkel.com.

1. Redshift Coalesce in SQL example using dates

In this first example, let's look at using coalesce on 3 input date columns. Remember, in this context column and expression mean the same thing.

  • expression1   order_delivered, a column in a table of type DATE.
  • expression2 – expected_delivery, a column of type VARCHAR with the date stored in format YYYYMMDD.
  • expression3current_date function.
SELECT coalesce ( order_delivered, to_date(expected_delivery,'YYYYMMDD'), current_date) FROM orders;

2. Redshift Coalesce in SQL example using strings

In this second example, let's look at using coalesce on 3 CHAR/VARCHAR columns. The goal is to determine the email address using a combination of existing data and deriving it if necessary.

  • expression1   email_address, a column in a table of type VARCHAR used to store primary email.
  • expression2 – email_address2, a column of type VARCHAR used to store secondary email addresses.
  • expression3 – This is a concatenation of 2 fields, first_name and last_name. Both fields are stored in the customers table as type VARCHAR.
SELECT coalesce ( email_address, trim(email_address2), last_name || first_name || '@obstkel.com') FROM customers;

3. Redshift Coalesce example using numeric data

In this third example, let's look at using coalesce on 3 numeric columns. The goal is to determine the price an item was sold at.

  • expression1 – sale_price, the price at which an item was sold if it was part of a sale. This field is stored as a decimal type.
  • expression2 – recommended_sale_price. This is the potential price at which an item should be sold, should it be part of a sale. This price is stored as a VARCHAR.
  • expression3   current_price, a column in a table of type DECIMAL used to store the current price of an item.
SELECT coalesce ( sale_price, cast(recommended_sale_price as decimal), current_price) FROM items;

In case you are wondering why the columns were listed in that order, go back and review the earlier section on why field order in the Coalesce function matters. 

For this specific example, we are trying to determine the price at which an item was sold.

The sale_price has the highest priority in this context and becomes the first expression. If the item was not listed for sale up to that point, sale_price will be null. Our next best bet is to use the recommended_sale_price, which is stored as a string. You can use either the TO_NUMBER function or the CAST function.

In this case I chose the CAST function to convert the recommended sale price to a decimal. If both the sale_price and recommended_sale_price are null, then we set the item price to the current_price.

Conclusion

Redshift Coalesce is a powerful and straightforward conditional expression. Just make sure you pay attention to the below listed details. 

  • Target column data type – let this field be your driver.
  • Source column data type(s) and how they are stored.
  • Quality of data in source columns.
  • Order of listing expressions in coalesce.

15 Redshift date functions frequently used by developers


This post on Redshift date functions is intended to simplify the core list of date functions. The 15 date functions with examples are the most commonly used ones by Redshift developers. If you need to reference the full list of date and timestamp functions, click here.

Before we get started, a few basics.

  • The default Redshift date format is YYYY-MM-DD.

  • To convert a date to a string use the Redshift to_char function as below.
to_char(current_date,'YYYYMMDD') = '20220407'

  • Similarly, to convert a string to date use the Redshift to_date function.
to_date('20220407','YYYYMMDD') = 2022-04-07

 

All 15 Redshift date functions are covered below in alphabetic order.

Importance of datepart in Redshift date functions

You will see datepart mentioned in more than a few Redshift date functions. They all refer to the same argument. Though simple, its syntax and usage can get confusing. So, let’s clear things up.

Date part is an argument used in Redshift date functions. It is a single lowercase word (datepart) used to denote a part of a date. This could be day, month, year and so on.

The value for the datepart argument is specified without quotes and in lowercase. For example, month is specified as mon.

The table below lists the different date parts and values in alphabetic order. The values listed are not the complete list. They were chosen for being intuitive and unique to remember.

You can also get the complete date part list from the Redshift documentation.

Date part (datepart) and the value to use in date functions:

  • Century: c
  • Day: d
  • Day of Week: dow
  • Day of Year: doy
  • Decade: dec
  • Epoch: epoch
  • Hour: hr
  • Microsecond: microsec
  • Millennium: mil
  • Millisecond: millisec
  • Minute: min
  • Month: mon
  • Quarter: qtr
  • Second: sec
  • Week: w
  • Year: yr


Another point to clarify: Redshift datepart is not the same as date_part. The first (datepart) is an argument, while the second (date_part) is a date function in Redshift.

1. add_months

Syntax: add_months(date, integer)

What it does: The Redshift add_months function adds the number of months, specified by integer, to a date value.

You can also use add_months to subtract months by specifying a negative integer.

When using this function, do not think in terms of days. For instance, if you add a month to the 31st of January, with add_months, the returned date will be the 28th of February. So once again, think in terms of number of months and not days.  

Example1: Add two months to a date
SELECT add_months('2022-03-01',2);
Output from SQL statement: 2022-05-01
Example2: Subtract two months from a date
SELECT add_months('2022-03-01',-2);
Output from SQL statement: 2022-01-01

2. current_date

Syntax: current_date

What it does: The Redshift current date function returns today’s date in the format YYYY-MM-DD from your session time zone. 

SELECT current_date;
Output from SQL statement: 2022-03-25

3. date_cmp

Syntax: date_cmp(date1, date2)

What it does: Redshift date_cmp compares 2 dates and returns 1 if date1 is greater than date2, -1 if date1 is less than date2 and 0 if both dates are equal. This function is a simplified version of the interval_cmp function.

Example1: date1 greater than date2
SELECT date_cmp('2022-03-25', '2022-03-10');
Output from SQL statement: 1
Example2: date1 less than date2
SELECT date_cmp('2022-03-10', '2022-03-25');
Output from SQL statement: -1
Example3: date1 equals date2
SELECT date_cmp(current_date, trunc(sysdate));
Output from SQL statement: 0

4. date_part_year

Syntax: date_part_year(date)

What it does: For a given date, the date_part_year function returns the year portion of the date in the format YYYY.

You can also use the date_part function to get year from a date. Unlike the date_part function, the date_part_year function only requires you to specify a date.

SELECT date_part_year(current_date);
Output from SQL statement: 2022

5. Redshift dateadd

Syntax: dateadd(datepart, interval, date)

What it does: The Redshift dateadd function adds the number specified by interval, in units of datepart, to a date and returns the resulting date.

It helps to think of the syntax as

dateadd( "What to add?", "How many to add?", "Which date to add to?")


Following the above simplified pseudo syntax, adding 30 days to the current date looks like this.

dateadd( days, 30, current_date)


Now let’s look at some real examples for Redshift dateadd.

     
  Example1: Add 5 days to a date using Redshift dateadd.

  SELECT dateadd(days, 5, '2020-08-16');
  Output from SQL statement: 2020-08-21

     
  Example2: Subtract 5 days from a date using dateadd in Redshift.

  SELECT dateadd(days, -5, '2020-08-16');
  Output from SQL statement: 2020-08-11

     
  Example3: Add 6 months to a date using dateadd function.

  SELECT dateadd(month, 6, '2020-08-16');
  Output from SQL statement: 2021-02-16

6. Redshift datediff

Syntax: datediff (datepart, date1, date2)

What it does: The Redshift datediff function returns the difference between two dates (date1 and date2) in the format specified by datepart.

The below 4 points are important if you want to use the Redshift datediff function correctly.

  1. Redshift datediff subtracts date1 from date2 (date2 minus date1). If date1 is later than date2, the result is a negative number.

  2. Similarly, if date1 is earlier than date2, you get a positive number.

  3. If you do not care about the sign, wrap the result in the absolute value function (abs).

  4. Redshift datediff does not return a cumulative difference broken down into years, months and days. Rather, it returns the number of datepart boundaries crossed between the two dates, expressed entirely in the datepart you specify.

    For example, if date1 equals 2021-July-04 and date2 equals 2022-July-04, the same pair of dates produces very different numbers depending on the datepart, as shown below (a boundary-crossing sketch follows this list).

    • datediff(day, '2021-07-04', '2022-07-04') = 365 days
    • datediff(month, '2021-07-04', '2022-07-04') = 12 months
    • datediff(year, '2021-07-04', '2022-07-04') = 1 year
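
Because datediff counts datepart boundaries rather than elapsed time, two dates that are only one day apart can still be a full "year" apart. A quick sketch to confirm the behavior on your own cluster:

SELECT datediff(year, '2021-12-31', '2022-01-01');   -- returns 1: one year boundary is crossed
SELECT datediff(day, '2021-12-31', '2022-01-01');    -- returns 1: one day boundary is crossed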

The table below lists the most commonly used datepart formats for the Redshift datediff function. Note, you have to specify the datepart without quotes as listed in the syntax column.

Date part       Syntax
Day             d, day
Week            w, week
Month           month
Quarter         qtr
Year            yr, year
Epoch           epoch
Decade          decade
Century         c, cent, century
Millennium      mil, millennium
Example1: Difference between two dates in days
SELECT datediff(day, '2020-08-16', '2020-08-26');
Output from SQL statement: 10
Example2: Difference between two dates in days across multiple years and months
SELECT datediff(day, '2019-07-16', '2020-08-26');
Output from SQL statement: 407

7. date_part

Syntax: date_part (datepart, date)

What it does: The Redshift date_part function returns a part of the date specified by datepart.

The date_part function is frequently used to get a month from date or year from date. 

I’ve mentioned this before but will do it again. Do not confuse date_part and datepart to be the same. They are not!

One is a function (date_part), while the other(datepart) is an argument in a function.

 

     
  Example1: Get day(s) from date using Redshift date_part.

  SELECT date_part(d, '2022-07-06');
  Output from SQL statement: 6

     
  Example2: Redshift get month from date using date_part.

  SELECT date_part(mon, '2022-07-07');
  Output from SQL statement: 7

     
  Example3: Redshift get year from date using date_part function.

  SELECT date_part(yr, '2022-07-07');
  Output from SQL statement: 2022

 

8. Redshift date_trunc

Syntax: date_trunc (‘datepart’, timestamp)

What it does: For a given timestamp, this function truncates the part specified by datepart. Note, the datepart for this function needs to be enclosed in single quotes.

SELECT date_trunc('year', timestamp '2020-08-15 06:25:10.234'); 
Output from SQL statement: 2020-01-01 00:00:00

9. extract

Syntax: extract (datepart FROM timestamp)

What it does: Number 9 on the list of Redshift date functions is the extract function. This is a versatile function and one I use frequently.

This function returns the extracted portion of day, month, week, year or time specified by datepart from a given timestamp.  

Example: Extract Day from current date
SELECT extract(day from current_date); 
Output from SQL statement: 27

10. getdate() or sysdate

Syntax: getdate() / sysdate

What it does: Returns the current date and time from the session time zone.

If all you need is the current date, use the current_date function. Alternatively, apply the trunc function to getdate() or sysdate to strip off the time portion; Example3 below illustrates this.

An important sysdate vs getdate() difference to keep in mind: although both functions return date and time information, sysdate returns the start date and time of the transaction it runs in, while getdate() returns the start date and time of the current statement within that transaction.

For example, use getdate() if you need to measure the execution time between different SQL statements in a transaction. If you do not care about the time portion, it does not matter whether you use getdate() or sysdate.
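
A minimal sketch of how you might observe the difference inside a transaction block; the exact timestamps will vary on your cluster.

BEGIN;
SELECT sysdate, getdate();   -- at the start of the transaction, both values are nearly identical
-- ... run one or more long statements here ...
SELECT sysdate, getdate();   -- sysdate still reports the transaction start time,
                             -- while getdate() reports the start of this statement
END;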

Example1: Using sysdate 
SELECT sysdate;
Output from SQL statement: 2011-07-21 10:32:38.248109
Example2: Using getdate() 
SELECT getdate();
Output from SQL statement: 2011-07-21 10:32:38.248109
Example3: How to get Redshift Current date from getdate() or sysdate
SELECT trunc(getdate());
Output from SQL statement: 2011-07-21
SELECT trunc(sysdate);
Output from SQL statement: 2011-07-21

11. interval_cmp

Syntax: interval_cmp(interval1, interval2)

What it does: This Redshift date function does the below: 

  • Returns 1: If interval1 is greater than interval2.
  • Returns -1: If interval1 is less than interval2.
  • Returns 0: If interval1 equals interval2.


An interval literal is used to denote a specific quantity of time. For example, 6 days, 9 weeks, 3 years

You specify an interval in quotes with a space between the quantity and the datepart

As an example, an interval of 6 months would be specified as ‘6 mon’. 

Example1: Interval1 greater than interval2
SELECT interval_cmp('3 years', '1 year');
Output from SQL statement: 1
Example2: Interval1 less than interval2 using date part abbreviation 
SELECT interval_cmp('1 y', '3 yrs');
Output from SQL statement: -1
Example3: Interval1 equals interval2 
SELECT interval_cmp('7 days', '1 week');
Output from SQL statement: 0

12. last_day

Syntax: last_day(date)

What it does: The Redshift last_day function returns the date of the last day of the month that contains the given date or timestamp. This comes in handy for month-end reporting.

SELECT last_day('2020-09-15');
Output from SQL statement: 2020-09-30

13. months_between

Syntax: months_between (date1, date2)

What it does: Returns the number of months between 2 dates.
Keep in mind, if the first date is earlier than the second date, then a negative number is returned. You can avoid this by specifying the later date first or by using the Redshift absolute value function (abs).

Example 1:
SELECT months_between('2022-03-20', '2022-02-20');
Output from SQL statement: 1
Example 2: Case where the lesser date is first and greater date is second.
SELECT months_between('2022-02-20', '2022-03-20');
Output from SQL statement: -1

 

Workaround option:
SELECT abs(months_between('2022-02-20', '2022-03-20'));
Output from SQL statement: 1

14. next_day

Syntax: next_day(date, day)

What it does: The Redshift next_day function returns the date of the first occurrence of the specified day of the week that falls after the given date.

The day portion of the function can be specified in the below formats:

 

  • Monday, Wednesday and Friday need a minimum of 1 character: M or Monday, W or Wednesday, F or Friday.
  • Tuesday, Thursday, Saturday and Sunday need a minimum of 2 characters: Tu or Tuesday, Th or Thursday, Sa or Saturday, Su or Sunday.
Example1
SELECT next_day('2022-03-20', 'M'); 
or
SELECT next_day('2022-03-20', 'Monday');
Output from SQL statement: 2022-03-21
Example2
SELECT next_day('2022-03-20', 'Tu');
Output from SQL statement: 2022-03-22

15. trunc

Syntax: trunc(timestamp)

What it does: The last on the list of Redshift date functions, this function returns the date portion of a given timestamp. Simple !

SELECT trunc('2022-02-21 11:21:42.248017');
Output from SQL statement: 2022-02-21

Additional Amazon Redshift links

Redshift Create table examples

10 examples on how to create tables in Redshift

Amazon Redshift Database Developer Guide

Link to the official current version from AWS


What is Amazon Redshift explained in 10 minutes or less


In simple words, Amazon Redshift or AWS Redshift is a Cloud based Data Warehouse service by Amazon Web Services (AWS).

There are two terminologies to pay attention to here – Cloud and Data Warehouse.


Cloud, short for Cloud Computing, refers to computing resources provided by a third party. These computing resources range from processing power and storage to applications and more complex offerings such as SaaS, PaaS and IaaS. 

A Data Warehouse is a repository to store large amounts of historical data meant for generating reports and performing analytics. If the data is in a structured or semi-structured format, then you can store it in Redshift.

Traditionally, most data warehouses are hosted on premise and managed by a team of System Administrators and Database Administrators. However, Amazon Redshift as a fully managed cloud service handles all aspects of scaling, capacity provisioning, cluster backup, patching and upgrading. That makes a huge difference!

The benefits of Cloud Computing are immense; however, for the sake of simplicity let’s just say it saves you a lot of money and heartburn.  

What type of database is Amazon Redshift?

Amazon Redshift is a Relational Database Management System (RDBMS) built upon PostgreSQL.

PostgreSQL, if you are not familiar, is a highly robust open-sourced Object-Relational database. It is popular with large companies like Apple, Instagram, Reddit, Skype and Twitch. That’s just naming a few.

However, just because the Redshift database is built upon PostgreSQL does not mean they are the same. Amazon Redshift db is highly optimized for Business Intelligence (BI) and Online Analytical Processing (OLAP).

Some of the optimizations include:

  • Data storage: The Redshift database uses columnar storage for its tables. Instead of storing an entire row of data from a table in a block, columnar storage stores the values of a single column together in the same block(s).

    For instance, consider a database table on customer address, with 200 rows and 10 columns. Let's assume the 5th column stores the ZIPCODE. With columnar storage, the ZIPCODE values for all 200 rows are stored together rather than scattered across 200 row blocks. This provides better performance on SQL execution and storage.

    Why are SQL executions against these tables faster?

    Because analytics queries normally touch only a handful of columns, rarely the entire row. With each column's data stored together, there are far fewer blocks to read (see the short query sketch after this list).

  • Data Compression: Compressing data saves storage space. Redshift by default compresses the columns in the table using RAW, AZ64 or LZO encoding. The encoding type is chosen based on the data type of the columns.


  • Query engine: The query execution engine leverages Redshift specific Massive Parallel Processing (MPP), Results caching and Compiled code distribution feature in addition to the columnar storage to increase execution speed, reduce execution time and improve system performance. 
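
As a concrete illustration of the columnar storage point above, an analytics query that touches only one or two columns reads far fewer blocks than one that needs the whole row. The customer_address table below is the hypothetical one from the example.

SELECT zipcode,
       count(*) AS customer_count
FROM customer_address
GROUP BY zipcode;   -- only the blocks holding the zipcode column need to be scanned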

What is the Amazon Redshift difference?

  • The Redshift architecture difference: At its core, Amazon Redshift is made of clusters. A cluster in turn is made up of one or more nodes. These nodes can be categorized into leader nodes and compute nodes.

    The leader node does the job of coordination and communication(engine), while the compute node does the heavy lifting (database).
  • Redshift support for unstructured data. You already know Amazon Redshift can handle semi-structured data in addition to the standard structured data, which is great! If you have a vast amount of unstructured data and want to generate analytics from it, Redshift has a solution for you.

    Say hello to Amazon Redshift Spectrum!

    Redshift Spectrum is a feature of Amazon Redshift which lets you query unstructured data stored in Amazon S3. You do not even have to load the data into the Redshift database. Matter of fact, you can even use Redshift Spectrum to query your structured and semi-structured data straight from Amazon S3.


  • Redshift tables are not the same. Tablespaces, table partitioning, and inheritance are not supported in Redshift. This might sound strange, but it really helps to improve performance.


  • Some table constraints are informational. Primary keys, Foreign keys and Unique constraints are defined but not enforced in Redshift. This means Redshift will not reject rows that violate them; if your application loads duplicate or otherwise bad data, it gets stored in the database.


  • Data in a Redshift table is stored in sorted order. When you load data into a Redshift table, the data is stored in sorted order. The sort order is determined by the sort keys specified when you create a table.

    If this makes you scratch your head, don't worry! The sorted data complements the Redshift columnar storage to give us highly efficient querying capabilities.


Related: Learn how to create tables in Redshift using examples

What is Amazon Redshift pricing model?

Pricing with any AWS Service is based on a Pay-as-you-go model. Similar to your water or electricity bill, you only pay for services used for the duration of the usage, without the need to sign any long-term contracts.

AWS offers a lot of flexibility when it comes to Amazon Redshift pricing. The best approach to maximize these benefits is to think in terms of environments: Sandbox/Prototyping, Development, Testing, Staging and Production. 

  • Sandbox/ Prototyping environment

    If you are playing around with the idea of Redshift, want to understand its features & functionality or build a quick prototype, consider the AWS Free Tier trial version of AWS Redshift.

    With this option you get up to 750 hours of free usage per month, for two months.

     

  • Development/ Test/Staging environment(s)

    These environments do not need to be up and operational 24/7. Your best option is On-Demand instance (Pay-as-you-go) pricing. With this option, you pay by the hour and can shut down instances when they are not in use, or when you no longer need them, so you don't get billed.

    If On-Demand instance is what you opt for, then you need to think of Amazon Redshift pricing in terms of Compute, Storage and Data Transfer as shown below. 

Compute: Dense Compute (DC2), Dense Storage (DS2), RA3 with Redshift Managed Storage
Storage: Redshift Managed Storage, Additional Backup
Data Transfer: Redshift Spectrum
  • Production environment(s)

     You want these environments to be up and operational with very little downtime. So, Reserved Instances are the best for these environments.

    AWS lets you choose instances for a 1–3-year term, and oftentimes, they can end up being cheaper than the Pay-as-you-go option.


An important point to remember: with AWS Reserved Instances, you are charged for the instances for the term you signed up for, regardless of whether you use them or not. The best part: the price includes two additional copies of your data, and AWS takes care of availability, backup, durability, monitoring, security and maintenance.

For additional details on Amazon Redshift pricing for reserved nodes, click here.


By now you should have a high-level understanding of how to approach Amazon Redshift pricing. Since cost can change, I recommend using the AWS Pricing Calculator for Amazon Redshift to get the most up-to-date details on pricing. 


Redshift helpful links

Amazon Redshift Documentation

This is the latest version of Redshift Documentation

Get started with Amazon Redshift Spectrum

Learn how to create external tables, schema and query data using Spectrum


SQL Add a New Column: 4 ways with examples


In this post on how to SQL add a new column, let us look at 4 different approaches to adding columns to a database table. But first, let’s start with some context.

A column is the smallest unit for capturing an object’s attribute. Let that sink in!

An attribute is nothing more than a property. For instance, if I want to capture information about a Person, what type of information would make sense?

What would uniquely define a person?

Name definitely, Height, Weight, Age, Gender, Race, Date of Birth just to name a few. It’s starting to make sense, isn’t it? 

Each of these attributes of a person is stored in a separate column. A grouping of these columns specific to a single person then constitutes a row in a table.

To limit the scope of this post, I won’t go past the above explanation. However, if you like to learn more, email me at info@obstkel.com and maybe I will write up another post.

Moving on to the next step, how do we sql add column to a table?  

For that you need to use DDL commands!

DDL stands for Data Definition Language and is associated with defining objects in a database. Depending on the database vendor, the commands considered DDL can vary slightly. However, CREATE, ALTER and DROP are considered DDL universally. 

The syntax used in these examples is Oracle specific. However, you should be able to use the same statements in any other database with minimal tweaks. 

Lastly, the techniques mentioned in the next 4 examples are the same if you want to sql add multiple columns or just a single column.

Now let’s dive into the below create column sql examples.

  1. SQL add a new column with CREATE TABLE
  2. SQL add a new column using DROP and Re-CREATE
  3. SQL add a new column using ALTER TABLE
  4. SQL add a new column using CREATE TABLE AS (CTAS)

1. SQL add a new column with CREATE TABLE

The best way to add column(s) is when creating a table using the CREATE DDL. 

Likewise, a best practice is to set column default values when you create your table. This, however, requires planning your data model well ahead of time. 

The example below shows how to create a table using CREATE DDL and add default column values for date fields, varchar and integer data types at the same time.

CREATE TABLE employees
(
employee_id      integer,
first_name       varchar(30) default 'John',
last_name        varchar(30) default 'Doe',
email            varchar(60) default 'john.doe@xyz.com',
phone            varchar(15) default '000-000-0000',
hire_date        date        default '1901-01-01',
sales_id         integer     default 0
);

2. Add new column(s) in SQL using DROP and Re-CREATE

In some cases, you do not really care about the data in your table. For instance, if you are working in a development environment and have dummy data or scrambled data. If this defines your situation, then the best option is to just copy the DDL for the existing table, drop the table and then recreate it with the new fields you need. The steps to follow this approach are listed below.

STEP 1: Copy the table DDL into a text editor

STEP 2: After that, drop the table

DROP TABLE employees;

STEP 3: Recreate the table with the new columns. In this case we add MIDDLE_NAME, SUFFIX and DOB to the new DDL.

In addition, you can specify column default values as we did in the previous example.

CREATE TABLE employees
(
employee_id      integer,
first_name       varchar(30) default 'John',
middle_name      varchar(30),
last_name        varchar(30) default 'Doe',
suffix           varchar(10),
email            varchar(60) default 'john.doe@xyz.com',
phone            varchar(15) default '000-000-0000',
dob              date,
hire_date        date        default '1901-01-01',
sales_id         integer     default 0
);

3. SQL add a new column using ALTER TABLE

Now let’s look at a third option to sql add a new column using an ALTER TABLE command.

If you have data in your table and do not want to lose it, or any of the constraints and permissions, then an ALTER TABLE command is the best. You can add a single column or multiple columns with constraints and data type to a single table using this statement. 

However, do keep in mind that an ALTER TABLE adds the column to the end of the table as the last column. 

Using the EMPLOYEES tables from the previous example, lets add MIDDLE_NAME to this table.

OPTION1 : Adding a single column with constraint and data type to a table.

ALTER TABLE employees 
ADD middle_name varchar2(30) NOT NULL;

OUTPUT: The table created from running the ALTER TABLE statement is shown below. Pay close attention to how the newly created column is appended to the end of the table. 

CREATE TABLE employees
(
employee_id      integer,
first_name       varchar2(30) default 'John',
last_name        varchar2(30) default 'Doe',
email            varchar2(60) default 'john.doe@xyz.com',
phone            varchar2(15) default '000-000-0000',
hire_date        date         default '1901-01-01',
sales_id         integer      default 0,
middle_name      varchar2(30) NOT NULL
);

OPTION 2 : Lets look at an example on how to add multiple columns in sql to a table using the ALTER TABLE statement.

ALTER TABLE employees 
ADD
(middle_name varchar2(30) NOT NULL,
suffix varchar2(10) NOT NULL,
dob date);

OUTPUT: The resulting table from executing the above ALTER TABLE statement is shown below. Once again, fields MIDDLE_NAME, SUFFIX and DOB are added to the end of the table. 

CREATE TABLE employees
(
employee_id      integer,
first_name       varchar2(30) default 'John',
last_name        varchar2(30) default 'Doe',
email            varchar2(60) default 'john.doe@xyz.com',
phone            varchar2(15) default '000-000-0000',
hire_date        date         default '1901-01-01',
sales_id         integer      default 0,
middle_name      varchar2(30) NOT NULL,
suffix           varchar2(10) NOT NULL,
dob              date
);

4. SQL add a new column using CREATE TABLE AS (CTAS)

The fourth and final way to sql add a new column to a table is using a CREATE TABLE AS (CTAS) statement. This is an advanced technique and might get you frustrated. But if you are trying to expand your SQL skills, definitely give this approach a shot.

CTAS creates a table based on a Select statement from another table. A little-known feature that most developers do not realize is that you can utilize a CTAS statement to SQL add a column to your table. Matter of fact you can add a column anywhere you please – the beginning, middle or the end. Dealers’ choice!

Let’s continue working with the EMPLOYEE table and assume that we want to add a couple of new columns – MIDDLE_NAME, SUFFIX and DOB (date of birth) in the 3rd, 5th and 8th position.

STEP 1: Select the fields you want from your table. In this case we select from the EMPLOYEES table

SELECT employee_id,first_name, last_name,email,phone,hire_date,sales_id 
FROM employees;

STEP 2: Add columns MIDDLE_NAME, SUFFIX and DOB in the position(s) you want

SELECT employee_id, first_name, middle_name, last_name, suffix, email, 
phone, dob, hire_date, sales_id 
FROM
employees;

STEP 3: Now comes the tricky part. We have to set a datatype for these new fields. For this we shall leverage the CAST function.  

SELECT 
employee_id,
first_name,
CAST (NULL as VARCHAR2 (30)) as middle_name,
last_name,
CAST (NULL as VARCHAR2 (10)) as suffix,
email,
phone,
CAST (NULL as DATE ) as dob,
hire_date,
sales_id
FROM employees;

STEP 4: The final create table as statement to sql add new columns is listed below. A couple of points to keep in mind.

  • You have to give the newly created table a different name initially. In this case we named it “EMPLOYEES_NEW”
  • EMPLOYEES_NEW now contains all the data that exists in the old EMPLOYEES table
  • After creating EMPLOYEES_NEW table, you have to drop the old table using “DROP TABLE EMPLOYEES;”  and rename the table by issuing “RENAME EMPLOYEES_NEW to EMPLOYEES;”
  • Keep in mind, any indexes, constraints and permissions on the initial EMPLOYEES table would need to be recreated or granted again on the new table
  • Lastly, please make sure you explore within your own schema and not anywhere near a production environment. 
CREATE TABLE employee_new AS
     SELECT
          employee_id,
          first_name,
          CAST (NULL as VARCHAR2 (30)) as middle_name,
          last_name,
          CAST (NULL as VARCHAR2 (10)) as suffix,
          email,
          phone,
          CAST (NULL as DATE) as dob,
          hire_date,
          sales_id
     FROM employees;

Trending questions on adding a new column

How to sql add column default value?

Setting defaults when creating a table is an absolute best practice. It helps us avoid NULL values and to some extend helps improve data quality. 

First, use the keyword ‘default’ when creating your table ddl. Then depending on the data type set your column default values. For character/string values and date fields, make sure to enclose the value in single quotes. Numeric column defaults do not have to be enclosed in single quotes.

Refer to the first example on CREATE TABLE for syntax and usage.


 

How can I sql create new column in query output?

When you query a table, you are not changing the structure of the underlying table. Creating a new column in the query output is as simple as adding a column alias.

You create a column alias by using the 'as' keyword, as shown in the example below.

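A minimal sketch using the EMPLOYEES table from the earlier examples:

SELECT first_name,
       last_name,
       first_name || ' ' || last_name AS full_name   -- new column that exists only in the query output
FROM employees;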

Wrapping things up

So let’s sum up the key takeaways from this post.

  • A single, standalone command to sql add a new column does not exist. You add columns with an ALTER TABLE statement, or by recreating the table using one of the other techniques above.

  • If you are frequently adding columns to your table, consider creating a child table with the new columns. 

  • The fastest technique to sql add columns is using the CTAS method. However, for this, you need to have a table or two with the columns you need.

And finally, a plug for our post on Amazon Athena!

SQL is not limited to relational databases. 

For instance, Spark SQL, a module of Apache Spark, lets users query structured data using a similar syntax. Similarly, Amazon Athena, a web service by AWS, lets users write SQL against data stored in files.

SQL helpful links

Oracle Database SQL Reference

For Oracle SQL syntax, expressions and function reference

MySQL Reference Manual

For MySQL related syntax, statements and examples

Transact SQL (T-SQL) Reference

For SQL Server transact SQL functions, examples and syntax


Athena SQL basics – How to write SQL against files


Athena SQL is the query language used in Amazon Athena to interact with data in S3. Mastering Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below.

    1. Athena Data Types
    2. Athena SQL Operators
    3. Athena SQL Functions
      1. Aggregate Functions
      2. Date Functions
      3. String Functions
      4. Window Functions
    4. Athena SQL DDL
    5. Athena SQL DML
 

Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.

What is Amazon Athena?

Amazon Athena is a web service by AWS used to analyze data in Amazon S3 using SQL.
It runs in the cloud and is part of the AWS Cloud Computing Platform.

In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.

What makes Amazon Athena different?

The key difference: unlike traditional SQL queries that run against tables in a database, Amazon Athena queries run against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache web logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.

Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.

Thirdly, Amazon Athena is serverless, which means provisioning capacity, scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.

1. Athena Data types

A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.

Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog. 

AWS Athena has 18 distinct data types, which are listed below in alphabetical order.

  • ARRAY
  • BIGINT
  • BINARY
  • BOOLEAN
  • CHAR
  • DATE
  • DECIMAL
  • DOUBLE
  • FLOAT
  • INT
  • INTEGER
  • MAP
  • SMALLINT
  • STRING
  • STRUCT
  • TIMESTAMP
  • TINYINT
  • VARCHAR 

2. Athena SQL Operators

An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.

Athena SQL has 9 different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.

The below table lists the Operator definitions and syntax in Athena SQL.

Operator        Description
<               Less than
>               Greater than
<=              Less than or equal to
>=              Greater than or equal to
=               Equal
<> or !=        Not equal
+               Addition
-               Subtraction
*               Multiplication
/               Division
%               Remainder or Modulus
||              Concatenate
AND             Logical AND
OR              Logical OR
NOT             Logical NOT

3. Athena SQL Functions

A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions, on the other hand, perform more complex computations, often on multiple columns at once.

Athena SQL Functions are broken down into 24 areas, which is way beyond the scope of this post. To keep things relevant, we will be focusing on the commonly used function categories.

  • Athena Aggregate Functions.
  • Athena String Functions.
  • Athena Date Functions.
  • Athena Window Functions.

3.1 Athena Aggregate Functions

In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.

The table below lists all the aggregate functions in Athena with the sql syntax.

approx_distinct(x): Returns the approximate number of distinct input values
approx_distinct(x, e): Returns the approximate number of distinct input values with a standard error less than e
approx_percentile(x, percentage): Returns the approximate percentile for all input values of x at the given percentage
approx_percentile(x, percentages): Returns the approximate percentile for all input values of x at each of the specified percentages
approx_percentile(x, w, percentage): Returns the approximate weighted percentile for all input values of x using the per-item weight w at the percentage p
approx_percentile(x, w, percentage, accuracy): Returns the approximate weighted percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy
approx_percentile(x, w, percentages): Returns the approximate weighted percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array
arbitrary(x): Returns an arbitrary non-null value of x
array_agg(x): Returns an array created from the input x elements
avg(x): Returns the average (arithmetic mean) of all input values
bitwise_and_agg(x): Returns the bitwise AND of all input values in 2's complement representation
bitwise_or_agg(x): Returns the bitwise OR of all input values in 2's complement representation
bool_and(boolean): Returns TRUE if every input value is TRUE, otherwise FALSE
bool_or(boolean): Returns TRUE if any input value is TRUE, otherwise FALSE
checksum(x): Returns an order-insensitive checksum of the given values
corr(y, x): Returns the correlation coefficient of input values
count(*): Returns the number of input rows
count(x): Returns the number of non-null input values
count_if(x): Returns the number of TRUE input values
covar_pop(y, x): Returns the population covariance of input values
covar_samp(y, x): Returns the sample covariance of input values
every(boolean): Alias for the bool_and() function
geometric_mean(x): Returns the geometric mean of all input values
histogram(x): Returns a map containing the count of the number of times each input value occurs
kurtosis(x): Returns the excess kurtosis of all input values
map_agg(key, value): Returns a map created from the input key / value pairs
map_union(x<K, V>): Returns the union of all the input maps
max(x): Returns the maximum value of all input values
max(x, n): Returns the n largest values of all input values of x
max_by(x, y): Returns the value of x associated with the maximum value of y over all input values
max_by(x, y, n): Returns n values of x associated with the n largest of all input values of y in descending order of y
min(x): Returns the minimum value of all input values
min(x, n): Returns the n smallest values of all input values of x
min_by(x, y): Returns the value of x associated with the minimum value of y over all input values
min_by(x, y, n): Returns n values of x associated with the n smallest of all input values of y in ascending order of y
multimap_agg(key, value): Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values
numeric_histogram(buckets, value): Computes an approximate histogram with up to buckets number of buckets for all values
numeric_histogram(buckets, value, weight): Computes an approximate histogram with up to buckets number of buckets for all values, with a per-item weight of weight
regr_intercept(y, x): Returns the linear regression intercept of input values. y is the dependent value, x is the independent value
regr_slope(y, x): Returns the linear regression slope of input values. y is the dependent value, x is the independent value
skewness(x): Returns the skewness of all input values
stddev(x): Alias for the stddev_samp() function
stddev_pop(x): Returns the population standard deviation of all input values
stddev_samp(x): Returns the sample standard deviation of all input values
sum(x): Returns the sum of all input values
var_pop(x): Returns the population variance of all input values
var_samp(x): Returns the sample variance of all input values
variance(x): Alias for the var_samp() function
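
As a quick illustration, aggregate functions in Athena are written exactly as they would be in a relational database. The web_logs table and its columns below are hypothetical, used only for this sketch.

SELECT page_url,
       count(*)                 AS total_hits,
       approx_distinct(user_id) AS approx_unique_visitors
FROM web_logs
GROUP BY page_url;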

3.2 Athena String Functions

Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings. 

Since Athena is based on Presto, Athena string functions map one-to-one to the Presto string functions. The table below lists the string functions and the Athena SQL syntax for each.


chr(n): Returns the Unicode code point n as a single character string
codepoint(string): Returns the Unicode code point of the only character of string
concat(string1, ..., stringN): Returns the concatenation of string1, string2, ..., stringN
from_utf8(binary): Decodes a UTF-8 encoded string from binary
from_utf8(binary, replace): Decodes a UTF-8 encoded string from binary, replacing invalid UTF-8 sequences with replace
length(string): Returns the length of string
levenshtein_distance(string1, string2): Returns the Levenshtein edit distance of string1 and string2
lower(string): Converts string to lowercase
lpad(string, size, padstring): Left pads string to size characters with padstring
ltrim(string): Removes leading whitespace from string
normalize(string): Transforms string with the NFC normalization form
normalize(string, form): Transforms string with the specified normalization form
position(substring IN string): Returns the starting position of the first instance of substring in string
replace(string, search): Removes all instances of search from string
replace(string, search, replace): Replaces all instances of search with replace in string
reverse(string): Returns string with the characters in reverse order
rpad(string, size, padstring): Right pads string to size characters with padstring
rtrim(string): Removes trailing whitespace from string
split(string, delimiter): Splits string on delimiter and returns an array
split(string, delimiter, limit): Splits string on delimiter and returns an array of size at most limit
split_part(string, delimiter, index): Splits string on delimiter and returns the field at position index
split_to_map(string, entryDelimiter, keyValueDelimiter): Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs
strpos(string, substring): Returns the starting position of the first instance of substring in string
substr(string, start): The Athena substring function; returns the subset of string starting at position start
substr(string, start, length): Returns length characters of string starting at position start
to_utf8(string): Encodes string into a UTF-8 varbinary representation
trim(string): Removes leading and trailing whitespace from string
upper(string): Converts string to uppercase
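
For example, split_part and upper can be combined to pull one piece out of a delimited string; the literal below is just for illustration.

SELECT upper(split_part('us-east-1a', '-', 2));   -- returns 'EAST'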

3.3 Athena Date Functions

Athena Date Functions have some quirks you need to be familiar with.  

  • Date functions listed without parentheses below do not require them.

  • The Unit parameter below can range from time to year. The valid unit values and formats are millisecond, second, minute, hour, day, week, month, quarter, year.

  • Athena Date and time format specifiers are listed in the table below.

%a    Abbreviated weekday name (Sun .. Sat)
%b    Abbreviated month name (Jan .. Dec)
%c    Month, numeric (0 .. 12)
%d    Day of the month, numeric (00 .. 31)
%e    Day of the month, numeric (0 .. 31)
%f    Fraction of second
%H    Hour (00 .. 23)
%h    Hour (01 .. 12)
%I    Hour (01 .. 12)
%i    Minutes, numeric (00 .. 59)
%j    Day of year (001 .. 366)
%k    Hour (0 .. 23)
%l    Hour (1 .. 12)
%M    Month name
%m    Month, numeric
%p    AM or PM
%r    Time, 12-hour
%s    Seconds (00 .. 59)
%T    Time, 24-hour
%v    Week (01 .. 53)
%W    Weekday name
%Y    Year, numeric, four digits
%y    Year, numeric, two digits

The table below lists the Athena date functions and what each one returns.

current_date: Returns the current date as of the start of the query
current_time: Returns the current time as of the start of the query
current_timestamp: Returns the current timestamp as of the start of the query
current_timezone(): Returns the current time zone
date_add(unit, value, timestamp): Adds an interval value of type unit to timestamp
date_diff(unit, timestamp1, timestamp2): Returns timestamp2 minus timestamp1 expressed in terms of unit
date_format(timestamp, format): Formats timestamp as a string using format
date_parse(string, format): Parses string into a timestamp using format
date_trunc(unit, x): Returns x truncated to unit
day(x): Returns the day of the month from x
day_of_month(x): This is an alias for day()
day_of_week(x): Returns the ISO day of the week from x
day_of_year(x): Returns the day of the year from x
extract(field FROM x): Returns field from x, where field can be DAY, DAY_OF_MONTH, DAY_OF_WEEK, DAY_OF_YEAR, HOUR, MINUTE, MONTH, QUARTER, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE, WEEK, YEAR, YEAR_OF_WEEK
format_datetime(timestamp, format): Formats timestamp as a string using format
from_iso8601_date(string): Parses the ISO 8601 formatted string into a date
from_iso8601_timestamp(string): Parses the ISO 8601 formatted string into a timestamp with time zone
from_unixtime(unixtime): Returns the UNIX timestamp unixtime as a timestamp
from_unixtime(unixtime, hours, minutes): Returns the UNIX timestamp unixtime as a timestamp with time zone, using hours and minutes as the time zone offset
from_unixtime(unixtime, string): Returns the UNIX timestamp unixtime as a timestamp with time zone, using string as the time zone
hour(x): Returns the hour of the day from x
localtime: Returns the current time as of the start of the query
localtimestamp: Returns the current timestamp as of the start of the query
minute(x): Returns the minute of the hour from x
month(x): Returns the month of the year from x
now(): This is an alias for current_timestamp
parse_datetime(string, format): Parses string into a timestamp with time zone using format
quarter(x): Returns the quarter of the year from x
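
A short sketch showing date_parse and date_format together, using the format specifiers from the table above:

SELECT date_parse('2022-07-04', '%Y-%m-%d')                AS parsed_timestamp,
       date_format(current_timestamp, '%Y-%m-%d %H:%i:%s') AS formatted_now;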

3.4 Athena Window Functions

Window functions in Athena operate over a group of rows (the window) defined by an OVER clause. They fall into three types, listed below.

Aggregate functions: Any of the aggregate functions listed earlier can be used as a window function by adding the OVER clause.

Ranking functions
cume_dist(): Returns the cumulative distribution of a value in a group of values
dense_rank(): Returns the rank of a value in a group of values, with no gaps in the ranking
ntile(n): Divides the rows for each window partition into n buckets ranging from 1 to at most n
percent_rank(): Returns the percentage ranking of a value in a group of values
rank(): Returns the rank of a value in a group of values
row_number(): Returns a unique, sequential number for each row, starting with one

Value functions
first_value(x): Returns the first value of the window
last_value(x): Returns the last value of the window
nth_value(x, offset): Returns the value at the specified offset from the beginning of the window
lead(x[, offset[, default_value]]): Returns the value at offset rows after the current row in the window
lag(x[, offset[, default_value]]): Returns the value at offset rows before the current row in the window
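
A short, hypothetical sketch (the orders table and its columns are made up) showing a ranking function with the OVER clause:

SELECT customer_id,
       order_date,
       order_total,
       row_number() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_recency_rank
FROM orders;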

4. Athena SQL DDL Clauses

DDL stands for Data Definition Language and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.

Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar.


Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For instance, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.

The list below covers the 26 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.

1. ALTER DATABASE SET DBPROPERTIES
2. ALTER TABLE ADD COLUMNS
3. ALTER TABLE ADD PARTITION
4. ALTER TABLE DROP PARTITION
5. ALTER TABLE RENAME PARTITION
6. ALTER TABLE REPLACE COLUMNS
7. ALTER TABLE SET LOCATION
8. ALTER TABLE SET TBLPROPERTIES
9. CREATE DATABASE
10. CREATE TABLE
11. CREATE TABLE AS
12. CREATE VIEW
13. DESCRIBE TABLE
14. DESCRIBE VIEW
15. DROP DATABASE
16. DROP TABLE
17. DROP VIEW
18. MSCK REPAIR TABLE
19. SHOW COLUMNS
20. SHOW CREATE TABLE
21. SHOW CREATE VIEW
22. SHOW DATABASES
23. SHOW PARTITIONS
24. SHOW TABLES
25. SHOW TBLPROPERTIES
26. SHOW VIEWS
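
To make the DDL side concrete, below is a minimal sketch of an Athena CREATE EXTERNAL TABLE statement over CSV files in S3. The table name, column names and bucket path are placeholders for illustration; swap in your own.

CREATE EXTERNAL TABLE web_logs (
  request_time string,
  user_id      string,
  page_url     string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket-name/web-logs/';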

5. Athena SQL DML Clauses

DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.

When an Athena SQL DML statement is executed, it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL.


Currently, the only Athena SQL DML supported is the SELECT statement.
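
Querying that metadata looks just like querying a database table; Athena reads the underlying S3 files at query time. A short sketch using the placeholder web_logs table from the DDL example above:

SELECT page_url,
       count(*) AS hits
FROM web_logs
GROUP BY page_url
ORDER BY hits DESC
LIMIT 10;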


Related: CloudFormation Parameters: Make your life simple 


Helpful links

AWS Athena CLI

Interact with Athena using shell commands from Windows PowerShell, Linux or remotely.

AWS Athena Documentation

This is the latest user guide version of AWS Athena Documentation


14 Redshift Data Types to make you the office hero(2022)


Redshift data types are not a whole lot different from the standard relational database types. 

Relational Databases store data in tables, which are made up of rows and columns. A column is the smallest granularity of logical data storage. Each of these columns have attributes associated with it. A Redshift Data Type, in this context defines the attributes of a column.

There are 4 categories of built-in Redshift data types: Character, Numeric, Datetime and Boolean. Knowing these data types and their attributes is key to writing quality DDL statements in Redshift.

The tables below list the types within each of these categories.

Character Redshift data types

CHAR

The important stuff about CHAR:

  • A CHAR in Redshift is a fixed length character string with a maximum length of 4096 bytes.

  • CHAR, CHARACTER, NCHAR are the same data types in Redshift.

  • You declare a CHAR data type as shown below –

char(10) or character(10) or nchar(10)

Let us look at an example of creating a table in Redshift with the char data type.

Since a char datatype uses up the entire allocated space, use char types for small fields. For larger character fields, use VARCHAR.

CREATE TABLE employees
(
marital_status char(1) default 'U'
);

VARCHAR

The important stuff about VARCHAR:

  • In Redshift, VARCHAR is a variable length character data type string.

  • The default length of VARCHAR is 256.

  • The Redshift VARCHAR max length is 65,535 bytes.

  • VARCHAR, NVARCHAR, TEXT and CHARACTER VARYING are the same data types in Redshift.

  • You declare a VARCHAR data type as shown below. 
varchar(20) or nvarchar(10) or text(10) or character varying(10)

Below is an example of a redshift create table statement with two VARCHAR fields, first name and last name.

CREATE TABLE employees
(
first_name varchar(30),
last_name varchar(30)
);

Redshift Numeric data types

An incorrectly defined Redshift numeric datatype can wreak havoc on performance and throw off your calculations. So, let's focus on the simple basics! 

A Redshift numeric data type is used to store numbers, we all know that. But what kind of numbers?

  • Integers – Also known as whole numbers. Picture you at a grocery store buying apples. One apple, a dozen (12) apples, and so on. You do not buy half an apple!

  • Decimals – These are numbers where a quantity less than one is denoted on the right of a decimal point. Example 2.5, 9.9, 12.5. 
    For instance, assuming you are still at the grocery store, and decide to buy potatoes or onions. You could be picking up a 2.5 lb bag. That’s an example of a decimal.

  • Floating-Point – A floating point number is similar to a decimal, except that the number of digits to the right of the decimal point can vary. Your paycheck for instance could be 950.50, 950.333, 950.0154.

Rule of thumb, if the number of digits to the right of the decimal is constant, use a decimal type. If they vary based on computation, then use a floating-point.

Read the above definitions a couple of times, and let it sink in. 
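
Putting the three categories together, a hypothetical grocery_orders table might use a different numeric type for each kind of value:

CREATE TABLE grocery_orders
(
order_id       bigint,         -- large whole numbers
apple_count    smallint,       -- small whole numbers
potato_weight  decimal(5,2),   -- fixed digits to the right of the decimal point, e.g. 2.50
register_total float           -- variable precision floating-point value
);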

SMALLINT

The important stuff about SMALLINT:

  • A Redshift smallint uses 2 bytes of storage.

  • Use a Redshift smallint data type to store whole numbers in the range -32,768 to +32,767.

  • Syntax for a SMALLINT is

smallint or int2

INTEGER

The important stuff about INTEGER:

  • Use the INTEGER data type in Redshift to store whole numbers in the range -2,147,483,648 to +2,147,483,647.

  • Syntax for an INTEGER is –

integer or int or int4

BIGINT

The important stuff about BIGINT:

  • If you need to store really large whole numbers in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, then use the Redshift BIGINT data type.

  • A Redshift BIGINT can store up to 8 bytes of information.

  • Syntax for a BIGINT is –

bigint or int8

DECIMAL

The important stuff about DECIMAL:

  • DECIMAL uses up to 128 bits (16 bytes) to store numeric data as signed integers with a precision of up to 38 digits.

  • If you need to store numbers with scale and precision, then use the Redshift DECIMAL data type.

  • Precision refers to the sum of the digits to the left and right of the decimal point. The default precision is 18 and the max precision limit is 38.

  • Scale refers to the number of digits to the right of the decimal point. The default scale is 0, and the max scale can be as high as 37.

  • Syntax for a Redshift DECIMAL data type is –

decimal(precision, scale)

Now let’s look at an example on defining a Redshift decimal column.

Example1: Define a Redshift decimal column based on the number 3219.22
  • Precision: Number of digits to the left of the decimal + number of digits to the right of the decimal = 6
  • Scale: Number of digits to the right of the decimal = 2
  • Column definition = decimal (6,2)

REAL

The important stuff about REAL:

  • Use the REAL or FLOAT4 data type to store numbers with up to 6 digits of variable precision.

  • Syntax for a REAL data type is –

real or float4

FLOAT

The important stuff about FLOAT:

  • FLOAT stores numeric data with up to 15 digits of variable precision.

  • Syntax for a Redshift FLOAT data type is –
float or float8 or double precision

DateTime data types

DATE

The important stuff about Redshift DATE data type:

  • The DATE data type uses 4 bytes to store the Calendar date in the default format YYYY-MM-DD.

  • The date range goes from 4713 BC to 294276 AD.

  • Syntax for a DATE data type is as shown below.

date

TIMESTAMP

The important stuff about TIMESTAMP:

  • TIMESTAMP uses 8 bytes to store date and time of day in default format YYYY-MM-DD HH:MI:SS.

  • This type does not include TIME ZONE. 

  • Similar to the DATE data type, the range goes from 4713 BC to 294276 AD.

  • Syntax for a Redshift TIMESTAMP is

timestamp

TIME

The important stuff about TIME:

  • TIME uses 8 bytes to store the time of day without the TIME ZONE.

  • For displaying time in a 24-hour clock format use HH24:MI:SS.

  • If you are displaying time in a 12-hour clock format, then use HH12:MI:SS.

  • Syntax for TIME is –

time

TIMETZ

The important stuff about TIMETZ:

  • TIMETZ uses 8 bytes to store the time of day with the time zone,

  • Syntax for Redshift time of day with time zone is –

timetz

TIMESTAMPTZ

The important stuff about TIMESTAMPTZ:

  • To capture timestamp with the time zone, use TIMESTAMPTZ.

  • TIMESTAMPTZ uses 8 bytes to store data in the format YYYY-MM-DD HH:MI:SS TZ.

  • Syntax for a Redshift timestamp with time zone type is –

timestamptz

Redshift Boolean Data Type

BOOLEAN

The important stuff about Boolean data type:

  • A Redshift Boolean data type is a single byte column used to store true or false values.

  • You can use '1', 't', 'y', 'yes', 'true' or 'TRUE' to represent a True value in your input.

  • False values can be represented as ‘0’, ‘f’, ‘n’, ‘no’, ‘false’ or ‘FALSE‘ in the input.

  • Unknowns are represented as NULL.

  • The syntax for a Boolean data type in Redshift is –

boolean
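
A minimal sketch of a hypothetical table with a Boolean column, showing a few of the accepted input representations:

CREATE TABLE subscriptions
(
subscription_id integer,
is_active       boolean default TRUE
);

INSERT INTO subscriptions VALUES (1, 't');    -- stored as true
INSERT INTO subscriptions VALUES (2, 'no');   -- stored as false
INSERT INTO subscriptions VALUES (3, NULL);   -- unknown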

Trending questions on Redshift Data Types

What are Redshift column types?

Redshift column types are the same as Redshift data types.

Tables in a database are made up of one or more columns. Each column is intended to store a certain kind of data. Depending on the type of data stored, the column in a table needs to be different.

For example, a NAME column requires a data type of Character, whereas a PRICE column requires a Numeric type. 

A data type in Redshift is the attribute of a single column in a table. In other words, a Redshift data type is the property used to define the attribute of a Redshift column.

Often, we tend to use column and field interchangeably as well. So don’t let that confuse you.

In summary, if I have 3 columns in a table with FIRST_NAME, LAST_NAME and MIDDLE_NAME, the data types for them (assume VARCHAR) are the same as the Redshift column types.


Redshift helpful links

Amazon Redshift Documentation

This is the latest version of Amazon Redshift Documentation

Amazon Redshift & Analytics

Another great blog post by Nick Corbett, AWS Professional Services on Agile Analytics with Amazon Redshift

