Athena SQL basics – How to write SQL against files

athen sql basics

Understanding Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below

  1. Athena Data Types
  2. Athena SQL Operators
  3. Athena SQL Functions
  4. Athena SQL DDL 
  5. Athena SQL DML

Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.

Whats is Amazon Athena ?

Amazon Athena, is a web service by AWS used to analyze data in Amazon S3 using SQL.

It runs in the Cloud (or a server) and is part of the AWS Cloud Computing Platform. In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.

What makes Amazon Athena different ?

The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.

Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.

Thirdly, Amazon Athena is serverless, which means provisioning capacity,  scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.

1. Athena Data types

A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.

Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog. 

AWS Athena has 16 distinct data types, which are listed below.

Data Type
Description
BOOLEAN
Denotes if value is TRUE or FALSE
TINYINT
An 8 bit signed integer
SMALLINT
A 16 bit signed integer
INT
INT data type in Athena DDL and a 32 bit signed value for other queries
BIGINT
A 64 bit signed integer
DOUBLE
An 8 byte double precision floating point number
FLOAT
A 4 byte single precision floating point number
DECIMAL [ (precision, scale) ]
Represents decimal values, with precision equals the total number of digits,
and scale the number of digits in the fraction
CHAR
Denotes fixed length character data with a maximum length of 255
VARCHAR
Denotes variable length character data with a maximum length 65535
BINARY
Used for raw byte data. Mostly used for data in Parquet format
DATE
Date in YYYY-MM-DD format
TIMESTAMP
Date and time combined YYYY-MM-DD HH:MM:SS[.f…]
ARRAY < data_type >
A collection of values of a specific data type
MAP < primitive_type, data_type >
Key-value pairs made of AWS Athena data types
STRUCT < col_name : data_type [COMMENT col_comment] [, …] >
Represents multiple fields of a value/item

2. Athena SQL Operators

An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.

Athena SQL has nine different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.

The below table lists the Operator definitions and syntax in alphabetic order.

Type
Operator
Description
Array Operator
[ ]
Used to access an element of an array and is indexed starting from 1
||
Used to concatenate an array with an array or an element of the same type
Comparison Operators
<
Less than
>
Greater than
<=
Less than or equal to
>=
Greater than or equal to
=
Equals to
<> or !=
Not equal to
Decimal Operators
Subtraction
+
Addition
x
Multiplication
/
Division
%
Modulus
Date and Time Operators
+
Used to add dates or timestamps in conjunction with interval functions
Used to subtract dates or timestamps in conjuntion with interval functions
JSON Operators
CAST
Used to cast data in Athena SQL data types into JSON format and vice versa
Logical Operators
AND
TRUE if both values A and B are true
OR
TRUE if either A or B is true
NOT
TRUE if the value is false
Map Operator
[ ]
Used to retrieve the value corresponding to a given key from a map
Mathematical Operators
* refer decimal operators
Same as Decimal Operators
String Operators
||
Used to perform concatenation

3. Athena SQL Functions

A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions on the other hand performs complex computations on multiple columns simultaneously.

Athena SQL has 17 function categories as listed below.

Let us start by looking at the four most commonly used categories: Athena Aggregate Functions, Athena String Functions, Athena Date Functions and Athena Window Functions.

Athena Aggregate Functions

In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.

The table below lists all the aggregate functions in Athena.

Function
Description
approx_distinct(x )Returns the approximate number of distinct input values
approx_distinct(x, e)Returns the approximate number of distinct input values with a standard error less than e
approx_percentile(x, percentage )Returns the approximate percentile for all input values of x at the given percentage
approx_percentile(x, percentages )Returns the approximate percentile for all input values of x at each of the specified percentages
approx_percentile(x, w, percentage )Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p
approx_percentile(x, w, percentage, accuracy)Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy
approx_percentile(x, w, percentages)Returns the approximate weighed percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array
arbitrary(x)Returns an arbitrary non-null value of x
array_agg(x)Returns an array created from the input x elements
avg(x)Returns the average (arithmetic mean) of all input values
bitwise_and_agg(x)Returns the bitwise AND of all input values in 2’s complement representation
bitwise_or_agg(x )Returns the bitwise OR of all input values in 2’s complement representation
bool_and(boolean )Returns TRUE if every input value is TRUE, otherwise FALSE
bool_or(boolean)Returns TRUE if any input value is TRUE, otherwise FALSE
checksum(x )Returns an order-insensitive checksum of the given values
corr(y, x)Returns correlation coefficient of input values
count(* )Returns the number of input rows
count(x )Returns the number of non-null input values
count_if(x )Returns the number of TRUE input values
covar_pop(y, x)Returns the population covariance of input values
covar_samp(y, x)Returns the sample covariance of input values
every(boolean)Alias for bool_and() function
geometric_mean(x )Returns the geometric mean of all input values
histogram(x)Returns a map containing the count of the number of times each input value occurs
kurtosis(x)Returns the excess kurtosis of all input values
map_agg(key, value)Returns a map created from the input key / value pairs
map_union(x<K, V>)Returns the union of all the input maps
max(x)Returns the maximum value of all input values
max(x, n)Returns n largest values of all input values of x
max_by(x, y)Returns the value of x associated with the maximum value of y over all input values
max_by(x, y, n )Returns n values of x associated with the n largest of all input values of y in descending order of y
min(x)Returns the minimum value of all input values
min(x, n )Returns n smallest values of all input values of x
min_by(x, y )Returns the value of x associated with the minimum value of y over all input values
min_by(x, y, n )Returns n values of x associated with the n smallest of all input values of y in ascending order of y
multimap_agg(key, value)Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values
numeric_histogram(buckets, value)Computes an approximate histogram with up to buckets number of buckets for all values
numeric_histogram(buckets, value, weight )Computes an approximate histogram with up to buckets number of buckets for all values with a per-item weight of weight
regr_intercept(y, x)Returns linear regression intercept of input values. y is the dependent value. x is the independent value
regr_slope(y, x)Returns linear regression slope of input values. y is the dependent value. x is the independent value
skewness(x )Returns the skewness of all input values
stddev(x)Alias for stddev_samp() function
stddev_pop(x )Returns the population standard deviation of all input values
stddev_samp(x )Returns the sample standard deviation of all input values
sum(x)Returns the sum of all input values
var_pop(x)Returns the population variance of all input values
var_samp(x)Returns the sample variance of all input values
variance(x )Alias for var_samp() function

Athena String Functions

Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings. 

Since Athena is based on Presto, Athena String functions are a one to one match between the two.  The table below lists string functions, and the Athena SQL syntax for it.

Function
Description
chr(n )Returns the Unicode code point n as a single character string
codepoint(string)Returns the Unicode code point of the only character of string
concat(string1, …, stringN)Returns the concatenation of string1, string2, …, stringN
from_utf8(binary)Decodes a UTF-8 encoded string from binary
from_utf8(binary, replace )Decodes a UTF-8 encoded string from binary
length(string)This function returns the length of string in Athena
levenshtein_distance(string1, string2)Returns the Levenshtein edit distance of string1 and string2
lower(string)Converts string to lowercase
lpad(string, size, padstring)Left pads string to size characters with padstring
ltrim(string)Removes leading whitespace from string
normalize(string)Transforms string with NFC normalization form
normalize(string, form)Transforms string with the specified normalization form
position(substring IN string)Returns the starting position of the first instance of substring in string
replace(string, search )Removes all instances of search from string
replace(string, search, replace )Replaces all instances of search with replace in string
reverse(string)Returns string with the characters in reverse order
rpad(string, size, padstring)Right pads string to size characters with padstring
rtrim(string )Removes trailing whitespace from string
split(string, delimiter )Splits string on delimiter and returns an array
split(string, delimiter, limit)Splits string on delimiter and returns an array of size at most limit
split_part(string, delimiter, index)Splits string on delimiter and returns the field index
split_to_map(string, entryDelimiter, keyValueDelimiter)Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs
strpos(string, substring )Returns the starting position of the first instance of substring in string
substr(string, start )This Athena substring function returns a subset of a given string starting at position start
substr(string, start, length )If you want a specific number of characters (length) from a starting position, then use this alternate version of the Athena substring function
to_utf8(string)Encodes string into a UTF-8 varbinary representation
trim(string)Removes leading and trailing whitespace from string
upper(string )Converts string to uppercase

Athena Date Functions

Function
Description
current_dateReturns the current date as of the start of the query
current_timeReturns the current time as of the start of the query
current_timestampReturns the current timestamp as of the start of the query
current_timezone( )Returns the current time zone
date_add(unit, value, timestamp)Adds an interval value of type unit to timestamp
date_diff(unit, timestamp1, timestamp2)Returns timestamp2 – timestamp1 expressed in terms of unit
date_format(timestamp, format)Formats timestamp as a string using format
date_parse(string, format )Parses string into a timestamp using format
date_trunc(unit, x )Returns x truncated to unit
day(x)Returns the day of the month from x
day_of_month(x )This is an alias for day()
day_of_week(x)Returns the ISO day of the week from x
day_of_year(x )Returns the day of the year from x
extract(field FROM x )Returns field from x
format_datetime(timestamp, format)Formats timestamp as a string using format
from_iso8601_date(string)Parses the ISO 8601 formatted string into a date
from_iso8601_timestamp(string)Parses the ISO 8601 formatted string into a timestamp with time zone
from_unixtime(unixtime )Returns the UNIX timestamp unixtime as a timestamp
from_unixtime(unixtime, hours, minutes )Returns the UNIX timestamp unixtime as a timestamp with time zone using hours and minutes
from_unixtime(unixtime, string)Returns the UNIX timestamp unixtime as a timestamp
hour(x )Returns the hour of the day from x
localtimeReturns the current time as of the start of the query
localtimestampReturns the current timestamp as of the start of the query
minute(x )Returns the minute of the hour from x
month(x)Returns the month of the year from x
now()This is an alias for current_timestamp
parse_datetime(string, format )Parses string into a timestamp with time zone using format
quarter(x)Returns the quarter of the year from x

Athena Window Functions

Type
Function
Description
Aggregate Function* Refer Aggregate Functions tabAgrregate Functions can be used as window functions by adding the OVER clause
Ranking Functioncume_dist( )Returns the cumulative distribution of a value in a group of values
Ranking Functiondense_rank()Returns the rank of a value in a group of values
Ranking Functionntile(n)Divides the rows for each window partition into n buckets ranging from 1 to at most n
Ranking Functionpercent_rank()Returns the percentage ranking of a value in group of values
Ranking Functionrank( )Returns the rank of a value in a group of values
Ranking Functionrow_number( )Returns a unique, sequential number for each row, starting with one
Value Functionfirst_value(x)Returns the first value of the window
Value Functionlast_value(x)Returns the last value of the window
Value Functionnth_value(x, offset)Returns the value at the specified offset from beginning the window
Value Functionlead(x[, offset[, default_value]] )Returns the value at offset rows after the current row in the window
Value Functionlag(x[, offset[, default_value]])Returns the value at offset rows before the current row in the window starting at 0

4. Athena SQL DML Clauses

DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.

When an Athena SQL DML statement is executed it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL. Currently, the only Athena SQL DML supported is the SELECT statement.

5. Athena SQL DDL Clauses

DDL stands for Data Definition Language, and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.

Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar. Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For example, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.

The table below lists the 24 DDL statements supported in Athena SQL.

Statement
Description
Example
ALTER DATABASE SET DBPROPERTIES
Creates one or more properties for a database
ALTER DATABASE test_db
SET DBPROPERTIES (‘owner’=’Name’, ‘department’=’Finance’);
ALTER TABLE ADD PARTITION
Creates one or more partition columns for the table
Option1:
ALTER TABLE CUSTOMERS ADD PARTITION (region = ‘EAST’);
Option2:
ALTER TABLE CUSTOMERS ADD PARTITION (region = ‘EAST’) LOCATION ‘s3://test-data/customers’;
ALTER TABLE DROP PARTITION
Drops one or more partitions from the table
ALTER TABLE CUSTOMERS DROP PARTITION (region = ‘EAST’);
ALTER TABLE RENAME PARTITION
Renames an existing partition to a new name
ALTER TABLE CUSTOMERS PARTITION (region = ‘EAST’) RENAME TO PARTITION (region = ‘WEST’);
ALTER TABLE
SET LOCATION
Changes the location of a given table
ALTER TABLE CUSTOMERS PARTITION (region = ‘EAST’) SET LOCATION ‘s3://test-data/customers’;
ALTER TABLE SET TBLPROPERTIES
Add custom metadata properties to a table
ALTER TABLE CUSTOMERS SET TBLPROPERTIES (‘comments’=”Backup before dropping”);
CREATE DATABASE
Creates a database/schema
Option1:
      CREATE DATABASE customers;
Option2:
      CREATE DATABASE customers
      COMMENT ‘Cusotmers data for the last 2 years’
      LOCATION ‘s3://test-data/customers’
      WITH DBPROPERTIES (‘owner’=’Name’, ‘data-type’=’customers’);
CREATE TABLE
Creates a table with the data in S3
CREATE EXTERNAL TABLE customers
(
firstname string,
lastname string,
middlename string,
region string,
annual_total double,
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION
‘s3://test-data/’;
CREATE TABLE AS
Create table from the results of a select query
CREATE TABLE customers_subset
AS SELECT * FROM customers;
CREATE VIEW
Creates a logical table from a select query
CREATE VIEW customers_vw AS
SELECT firstname, middlename, lastname FROM customers;
DESCRIBE TABLE
Describes the column attributes of a table
DESCRIBE customers;
DESCRIBE VIEW
Describes the column attributes of a view
DESCRIBE customers_vw;
DROP
Can be used to drop a TABLE, VIEW or DATABASE
DROP DATABASE test_db;
DROP TABLE customers;
DROP VIEW customers_vw;
MSCK REPAIR TABLE
Recovers partitions and data associated with partitions
MSCK REPAIR TABLE customers;
SHOW
Can be used to view the details of COLUMNS,DATABASES,
PARTITIONS, TABLES, VIEWS, TBLPROPERTIES, CREATE TABLE and CREATE VIEW
SHOW COLUMNS IN customers;
SHOW test_db;
SHOW PARTITIONS customers;
SHOW TABLES IN test_db;
SHOW VIEWS IN test_db;
SHOW TBLPROPERTIES customers;
SHOW CREATE TABLE customers;
SHOW CREATE VIEW customers_vw;

Athena SQL Related Resources

AWS Athena Documentation

This is the latest user guide version of AWS Athena Documentation

AWS Athena Limits

Learn about the service limits of AWS Athena

AWS Athena CLI

AWS Athena CLI (Command Line Interface) allows users to interact with Athena using shell commands from Windows PowerShell, Linux or remotely.

Amazon Athena Pricing

Learn how data format, compression, partitioning and storage, factors into Amazon Athena Pricing.

Email us at : info@obstkel.com

Copyright 2020 © OBSTKEL LLC. All rights Reserved.
Scroll to Top