
Athena SQL is the query language used in Amazon Athena to interact with data in S3. Mastering Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below.
Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.
Amazon Athena is a web service by AWS used to analyze data in Amazon S3 using SQL.
It runs in the Cloud (or a server) and is part of the AWS Cloud Computing Platform.
In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.
The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.
Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.
Thirdly, Amazon Athena is serverless, which means provisioning capacity, scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.
A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.
Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog.
AWS Athena has 18 distinct data types, which are listed below in alphabetical order.
An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.
Athena SQL has 9 different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.
The below table lists the Operator definitions and syntax in Athena SQL.
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
= | Equal |
<> or != | Not equal |
+ | Addition |
– | Subtraction |
* | Multiplication |
/ | Division |
% | Remainder or Modulus |
|| | Concatenate |
AND | Logical AND |
OR | Logical OR |
NOT | Logical NOT |
A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions on the other hand performs complex computations on multiple columns simultaneously.
Athena SQL Functions are broken down into 24 areas, which is way beyond the scope of this post. To keep things relevant, we will be focusing on the commonly used function categories.
In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.
The table below lists all the aggregate functions in Athena with the sql syntax.
| approx_distinct(x ) | Returns the approximate number of distinct input values |
| approx_distinct(x, e) | Returns the approximate number of distinct input values with a standard error less than e |
| approx_percentile(x, percentage ) | Returns the approximate percentile for all input values of x at the given percentage |
| approx_percentile(x, percentages ) | Returns the approximate percentile for all input values of x at each of the specified percentages |
| approx_percentile(x, w, percentage ) | Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p |
| approx_percentile(x, w, percentage, accuracy) | Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy |
| approx_percentile(x, w, percentages) | Returns the approximate weighed percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array |
| arbitrary(x) | Returns an arbitrary non-null value of x |
| array_agg(x) | Returns an array created from the input x elements |
| avg(x) | Returns the average (arithmetic mean) of all input values |
| bitwise_and_agg(x) | Returns the bitwise AND of all input values in 2’s complement representation |
| bitwise_or_agg(x ) | Returns the bitwise OR of all input values in 2’s complement representation |
| bool_and(boolean ) | Returns TRUE if every input value is TRUE, otherwise FALSE |
| bool_or(boolean) | Returns TRUE if any input value is TRUE, otherwise FALSE |
| checksum(x ) | Returns an order-insensitive checksum of the given values |
| corr(y, x) | Returns correlation coefficient of input values |
| count(* ) | Returns the number of input rows |
| count(x ) | Returns the number of non-null input values |
| count_if(x ) | Returns the number of TRUE input values |
| covar_pop(y, x) | Returns the population covariance of input values |
| covar_samp(y, x) | Returns the sample covariance of input values |
| every(boolean) | Alias for bool_and() function |
| geometric_mean(x ) | Returns the geometric mean of all input values |
| histogram(x) | Returns a map containing the count of the number of times each input value occurs |
| kurtosis(x) | Returns the excess kurtosis of all input values |
| map_agg(key, value) | Returns a map created from the input key / value pairs |
| map_union(x<K, V>) | Returns the union of all the input maps |
| max(x) | Returns the maximum value of all input values |
| max(x, n) | Returns n largest values of all input values of x |
| max_by(x, y) | Returns the value of x associated with the maximum value of y over all input values |
| max_by(x, y, n ) | Returns n values of x associated with the n largest of all input values of y in descending order of y |
| min(x) | Returns the minimum value of all input values |
| min(x, n ) | Returns n smallest values of all input values of x |
| min_by(x, y ) | Returns the value of x associated with the minimum value of y over all input values |
| min_by(x, y, n ) | Returns n values of x associated with the n smallest of all input values of y in ascending order of y |
| multimap_agg(key, value) | Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values |
| numeric_histogram(buckets, value) | Computes an approximate histogram with up to buckets number of buckets for all values |
| numeric_histogram(buckets, value, weight ) | Computes an approximate histogram with up to buckets number of buckets for all values with a per-item weight of weight |
| regr_intercept(y, x) | Returns linear regression intercept of input values. y is the dependent value. x is the independent value |
| regr_slope(y, x) | Returns linear regression slope of input values. y is the dependent value. x is the independent value |
| skewness(x ) | Returns the skewness of all input values |
| stddev(x) | Alias for stddev_samp() function |
| stddev_pop(x ) | Returns the population standard deviation of all input values |
| stddev_samp(x ) | Returns the sample standard deviation of all input values |
| sum(x) | Returns the sum of all input values |
| var_pop(x) | Returns the population variance of all input values |
| var_samp(x) | Returns the sample variance of all input values |
| variance(x ) | Alias for var_samp() function |
Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings.
Since Athena is based on Presto, Athena String functions are a one-to-one match between the two. The table below lists string functions, and the Athena SQL syntax for it.
| chr(n ) | Returns the Unicode code point n as a single character string |
| codepoint(string) | Returns the Unicode code point of the only character of string |
| concat(string1, …, stringN) | Returns the concatenation of string1, string2, …, stringN |
| from_utf8(binary) | Decodes a UTF-8 encoded string from binary |
| from_utf8(binary, replace ) | Decodes a UTF-8 encoded string from binary |
| length(string) | This function returns the length of string in Athena |
| levenshtein_distance(string1, string2) | Returns the Levenshtein edit distance of string1 and string2 |
| lower(string) | Converts string to lowercase |
| lpad(string, size, padstring) | Left pads string to size characters with padstring |
| ltrim(string) | Removes leading whitespace from string |
| normalize(string) | Transforms string with NFC normalization form |
| normalize(string, form) | Transforms string with the specified normalization form |
| position(substring IN string) | Returns the starting position of the first instance of substring in string |
| replace(string, search ) | Removes all instances of search from string |
| replace(string, search, replace ) | Replaces all instances of search with replace in string |
| reverse(string) | Returns string with the characters in reverse order |
| rpad(string, size, padstring) | Right pads string to size characters with padstring |
| rtrim(string ) | Removes trailing whitespace from string |
| split(string, delimiter ) | Splits string on delimiter and returns an array |
| split(string, delimiter, limit) | Splits string on delimiter and returns an array of size at most limit |
| split_part(string, delimiter, index) | Splits string on delimiter and returns the field index |
| split_to_map(string, entryDelimiter, keyValueDelimiter) | Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs |
| strpos(string, substring ) | Returns the starting position of the first instance of substring in string |
| substr(string, start ) | This Athena substring function returns a subset of a given string starting at position start |
| substr(string, start, length ) | If you want a specific number of characters (length) from a starting position, then use this alternate version of the Athena substring function |
| to_utf8(string) | Encodes string into a UTF-8 varbinary representation |
| trim(string) | Removes leading and trailing whitespace from string |
| upper(string ) | Converts string to uppercase |
Athena Date Functions have some quirks you need to be familiar with.
%a | Abbreviated weekday name (Sun .. Sat) | %I | Hour (01 .. 12) | %r | Time, 12-hour |
%b | Abbreviated month name (Jan .. Dec) | %i | Minutes, numeric (00 .. 59) | %s | Seconds (00 .. 59) |
%c | Month, numeric (0 .. 12) | %j | Day of year (001 .. 366) | %T | Time, 24-hour |
%d | Day of the month, numeric (00 .. 31) | %k | Hour (0 .. 23) | %v | Week (01 .. 53) |
%e | Day of the month, numeric (0 .. 31) | %l | Hour (1 .. 12) | %W | Weekday name |
%f | Fraction of second | %M | Month | %Y | Year, numeric, four digits |
%H | Hour (00 .. 23) | %m | Month in numeric | %y | Year, numeric (two digits) [2] |
%h | Hour (01 .. 12) | %p | AM or PM | %r | Time, 12-hour |
Function | Description |
current_date | Returns the current date as of the start of the query |
current_time | Returns the current time as of the start of the query |
current_timestamp | Returns the current timestamp as of the start of the query |
current_timezone( ) | Returns the current time zone |
date_add(unit, value, timestamp) | Adds an interval value of type unit to timestamp |
date_diff(unit, timestamp1, timestamp2) | Returns timestamp2 – timestamp1 expressed in terms of unit |
date_format(timestamp, format) | Formats timestamp as a string using format |
date_parse(string, format) | Parses string into a timestamp using format |
date_trunc(unit, x) | Returns x truncated to unit |
day(x) | Returns the day of the month from x |
day_of_month(x) | This is an alias for day() |
day_of_week(x) | Returns the ISO day of the week from x |
day_of_year(x) | Returns the day of the year from x |
extract(field FROM x) | Returns field from x where field can be DAY, DAY_OF_MONTH, DAY_OF_WEEK, DAY_OF_YEAR, HOUR, MINUTE, MONTH, QUARTER, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE, WEEK, YEAR, YEAR_OF_WEEK |
format_datetime(timestamp, format) | Formats timestamp as a string using format |
from_iso8601_date(string) | Parses the ISO 8601 formatted string into a date |
from_iso8601_timestamp(string) | Parses the ISO 8601 formatted string into a timestamp with time zone |
from_unixtime(unixtime) | Returns the UNIX timestamp unixtime as a timestamp |
from_unixtime(unixtime, hours, minutes) | Returns the UNIX timestamp unixtime as a timestamp with time zone using hours and minutes |
from_unixtime(unixtime, string) | Returns the UNIX timestamp unixtime as a timestamp |
hour(x) | Returns the hour of the day from x |
localtime | Returns the current time as of the start of the query |
localtimestamp | Returns the current timestamp as of the start of the query |
minute(x) | Returns the minute of the hour from x |
month(x) | Returns the month of the year from x |
now() | This is an alias for current_timestamp |
parse_datetime(string, format) | Parses string into a timestamp with time zone using format |
quarter(x) | Returns the quarter of the year from x |
Type | Function | Description |
|---|---|---|
Aggregate Function | * Refer Aggregate Functions tab | Agrregate Functions can be used as window functions by adding the OVER clause |
Ranking Function | cume_dist( ) | Returns the cumulative distribution of a value in a group of values |
Ranking Function | dense_rank() | Returns the rank of a value in a group of values |
Ranking Function | ntile(n) | Divides the rows for each window partition into n buckets ranging from 1 to at most n |
Ranking Function | percent_rank() | Returns the percentage ranking of a value in group of values |
Ranking Function | rank( ) | Returns the rank of a value in a group of values |
Ranking Function | row_number( ) | Returns a unique, sequential number for each row, starting with one |
Value Function | first_value(x) | Returns the first value of the window |
Value Function | last_value(x) | Returns the last value of the window |
Value Function | nth_value(x, offset) | Returns the value at the specified offset from beginning the window |
Value Function | lead(x[, offset[, default_value]] ) | Returns the value at offset rows after the current row in the window |
Value Function | lag(x[, offset[, default_value]]) | Returns the value at offset rows before the current row in the window starting at 0 |
DDL stands for Data Definition Language and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.
Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar.
Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For instance, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.
The table below lists the 24 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.
1. ALTER DATABASE SET DBPROPERTIES | 10. CREATE TABLE | 19. SHOW COLUMNS |
2. ALTER TABLE ADD COLUMNS | 11. CREATE TABLE AS | 20. SHOW CREATE TABLE |
3. ALTER TABLE ADD PARTITION | 12. CREATE VIEW | 21. SHOW CREATE VIEW |
4. ALTER TABLE DROP PARTITION | 13. DESCRIBE TABLE | 22. SHOW DATABASES |
5. ALTER TABLE RENAME PARTITION | 14. DESCRIBE VIEW | 23. SHOW PARTITIONS |
6. ALTER TABLE REPLACE COLUMNS | 15. DROP DATABASE | 24. SHOW TABLES |
7. ALTER TABLE SET LOCATION | 16. DROP TABLE | 25. SHOW TBLPROPERTIES |
8. ALTER TABLE SET TBLPROPERTIES | 17. DROP VIEW | 26. SHOW VIEWS |
9. CREATE DATABASE | 18. MSCK REPAIR TABLE |
|
DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.
When an Athena SQL DML statement is executed, it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL.
Currently, the only Athena SQL DML supported is the SELECT statement.
Interact with Athena using shell commands from Windows PowerShell, Linux or remotely.
This is the latest user guide version of AWS Athena Documentation
| Cookie | Duration | Description |
|---|---|---|
| cookielawinfo-checkbox-analytics | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics". |
| cookielawinfo-checkbox-functional | 11 months | The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional". |
| cookielawinfo-checkbox-necessary | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary". |
| cookielawinfo-checkbox-others | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other. |
| cookielawinfo-checkbox-performance | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance". |
| viewed_cookie_policy | 11 months | The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data. |