One of Starburst Galaxy’s many uses is as an interactive query engine to federate multiple data sources together. With the ability to query both workloads, Starburst Galaxy becomes a single point of access and allows you to access your data wherever it lives.
For this tutorial, analyze one of the datasets in the COVID-19 data lake and run interactive queries to discover the proper insight required to federate this data with the TPC-H dataset. Then, write a SQL query with both data sources to find the total case count by region.
This guide walks you through:
Want to see it in action? Watch a video that demonstrates this tutorial.
accountadmin
role.Catalogs contain the configuration and connection information needed to access a data source. To gain this access, configure a catalog and use it in a cluster.
This tutorial uses the public COVID-19 data lake that lives on AWS, which simplifies the setup process. Instead of creating an S3 catalog, where you are required to have access to an AWS account, the data for this tutorial is accessible without charge and without AWS credentials. You connect to the AWS COVID-19 data lake dataset catalog.
aws_covid
.COVID data from AWS public
dataset
.US East (Ohio) region
to the catalog as this is the only
region where the dataset lives.Select Save access controls on the Set permissions page to add the default access controls.
A Starburst Galaxy cluster provides the resources necessary to run queries against your catalogs. Use the Starburst Galaxy Cluster explorer to access the catalog data exposed by running clusters.
aws-covid-cluster
.aws_covid
, tpch
.Confirm that you have connected your cluster to both the aws_covid
and tpch
catalogs by exploring the aws-covid-cluster
in the Cluster explorer. If
you do not have both catalogs, navigate to the Admin > Clusters pane and
edit your cluster.
After configuring the catalog and cluster, you are directed to the query editor to continue the tutorial. Notice that the location drop-down menus in the top left corner have already selected the proper cluster and catalog for querying.
The Starburst Galaxy access control system uses roles to provide users with privileges for clusters, catalogs, schemas, tables and other types of entities, such as object storage locations.
Configure the accountadmin
role to access the COVID-19 data lake location.
(But if you have already gone through these steps at the direction of a
different Starburst Galaxy tutorial, there is no need to repeat them.)
accountadmin
role name.accountadmin
role.
s3://covid19-lake/*
.You now have access to all the data sources available in the COVID-19 data lake.
Navigate to the Query editor. Validate that the location drop-down menus in the top left corner match the cluster and catalog previously created.
Create a schema to hold the COVID-19 table.
CREATE SCHEMA query_federation;
In the top left corner, select the query_federation
schema so that
you can easily run the tutorial queries.
The enigma_jhu dataset provides the Global Coronavirus Data and is sourced from John Hopkins and provided by Enigma. This data tracks confirmed COVID-19 cases in provinces, states, and countries across the world, while also providing a county level breakdown in the United States.
Edit your saved query to match the following SQL command to create the table.
CREATE TABLE enigma_jhu (
fips VARCHAR,
admin2 VARCHAR,
province_state VARCHAR,
country_region VARCHAR,
last_update VARCHAR,
latitude DOUBLE,
longitude DOUBLE,
confirmed INTEGER,
deaths INTEGER,
recovered INTEGER,
active INTEGER,
combined_key VARCHAR
)
WITH (
format = 'json',
EXTERNAL_LOCATION = 's3://covid19-lake/enigma-jhu/json/')
;
Run a SELECT *
command to view your results. You can also use the table’s
options menu to generate the command for you in the query editor.
SELECT * FROM enigma_jhu LIMIT 10;
Notice that the admin2
column is actually the county, but was improperly
named. You can also see that the case information is an aggregate value of each
previously updated timestamp. We account for this as you query.
Starburst Galaxy allows you to run one query and return values that join together information from multiple data sources. In this example first investigate the data provided from the COVID-19 data lake and then join this with the TPC-H dataset which is stored in an entirely different location.
Analyze the enigma_jhu table to determine the most valuable insights to be
reported. Run a SELECT
query to return most of the pertinent values
and observe the nature of the data within the table.
SELECT
fips,
province_state,
country_region,
last_update,
confirmed,
recovered,
active
FROM
enigma_jhu
ORDER BY
fips;
There are many different last_update
times. Evaluate one specific FIPS code
for more information on the data contained within the table.
SELECT
fips,
province_state,
country_region,
last_update,
confirmed,
recovered,
active
FROM
enigma_jhu
WHERE
fips = '36121'
ORDER BY
last_update DESC;
Look at the confirmed case count. It’s an aggregate, adding the previous
confirmed cases to any additionally confirmed cases. Only evaluate data that has
a last_update
equal to the maximum value.
Run a query to find the most recent update.
SELECT max(last_update) FROM enigma_jhu;
Only evaluate data that has a last_update
equal to the maximum value.
SELECT
fips,
province_state,
country_region,
last_update,
confirmed
FROM
enigma_jhu
WHERE
last_update = '2020-05-30T02:32:48'
ORDER BY
last_update DESC;
This query is successful for any location that has a last_update
value
equal to that maximum entry. However, this is not accurate for any location that
has a different last_update
value not equivalent to the maximum entry as
their total count will be excluded. Run the following example focusing on the
state of Utah that this issue.
SELECT DISTINCT
fips
FROM
enigma_jhu
WHERE
province_state = 'Utah'
AND fips NOT IN
(
SELECT
fips
FROM
enigma_jhu
WHERE
province_state = 'Utah'
AND last_update = '2020-05-30T02:32:48'
);
The query returns all the distinct FIPS codes in Utah that do not have an entry
containing the maximum last_update
value. Look specifically at one FIPS
code within Utah.
SELECT * FROM enigma_jhu WHERE fips = '49005' ORDER BY last_update DESC;
The maximum last_update
for Cache, Utah is on April 16th, 2020, which is
at least one month before the final May 30th, 2020 date. Therefore, you need a
different solution for calculating the sum of confirmed cases. Add a first
value window function to
create a column identifying the most recent update for each fips
.
SELECT
fips,
admin2 AS county,
province_state,
country_region,
confirmed,
first_value(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
enigma_jhu;
To only return records that contain the latest last_update
value for each
fips
, run a nested query which only selects values where
last_update = most_recent
.
SELECT
fips,
county,
province_state,
country_region,
confirmed,
last_update
FROM
( SELECT
fips,
admin2 AS county,
province_state,
country_region,
confirmed,
first_value(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
enigma_jhu
) cases
WHERE
last_update = most_recent
GROUP BY
fips,
county,
province_state,
country_region,
confirmed,
last_update;
You can also run this same query using a WITH
statement.
WITH
cases AS (
SELECT
fips,
admin2 AS county,
province_state,
country_region,
confirmed,
first_value(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
enigma_jhu
)
SELECT
fips,
county,
province_state,
country_region,
confirmed,
last_update
FROM
cases
WHERE
last_update = most_recent
GROUP BY
fips,
county,
province_state,
country_region,
confirmed,
last_update;
The TPC-H dataset provides two tables of interest: the nation table and the region table. Both tables assist in aggregating the total confirmed cases per region.
Run a query to achieve familiarity with the nation table.
SELECT * FROM tpch.tiny.nation LIMIT 10;
There are 25 nations within the table.
Run a query to achieve familiarity with the region table.
SELECT * FROM tpch.tiny.region LIMIT 10;
There are 5 regions in the table.
The region_key
in the region table acts as the foreign key in the nation
table. Visit the TPC-H dataset page for more
information on the relationships within the dataset.
To determine the region aggregation, join the data lake table with the TPC-H nation table. Only consider the 25 countries that are accounted for in the nation table.
Run the query to append the COVID-19 data with the proper region. Notice this query joins together data from two different data sources.
SELECT
country_region,
first_value(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update,
confirmed,
nationkey,
name,
regionkey
FROM
aws_covid.query_federation.enigma_jhu enigma
INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = nation.name
ORDER BY
confirmed DESC;
This query is a good start; however, it only returns 24 of the 25 countries. Run a distinct query to identify which country is missing.
SELECT DISTINCT
name
FROM
tpch.tiny.nation
WHERE
name NOT IN (
SELECT
UPPER(country_region)
FROM
aws_covid.query_federation.enigma_jhu);
The United States is assigned the country_region value of ‘US’ in the Enigma data. Fix the join query to account for this mismatch.
SELECT
country_region,
confirmed,
nationkey,
name,
regionkey,
FIRST_VALUE(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
aws_covid.query_federation.enigma_jhu enigma
INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = REPLACE(nation.name, 'UNITED STATES', 'US')
ORDER BY
nationkey DESC;
Add another inner join to append the region information to each record.
SELECT
country_region,
nationkey,
confirmed,
region.name AS region_name,
FIRST_VALUE(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
aws_covid.query_federation.enigma_jhu enigma
INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = replace(nation.name, 'UNITED STATES', 'US')
INNER JOIN tpch.tiny.region region ON nation.regionkey = region.regionkey;
Now, put the puzzle pieces together to fulfill the initial ask of aggregating the confirmed case count by region.
WITH
cases AS (
SELECT
country_region,
nationkey,
confirmed,
region.name AS region_name,
FIRST_VALUE(last_update) OVER (
PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
last_update
FROM
aws_covid.query_federation.enigma_jhu enigma
INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = REPLACE(nation.name, 'UNITED STATES', 'US')
INNER JOIN tpch.tiny.region region ON nation.regionkey = region.regionkey
)
SELECT
SUM(confirmed) AS total_confirmed_cases,
region_name
FROM
cases
WHERE
last_update = most_recent
GROUP BY
region_name
ORDER BY
total_confirmed_cases DESC;
Now that you have explored the query federation capabilities of Starburst Galaxy, explore another tutorial, or connect your own data.
Is the information on this page helpful?
Yes
No