This page describes how to view statistics and data quality information about a table or view using the Quality tab offered at the table or view level of the catalog explorer. This feature is available for object storage catalogs only.
The Data quality section provides the following information:
The Number of records chart plots the changes in the table’s total number of records over the last 30 days. Refreshing the statistics data creates a new data point.
This section describes the information included in your data profile, and walks you through how to create a data quality rule.
The Column profile button displays the number of columns in the table and displays the following statistics:
varchar
.null
.null
.Min value: The minimum value for columns with numerical data.
Non-applicable columns are labeled null
.
To reorder the statistics table, click a column name to sort by that column. Click the column name again to reverse the sort order.
Data-driven organizations need quick access to data to make critical decisions. With Starburst Galaxy’s data profiling capabilities, you can view the shape of your data and create rules to monitor data quality. These are limited to object store connectors and a limited number of metrics consisting of null value percentage, minimum and maximum values, number of unique values, and volume of data.
Click the Applied rules button to view your data quality monitoring rules, or to create a new rule. The Applied rules tab displays a list of information about each rule such as name, status, description, expression or SQL statement, and severity level.
To create a new quality rule:
Validity
, Completeness
, Accuracy
, Consistency
, and None
.Low
to Critical
.Click themore_vertoptions menu to evaluate, edit, or delete any quality check rule.
You can write an expression to determine the scope of a data quality monitoring rule. For more details and examples of expression-based quality checks, see Data quality expressions in the Reference navigation menu.
You can write SQL statements that evaluate to a boolean that can be run on a
schedule to validate the data. There are two methods for writing a SQL-based
data quality check: a Common Table Expression (CTE) that generates rows and is
followed by a COUNT
CASE
statement, or a subquery embedded in a CASE
statement.
An example using the CTE method that returns one or more rows if the quality
check fails. If COUNT(*)
of CTE evaluates to 0
, return TRUE
, otherwise
return FALSE
:
WITH cte AS(
SELECT-based <data quality check>
)
SELECT
CASE
WHEN COUNT(*) = 0 THEN TRUE
ELSE FALSE
END
FROM cte
An example using the subquery method:
SELECT
CASE
WHEN avg(totalprice) > 5000 THEN TRUE
ELSE FALSE
END
FROM "tpch"."tiny"."orders"
You can author checks for a single column attribute, multiple attribute checks across multiple columns, or attribute checks at the row-level.
A quality check can be executed manually, or scheduled as a cron job. Checks run on a user’s own cluster and incur compute costs.
;
.The data quality dashboard is available in the Metrics tab of the catalog explorer for the catalog, schema, and table levels.
Dashboard visibility depends on the active role set’s table privileges. The role must have at least the Select from table privilege on the table of interest to be able to see a dashboard created by others. All of the table’s column’s must be visible to the role, with no DENY privileges on any column.
To create a new data quality check routine, or to edit or delete one, the role must have ownership of the table or schema of interest or have the Manage table data observability privilege on the table of interest or its containing schema or catalog.
Data quality results are recorded and displayed in the data quality dashboard. The dashboard lists useful metrics such as the total number of checks executed, the number of passed and failed quality checks, and the number of unprocessed quality checks. The unprocessed category includes quality checks that have not been scheduled or evaluated manually.
Additionally, a graph that displays data quality trends over the last 30 days is available.
To view the status of all quality checks, click View all checks. For a more
detailed view of passed, failed, or unprocessed quality checks, click the
corresponding STATUS_CATEGORY
link for the status category of
interest.
Each link opens a dialog that organizes quality checks in the following columns:
STATUS_CATEGORY
checks: The total number of quality checks executed.The Quality checks section lets you see all quality checks grouped by
passed, failed, or unprocessed status. Select a status category from the
drop-down menu to see the top 4 quality checks in the specified category. To see
more quality checks, click View more STATUS_CATEGORY
checks.
You can also view a graphical summary of the categories and severity levels of your quality checks:
At the table or view level of the catalog explorer, use the top section of the Quality pane to schedule data optimization tasks for object storage tables, and to view profile data for the current table or view.
Choose a cluster in the Select cluster menu, then click Show profile stats. If you have already collected statistics for the current table, this button is instead labeled Refresh stats.
Use the same drop-down menu to validate query rules with Evaluate quality checks.
Collecting statistics might take some time if the selected cluster needs to restart.
To schedule a data optimization job, click event_repeat Set optimization schedule. Provide the following information in the Configure data optimization dialog:
Maintenance task | Description | Iceberg | Delta Lake | Hive | Hudi |
---|---|---|---|---|---|
Compaction | Improves performance by optimizing your data file size. | check_circle | check_circle | ||
Profiling and statistics | Improves performance by analyzing the table and collecting statistics about your data. | check_circle | check_circle | check_circle | check_circle |
Data retention | Reduces storage by deleting data snapshots according to the number of days you specify. You must specify a retention period between 7 and 30 days. | check_circle | |||
Vacuuming | Reduces storage by deleting orphaned data files. | check_circle | check_circle | ||
Evaluate custom data quality checks | Evaluates custom SQL data quality checks. | check_circle | check_circle | check_circle |
In the Job schedule section:
For Select frequency: Choose an hourly, daily, weekly, monthly, or annual schedule from the drop-down menu. The corresponding values depend on the schedule:
hh:mm
, then specify AM or PM.hh:mm
, specify AM or PM, then
select a day of the week.hh:mm
, specify AM or PM, then
select a date.MM/DD
hh:mm
. Specify AM or PM.For Enter cron expression: Enter the desired schedule in the form of a UNIX cron expression. For example, a cycle scheduled to run weekly at 9:30 AM on Monday, Wednesday, and Friday:
30 9 * * 1,3,5
Any scheduled data optimization jobs appear in the header section. Use the Configure data optimization to edit or delete a data optimization job.
Is the information on this page helpful?
Yes
No