Starburst Galaxy

  • Data maintenance #

    From the Starburst Galaxy navigation menu, select Data > Data maintenance.

    Data maintenance jobs run tasks that improve performance and reduce storage in Apache Iceberg tables. Supported tasks include data file compaction, statistics collection, and deletion of outdated snapshots and orphaned files.
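
    For orientation, the four supported tasks resemble statements that the Trino Iceberg connector exposes for manual table maintenance. The sketch below builds those statements as strings. It is an illustration only: the function name, the example catalog, schema, and table names, and the `7d` retention value are assumptions, and this page does not state that Galaxy issues exactly these statements when a job runs.

```python
# Illustrative mapping from maintenance task names to Trino Iceberg statements.
# Assumption: Galaxy's maintenance jobs may not run exactly these statements.

def maintenance_statements(catalog: str, schema: str, table: str,
                           retention: str = "7d") -> dict[str, str]:
    """Return one SQL statement per maintenance task for the given table."""
    name = f'"{catalog}"."{schema}"."{table}"'
    return {
        # Rewrites small data files into larger ones.
        "Compaction": f"ALTER TABLE {name} EXECUTE optimize",
        # Collects table and column statistics.
        "Profiling and statistics": f"ANALYZE {name}",
        # Removes snapshots older than the retention threshold (hypothetical value).
        "Snapshot expiration":
            f"ALTER TABLE {name} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Removes files that no snapshot references any more.
        "Delete orphan files":
            f"ALTER TABLE {name} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
    }


if __name__ == "__main__":
    # "lake", "sales", and "orders" are placeholder names.
    for task, sql in maintenance_statements("lake", "sales", "orders").items():
        print(f"{task}: {sql}")
```

    Running these statements by hand is one way to preview what a task does to a single table before scheduling it for a whole schema.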

    To perform data maintenance operations on live tables, see data maintenance for Kafka streaming ingestion and data maintenance for file ingestion.

    The Data maintenance pane has the following levels:

    • Top: Shows a list of catalogs that include at least one maintenance task.
    • Catalog: For the current catalog, shows a list of its schemas with maintenance tasks.
    • Schema: For the current schema, shows a list of its tables with maintenance tasks. You can create, edit, or delete tasks that apply to the entire schema from this level.
    • Table: Shows the current table’s defined maintenance tasks and run history. From this level, you can run, edit, or delete the current data maintenance job.

    Create data maintenance job #

    To create a data maintenance job, click Create maintenance task in the top, catalog, or schema details levels.

    Provide the following information in the Configure data maintenance dialog:

    • In the Maintenance target section:
      • Specify a catalog from the drop-down menu. If you opened this dialog from the catalog or schema details levels, this field is pre-populated.
      • Use the All schemas toggle to include all schemas, or turn it off and select individual schemas from the drop-down menu. This selection automatically stops maintenance jobs for schemas that are removed. To include schemas with separate schedules, see Edit data maintenance jobs.
      • Click the All tables toggle to include all tables. This option only appears if an individual schema was previously selected. If chosen, this selection automatically applies maintenance tasks to all future Iceberg tables created as part of the previously selected schema. This selection also automatically stops maintenance jobs for tables that are removed. To include tables with separate schedules, see Edit data maintenance jobs.
      • Select the Select tables radio button to display the table drop-down menu. Expand the menu, and select one or more tables.

    • In the Maintenance tasks section, select at least one maintenance task:

      | Maintenance task         | Description                                                                                                    |
      |--------------------------|----------------------------------------------------------------------------------------------------------------|
      | Compaction               | Improves performance by optimizing your data file size.                                                        |
      | Profiling and statistics | Improves performance by analyzing the table and collecting statistics about your data.                         |
      | Snapshot expiration      | Reduces storage by deleting data snapshots.                                                                    |
      | Delete orphan files      | Reduces storage by deleting orphaned data files. This rule includes files that are not part of a table.        |

    • In the Execution details section, select an executing role and a cluster from the respective drop-down menus.

    • In the Job schedule section:

      • Select a Time zone from the drop-down menu.
      • Choose a format for the recurring interval: Select frequency or Enter cron expression.

      For Select frequency: Choose an hourly, daily, weekly, monthly, or annual schedule from the drop-down menu. The corresponding values depend on the schedule:

      • Hourly: Enter a value between 0 minutes and 59 minutes.
      • Daily: Enter a time in the format hh:mm, then specify AM or PM.
      • Weekly: Enter a time in the format hh:mm, specify AM or PM, then select one or more days of the week.
      • Monthly: Enter a time in the format hh:mm, specify AM or PM, then select a date.
      • Annually: Enter a month, day, hour, and minutes in the format MM/DD hh:mm. Specify AM or PM.

      For Enter cron expression: Enter the desired schedule in the form of a UNIX cron expression. For example, the following expression schedules a job to run weekly at 9:30 AM on Monday, Wednesday, and Friday:

    30 9 * * 1,3,5
    
    • Click Save.
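
    To check how a five-field UNIX cron expression such as `30 9 * * 1,3,5` is read, the following minimal sketch tests whether a given date and time matches an expression. The helper names `expand_field` and `cron_matches` are hypothetical, and the matcher is deliberately simplified: it handles numbers, lists, ranges, `*`, and step values, but not month or weekday names, and it requires every field to match rather than applying cron's special handling when both day-of-month and day-of-week are restricted. The datetime is assumed to already be expressed in the time zone chosen for the job.

```python
from datetime import datetime


def expand_field(field: str, low: int, high: int) -> set[int]:
    """Expand one cron field ('*', 'a', 'a-b', 'a,b', '*/n', 'a-b/n') into a set of ints."""
    values: set[int] = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step_n))
    return values


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by the cron expression."""
    minute, hour, dom, month, dow = expr.split()
    # cron numbers weekdays 0-6 starting on Sunday;
    # isoweekday() is 1-7 starting on Monday, so Sunday (7) maps to 0.
    cron_dow = when.isoweekday() % 7
    return (
        when.minute in expand_field(minute, 0, 59)
        and when.hour in expand_field(hour, 0, 23)
        and when.day in expand_field(dom, 1, 31)
        and when.month in expand_field(month, 1, 12)
        and cron_dow in expand_field(dow, 0, 6)
    )


if __name__ == "__main__":
    # 1 January 2024 was a Monday.
    print(cron_matches("30 9 * * 1,3,5", datetime(2024, 1, 1, 9, 30)))
    print(cron_matches("30 9 * * 1,3,5", datetime(2024, 1, 2, 9, 30)))
```

    The first call checks a Monday at 9:30 AM and the second a Tuesday at the same time, so only the first falls on the Monday, Wednesday, Friday schedule.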

    Data maintenance job details #

    All scheduled data maintenance jobs are listed in the Data maintenance pane beginning at the top details level.

    Shared elements in data maintenance #

    The header section includes a maintenance task legend, which explains the task icons:

    • compress Compaction
    • search_insights Profiling and statistics
    • deployed_code_history Snapshot expiration
    • vacuum Delete orphan files

    The following icons show the status of the data maintenance jobs:

    • hourglass_top Queued
    • check_circle Completed
    • error Failed
    • schedule Scheduled to run
    • sync Currently running

    The Last run status drop-down menu at the catalog and schema details levels lets you restrict the list to jobs that are scheduled, running, completed, or failed.

    The Maintenance task drop-down menu at the catalog and schema details levels lets you restrict the list to a single task type.

    The Search field at the catalog and schema details levels lets you restrict the list to the search criteria.

    On the catalog and schema details levels, a subdirectory_arrow_right arrow indicates that maintenance tasks for the entity are inherited from its parent schedule.

    The options menu on the catalog and schema details levels provides a play_circle Run now action, lets you edit the data maintenance job, and presents an option to create an independent data maintenance job for child entities.

    Top level details #

    The Maintenance, Schedules, and Errors tabs in the top level details pane provide high-level insight into your data maintenance jobs.

    The Maintenance tab displays the list of catalogs and the following information:

    • Catalog: The name of the catalog with defined data maintenance tasks.
    • Schemas with maintenance: The total number of schema-level jobs.
    • Tables with maintenance: The total number of table-level jobs.
    • Status: The status of the data maintenance job.
    • Tasks: The icons representing the tasks included in the data maintenance job.
    • Next run: The date and time that the data maintenance job is scheduled to run.

    The Schedules tab presents a quick look at all scheduled data maintenance jobs and provides the following information:

    • Maintenance target: The location of your data maintenance job, whether it is on the catalog, schema, or table level.
    • Tasks: The icons representing the tasks included in the data maintenance job.
    • Executing role: The specified executing role.
    • Cluster: The specified cluster.
    • Time zone: The chosen time zone.
    • Next run: The date and time that the data maintenance job is scheduled to run.

    The Errors tab provides the following information for the last 100 failed data maintenance jobs:

    • Maintenance target: The complete path of the data maintenance job.
    • Status: The status of the data maintenance job. For failed jobs, the Debug link opens a dialog with information about the failed task.
    • Tasks: The icons representing the tasks included in the data maintenance job.
    • Timestamp: The date, time, and time zone at which the error occurred.

    Catalog level details #

    To view catalog level details, click the name of a catalog from the top details level.

    As with other panes in Galaxy, the top row of this pane provides catalog > schema > table breadcrumbs to show which details level you are on. Click the names in the breadcrumb list to navigate among the levels.

    Catalog level details are organized in the following columns:

    • Location: The specified schema.
    • Status: The status of the data maintenance job.
    • Tasks: The icons representing the tasks included in the data maintenance job.
    • Last run: The date and time that the data maintenance job was last run.
    • Next run: The date and time that the data maintenance job is scheduled to run.

    Schema level details #

    To view schema level details, click the name of a schema from the catalog details level. The schema level details list can include individual tables or maintenance tasks set up to run for all tables in a schema.

    Schema level details are organized in the following columns:

    • Location: The tables with defined maintenance tasks.
    • Status: The status of the data maintenance job.
    • Tasks: The icons representing the tasks included in the data maintenance job.
    • Last run: The date and time that the data maintenance job was last run.
    • Schedule: The next scheduled run time.
    • The more_vert options menu.

    The Status and Tasks columns remain blank until a task has been executed at least once.

    Table level details #

    For more information on individual data maintenance jobs, click a table name at the schema details level.

    The title of the table details level task pane is the name of the table. The top portion of the pane provides a summary of the selected data maintenance job and a Run now button.

    The Task history section is organized in the following columns:

    • Query ID: The unique identifier for the statement. Click the Query ID to view Query details.
    • Task: The selected data maintenance task.
    • Run ID: The unique identifier for the task run.
    • Status: The status of the data maintenance job. For failed jobs, the Debug link opens a dialog with information about the failed task.
    • Started: When the data maintenance job started.
    • Elapsed time: The duration of the data maintenance job.

    Manage data maintenance jobs #

    Manage your data maintenance jobs on the catalog and schema details levels, and on the Schedules tab at the top details level.

    Edit data maintenance jobs #

    To make edits to a data maintenance job, follow these steps:

    • Click the more_vert options menu in the row, then select Edit.
    • Make changes, then click Save.

    Existing data maintenance jobs exclude tables that have separate maintenance schedules. To include these tables, delete the separate data maintenance job associated with them. After you delete that job, the previously excluded tables are covered by the broader data maintenance job.
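
    The exclusion rule described here can be pictured as "the most specific job wins". The following is a simplified model of that behavior, not Galaxy's implementation; the `Job` class, the `effective_job` function, and the example names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """A maintenance job scoped to a catalog, a schema, or a single table."""
    catalog: str
    schema: Optional[str] = None  # None means all schemas in the catalog
    table: Optional[str] = None   # None means all tables in the schema


def effective_job(jobs: list[Job], catalog: str, schema: str, table: str) -> Optional[Job]:
    """Pick the most specific job covering the table: table, then schema, then catalog."""
    candidates = [
        Job(catalog, schema, table),
        Job(catalog, schema),
        Job(catalog),
    ]
    for candidate in candidates:
        if candidate in jobs:
            return candidate
    return None


if __name__ == "__main__":
    schema_job = Job("lake", "sales")
    table_job = Job("lake", "sales", "orders")
    jobs = [schema_job, table_job]

    # "orders" has its own schedule, so the schema-wide job skips it.
    print(effective_job(jobs, "lake", "sales", "orders") == table_job)

    # After the separate job is deleted, "orders" falls back to the schema-wide job.
    jobs.remove(table_job)
    print(effective_job(jobs, "lake", "sales", "orders") == schema_job)
```

    In this model, deleting the table-level job is what brings the table back under the schema-wide schedule, which mirrors the steps described above.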

    Delete data maintenance jobs #

    To delete a data maintenance job, follow these steps:

    • Click the more_vert options menu in the row of the table of interest.
    • Select Delete job.
    • In the dialog, click Yes, delete.