Starburst Galaxy

  •  Get started

  •  Working with data

  •  Data engineering

  •  Developer tools

  •  Cluster administration

  •  Security and compliance

  •  Troubleshooting

  • Galaxy status

  •  Reference

  • Data maintenance #

    From the Starburst Galaxy navigation menu, select Data > Data maintenance.

    Data maintenance jobs run tasks that improve performance and reduce storage in Apache Iceberg tables. Supported tasks include: Compaction, Profile and statistics, Snapshot expiration, and Delete orphan files.

    To perform data maintenance operations on live tables, see data maintenance for Kafka streaming ingestion and data maintenance for file ingestion

    The Data maintenance pane has the following levels:

    • Top level, which shows a list of catalogs that include at least one maintenance task.
    • Catalog level. Click the name of a catalog to see the schemas with maintenance tasks.
    • Schema level. Click the name of a schema to see the tables with maintenance tasks.
    • Table level. Click the name of a table to see its defined maintenance tasks. From this level, you can run, edit, or delete a task and can view the task’s run history.

    Create data maintenance task #

    To schedule a data maintenance task, click Create maintenance task in the top, catalog, or schema levels.

    Provide the following information in the Configure data maintenance dialog:

    • In the Maintenance target section:
      • Specify a catalog and schema from the respective drop-down menus. If you opened this dialog from the catalog or schema levels. those fields are pre-selected.
      • Specify one or multiple tables to perform maintenance tasks on:
        • Select the All tables radio button to include all tables. This selection automatically applies maintenance tasks to all future Iceberg tables created as part of this schema. To include tables with separate schedules, see Edit data maintenance jobs.
        • Select the Select tables radio button to invoke the table drop-down menu. Expand the menu, and select one or multiple tables.
    • In the Maintenance tasks section, select at least one maintenance task:
    Maintenance task Description
    Compaction Improves performance by optimizing your data file size.
    Profiling and statistics Improves performance by analyzing the table and collecting statistics about your data.
    Snapshot expiration Reduces storage by deleting data snapshots.
    Delete orphan files Reduces storage by deleting orphaned data files.
    • In the Execution details section, select an executing role and a cluster from the Select cluster the respective drop-down menus.

    • In the Job schedule section:

      • Select a Time zone from the drop-down menu.
      • Choose the Select frequency or Enter cron expression recurring interval format.

      For Select frequency: Choose an hourly, daily, weekly, monthly, or annual schedule from the drop-down menu. The corresponding values depend on the schedule:

      • Hourly: Enter a value between 1 minute and 59 minutes.
      • Daily: Enter a time in the format hh:mm, then specify AM or PM.
      • Weekly: Enter a time in the format hh:mm, specify AM or PM, then select a day of the week.
      • Monthly: Enter a time in the format hh:mm, specify AM or PM, then select a date.
      • Annually: Enter a month, day, hour, and minutes in the format MM/DD hh:mm. Specify AM or PM.

      For Enter cron expression: Enter the desired schedule in the form of a UNIX cron expression. For example, a cycle scheduled to run weekly at 9:30 AM on Monday, Wednesday, and Friday:

    30 9 * * 1,3,5
    
    • Click Save.

    Data maintenance job details #

    All scheduled data maintenance jobs are listed in the Data maintenance pane beginning at the top level.

    As with other panes in Starburst Galaxy, the top row of this pane provides catalog-schema-table breadcrumbs to show which detail level you are on. Click names in the breadcrumb list to navigate among the levels.

    Top level details #

    The header for the top level shows the total number of catalogs with tasks, and provides a search bar and drop-down menus that let you customize which details appear in the list of catalogs.

    To view data maintenance jobs of a certain status, use the Last run status drop-down menu. Use the search bar to find catalogs.

    The list of catalogs has the following columns:

    • Catalog: The name of the catalog with defined maintenance tasks.
    • Schemas with maintenance: The total number of schema-level jobs.
    • Tables with maintenance: The total number of table-level jobs.

    Filters available #

    The Search field at the top, catalog, and schema levels lets you restrict the list below to matching values.

    The Last run status drop-down menu at the top, catalog, and schema levels lets your restrict the list to tasks that are scheduled, running, completed, or failed. The default setting is to show all statuses.

    The Maintenance task drop-down menu at the catalog and schema levels lets you restrict the list to one of the four task types. The default setting is to show all task types.

    Catalog level details #

    To view catalog level details, click the name of a catalog from the top level.

    Catalog level details are organized in the following columns:

    • Location: The specified schema.
    • Tasks: The key explains the task type symbols:
      • Compaction compress
      • Profile and statistics search_insights
      • Snapshot expiration deployed_code_history
      • Delete orphan files vacuum
    • Last run: The date and time the data maintenance job was last run.
    • Schedule: The next scheduled run time.

    Schema level details #

    To view schema level details, click the name of a schema from the catalog level. The schema level list can include individual tables or maintenance tasks set up to run for all tables in a schema.

    Schema level details have the following columns:

    • Location: The tables with defined maintenance tasks.
    • Status: An icon showing status of the data maintenance job:
      • check_circle Success
      • error Failed
      • schedule Scheduled to run
      • sync Currently running
    • Tasks: The key explains the task symbols:
      • Compaction compress
      • Profile and statistics search_insights
      • Snapshot expiration deployed_code_history
      • Delete orphan files vacuum
    • Last run: The date and time the data maintenance job was last run.
    • Schedule: The next scheduled run time.
    • The more_vertOptions menu.

    The Status and Tasks columns are empty until a task’s first run.

    Use the options menu to edit the task or to run it now without waiting for its start time.

    Table level details #

    For more information on individual data maintenance jobs, click a table name from the schema level.

    The title of table level task panes is the name of the table. The top portion at the table level provides a summary of the selected data maintenance job, a Run now button, and an options menu that allows you to edit the task.

    The Task history section is organized in the following columns:

    • Query ID: The unique identifier for the statement. Click the Query ID to view Query details.
    • Task: The selected data maintenance task.
    • Run ID: The unique identifier for the task run.
    • Status: The status of the data maintenance job. A green check check indicates a successful maintenance task run. A circled error exclamation mark indicates an error condition; the Debug link opens a dialog with information about the failed task.
    • Started: When the data maintenance job started.
    • Elapsed time: The duration of data maintenance job.

    Manage data maintenance jobs #

    All editing is performed at the schema and table levels. The schema level allows for bulk and individual edits, while the table level allows for individual task edits only.

    Edit data maintenance jobs #

    To make bulk edits, go to the schema level, and follow these steps:

    • Click themore_vertoptions menu in the header.
    • Click Edit tasks.
    • Click the top checkbox to select all tables or click the checkboxes beside the tables of interest to select individually.
    • Click Bulk editto invoke the Edit data maintenance dialog.
    • Make changes, then click Save.

    To make individual edits:

    • At the schema level:
      • Select a table name from list of tables.
      • Click themore_vertoptions menu in the row, then select Edit.
      • Make changes, then click Save.
    • At the table level:
      • Click themore_vertoptions menu in the header, then select Edit.
      • Make changes, then click Save.

    Existing data maintenance jobs exclude tables with separate maintenance schedules. To include these tables, delete the data maintenance job associated with the table. The tables are automatically included in the data maintenance job as part of the schema.

    Delete data maintenance jobs #

    To delete tasks:

    • At the schema level:
      • Click themore_vertoptions menu in the row.
      • Select Delete tasks, the click Yes, delete.
    • At the table level:
      • Click themore_vertoptions menu in the header, then select Edit.
      • Select Delete tasks, the click Yes, delete.