Starburst Galaxy

    Automatic data classification

    In Starburst Galaxy, you can use data classifier jobs to automatically classify data in catalogs, schemas, tables, and views.

    When used in conjunction with tags and policies, classification provides an automated way to perform governance on your data.

    Data classification jobs analyze the data and metadata of your catalogs, schemas, and tables, and propose tags on columns. Administrators choose whether to accept or reject each proposed tag, and can change the color or name of the proposed tag.

    Get started

    A role in the user’s active role set must have the account-level privilege Manage Security in order to create, update, view, or delete classification jobs.

    The classifier job queries data on a cluster in the account using a role that the user specifies. This role must be in the user’s active role set. Because queries execute on the cluster, the role must have the Use Cluster privilege on that cluster. To suggest proposed tags, the role must also have at least one of the Create Tag or Apply Tag privileges. Only data for which the role has a SELECT grant is analyzed.
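The prerequisites above can be summarized as a pair of checks. The following Python sketch is illustrative only: the `Role` class, its fields, and both functions are hypothetical names used to model the rules, and are not part of any Starburst Galaxy API. It also assumes, for simplicity, that tag privileges can be modeled as a flat set on the role.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical model of a role and its privileges (not a Galaxy API)."""
    name: str
    # Assumption: non-cluster privileges are modeled as a flat set of names.
    privileges: set = field(default_factory=set)
    # Maps a cluster name to the set of privileges held on that cluster.
    cluster_privileges: dict = field(default_factory=dict)


def can_manage_jobs(active_roles):
    """Creating, updating, viewing, or deleting jobs needs Manage Security
    on at least one role in the active role set."""
    return any("Manage Security" in r.privileges for r in active_roles)


def can_execute_job(role, active_roles, cluster):
    """The execution role must be in the active role set, hold Use Cluster
    on the target cluster, and hold Create Tag or Apply Tag."""
    in_active_set = any(r.name == role.name for r in active_roles)
    uses_cluster = "Use Cluster" in role.cluster_privileges.get(cluster, set())
    can_tag = bool({"Create Tag", "Apply Tag"} & role.privileges)
    return in_active_set and uses_cluster and can_tag
```

In this model, a role that passes both checks still only classifies the data on which it has a SELECT grant; that filtering is not shown here.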

    Create a data classifier job

    There are two methods to create data classifier jobs.

    When viewing a catalog, schema, table or view from the catalog explorer, click Auto Tag, then Create a classifier. A dialog opens where you can configure the classifier job.

    You can also create data classifier jobs by selecting Access control > Data classifier jobs from the navigation menu. From this pane, you can edit or delete classifiers, and accept or reject the suggested tags from previous jobs. To start a new run of an existing classifier job, select the classifier and click Run now.

    Data classification job components

    A classification job has the following characteristics:

    Attribute | Description
    Name and description | The name and description given to the data classifier job.
    Cluster | The cluster on which the classifier job executes queries to sample the data. All catalogs, schemas, and tables to classify must be attached to the cluster. All queries the classifier job executes are recorded in query history.
    Execution role | The role that executes queries on the cluster. It must be a role in the user's active role set. Classification occurs only on columns on which the role has SELECT privileges.
    Catalogs, schemas, and tables | The catalogs, schemas, and tables to be classified. Multiple catalogs, schemas, or tables can be chosen. The more tables a job contains, the longer it takes to run.
    Classifiers | The groups of data to check for. At least one classifier must be selected.
    Schedule | Optionally, run classification jobs on a schedule. Choose a time zone from the drop-down menu, then select a frequency or enter a cron expression to set a schedule for the classification job to run. Select Execute immediately to run the classification job immediately.
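To make the components concrete, the following Python sketch models a job definition and checks the one hard requirement stated above (at least one classifier). It is not a Galaxy API: `ClassifierJob`, its field names, and `validate` are hypothetical. The five-field cron format is also an assumption, since the exact cron dialect accepted by the scheduler is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifierJob:
    """Hypothetical representation of the job attributes listed above."""
    name: str
    cluster: str
    execution_role: str
    targets: list                      # e.g. ["sales.public.customers"]
    classifiers: list                  # e.g. ["PII", "LOCATION"]
    description: str = ""
    time_zone: Optional[str] = None    # e.g. "UTC"
    cron: Optional[str] = None         # e.g. "0 2 * * *"
    execute_immediately: bool = False


def validate(job):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if not job.classifiers:
        problems.append("at least one classifier must be selected")
    # Assumption: a standard five-field cron expression
    # (minute, hour, day of month, month, day of week).
    if job.cron is not None and len(job.cron.split()) != 5:
        problems.append("cron expression should have five fields")
    return problems
```

For example, a job with `cron="0 2 * * *"` and `time_zone="UTC"` would, under the five-field assumption, describe a run at 02:00 UTC every day.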

    View, accept, or discard proposed tags

    The classification job recommends tags as it comes across a table or column that could fit a requested category. Tags may be recommended while the job is still executing.

    Follow these steps to accept or reject a proposed tag:

    1. In the catalog explorer, click Auto tag on an entity on which a classifier job has run.
    2. From the resulting dialog, select the checkbox next to tags you would like to accept or reject. Alternatively, click the add icon next to a suggested tag name to open a drop-down menu where you can select additional, previously created tags to apply to the entity.
    3. Click the corresponding button to apply or discard the selected tags. Clicking Apply selected tags on a proposed tag creates the tag if it does not already exist and applies the tag to the column attached to it. Clicking Discard selected tags removes the proposed tag from the suggested tags list. If a future classifier job run proposes the same tag again, the proposal is not shown.
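The apply and discard behavior described in these steps can be modeled with a small in-memory sketch. The `TagStore` class and its method names are hypothetical and exist only to illustrate the semantics; they do not correspond to a Galaxy API. The sketch also assumes that an applied proposal leaves the pending list, which the steps above do not state explicitly.

```python
class TagStore:
    """Hypothetical in-memory model of proposed, applied, and discarded tags."""

    def __init__(self):
        self.tags = set()        # tags that exist in the account
        self.applied = {}        # column name -> set of applied tags
        self.discarded = set()   # (column, tag) proposals that were discarded
        self.suggested = []      # pending (column, tag) proposals

    def propose(self, column, tag):
        """Record a proposal unless it is already pending or was discarded."""
        key = (column, tag)
        if key in self.discarded or key in self.suggested:
            return
        self.suggested.append(key)

    def apply(self, column, tag):
        """Create the tag if it does not exist, then apply it to the column."""
        self.tags.add(tag)
        self.applied.setdefault(column, set()).add(tag)
        # Assumption: an applied proposal is no longer listed as pending.
        if (column, tag) in self.suggested:
            self.suggested.remove((column, tag))

    def discard(self, column, tag):
        """Remove the proposal and suppress it on future job runs."""
        key = (column, tag)
        if key in self.suggested:
            self.suggested.remove(key)
        self.discarded.add(key)
```

In this model, calling `propose` again for a discarded pair is a no-op, which mirrors the rule that repeated proposals of a discarded tag are not shown.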

    Alternatively, navigate to Access control > Data classifier jobs and click View results for a specific job in the list of all data classifier jobs. This opens the dialog of suggested tags.

    Supported classification categories

    Classifier Group | Data Category | Default Tag
    PII | E-Mail Address | pii.email
    PII | Full Name | pii.full_name
    PII | First Name | pii.first_name
    PII | Last Name | pii.last_name
    PII | Phone Number | pii.phone_number
    PII | Street Address | pii.address
    PII | Social Security Number (SSN) | pii.us_ssn
    PII | Individual Taxpayer Identification Number (ITIN) | pii.us_itin
    PII | Preparer Taxpayer Identification Number (PTIN) | pii.us_ptin
    PII | Adoption Taxpayer Identification Number (ATIN) | pii.us_atin
    PII | Passport Number | pii.passport
    PII | International Mobile Equipment Identifier (IMEI) | pii.imei
    PII | IP Address | pii.ip_address
    PII | MAC Address | pii.mac_address
    PII | URL | pii.url
    PII | International Bank Account Number (IBAN) | pii.iban
    PII | US Bank Account Number | pii.us_bank_num
    PII | US Drivers License Number | pii.us_driver_num
    PII | UK National Health Service Number (NHS) | pii.uk_nhs_num
    PII | UK Drivers License Number | pii.uk_driver_num
    LOCATION | Street Address | pii.address
    LOCATION | ZIP Code | pii.zip_code
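If you need to reason about which default tags a job may propose, the table above can be expressed as a lookup. The Python sketch below copies the categories and default tags from the table; the `DEFAULT_TAGS` dictionary and the `default_tags` helper are hypothetical names for illustration, not something Galaxy exposes.

```python
# Classifier group -> {data category: default tag}, copied from the table above.
DEFAULT_TAGS = {
    "PII": {
        "E-Mail Address": "pii.email",
        "Full Name": "pii.full_name",
        "First Name": "pii.first_name",
        "Last Name": "pii.last_name",
        "Phone Number": "pii.phone_number",
        "Street Address": "pii.address",
        "Social Security Number (SSN)": "pii.us_ssn",
        "Individual Taxpayer Identification Number (ITIN)": "pii.us_itin",
        "Preparer Taxpayer Identification Number (PTIN)": "pii.us_ptin",
        "Adoption Taxpayer Identification Number (ATIN)": "pii.us_atin",
        "Passport Number": "pii.passport",
        "International Mobile Equipment Identifier (IMEI)": "pii.imei",
        "IP Address": "pii.ip_address",
        "MAC Address": "pii.mac_address",
        "URL": "pii.url",
        "International Bank Account Number (IBAN)": "pii.iban",
        "US Bank Account Number": "pii.us_bank_num",
        "US Drivers License Number": "pii.us_driver_num",
        "UK National Health Service Number (NHS)": "pii.uk_nhs_num",
        "UK Drivers License Number": "pii.uk_driver_num",
    },
    "LOCATION": {
        "Street Address": "pii.address",
        "ZIP Code": "pii.zip_code",
    },
}


def default_tags(groups):
    """Return the set of default tags the given classifier groups can propose."""
    tags = set()
    for group in groups:
        tags.update(DEFAULT_TAGS[group].values())
    return tags
```

Because Street Address appears in both groups with the same default tag, selecting both PII and LOCATION yields one `pii.address` entry rather than two.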