Starburst Galaxy

    PyStarburst

    The PyStarburst library implements the standard Python DataFrame API, which uses a data structure called a DataFrame to analyze and manipulate two-dimensional data. Use PyStarburst to query and transform data in Starburst Galaxy clusters in a data pipeline using Python syntax.

    With PyStarburst, you can create complex transformation pipelines, build data apps, and interact with data using Python without moving data to the system where your application code runs.

    PyStarburst provides familiar syntax for writing and running production-grade ETL pipelines and data transformations. You can use it both to build new pipelines and to migrate existing PySpark or Snowpark workloads to Starburst Galaxy.

    For additional Python support in Starburst products, visit the Python clients page.

    Install the library

    To install PyStarburst and its dependencies, run the following pip command from your command prompt:

    pip install https://starburstdata-downloads.s3.us-east-2.amazonaws.com/pystarburst/0.5.0/pystarburst-0.5.0-py3-none-any.whl
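
    To confirm that the install succeeded, you can check the package metadata from Python. The following is a minimal sketch that uses only the standard library. The helper name installed_version is hypothetical, and the sketch assumes the distribution is registered as pystarburst, matching the wheel file name above:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pystarburst" is assumed to be the distribution name, based on the wheel file name.
    found = installed_version("pystarburst")
    if found is None:
        print("pystarburst is not installed in this environment")
    else:
        print(f"pystarburst {found} is installed")
```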
    

    Connect to your cluster

    Use your preferred local development environment to connect to a Starburst Galaxy cluster. Establish a session using the same connection parameters you use to log in to Starburst Galaxy.

    Specify these settings in a dictionary that associates parameter names with values. Then pass this dictionary to the Session.builder.configs method and call the create method to establish your session:

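    import trino
    from pystarburst import Session
    
    # Replace each <placeholder> with the value for your cluster.
    db_parameters = {
        "host": "<host>",
        "port": <port>,  # an integer, not a quoted string
        "http_scheme": "https",
        "catalog": "sample",
        "schema": "burstbank",
        "auth": trino.auth.BasicAuthentication("<user>", "<password>"),
    }
    # Build the session from the parameter dictionary.
    session = Session.builder.configs(db_parameters).create()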
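
    To keep credentials out of source code, you can assemble the same dictionary from environment variables. The following is a sketch, not a prescribed method. The variable names GALAXY_HOST, GALAXY_PORT, GALAXY_USER, and GALAXY_PASSWORD and the helper names are hypothetical, and the catalog and schema defaults are taken from the example above:

```python
import os


def build_db_parameters(env):
    """Build the non-secret connection settings from a mapping such as os.environ."""
    return {
        "host": env["GALAXY_HOST"],          # hypothetical variable name
        "port": int(env["GALAXY_PORT"]),     # convert to the integer the session expects
        "http_scheme": "https",
        "catalog": env.get("GALAXY_CATALOG", "sample"),
        "schema": env.get("GALAXY_SCHEMA", "burstbank"),
    }


def create_session(env=os.environ):
    """Create a PyStarburst session using credentials read from the environment."""
    # Imported here so that build_db_parameters can be used without these packages.
    import trino
    from pystarburst import Session

    db_parameters = build_db_parameters(env)
    db_parameters["auth"] = trino.auth.BasicAuthentication(
        env["GALAXY_USER"], env["GALAXY_PASSWORD"]
    )
    return Session.builder.configs(db_parameters).create()
```

    With the variables set in your shell, calling create_session() returns a session equivalent to the one built from the literal dictionary.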
    

    To determine the values for the connection parameters host, port, and user:

    1. Open Partner connect in the Starburst Galaxy navigation menu.
    2. Click the PyStarburst tile in the Drivers and clients section.
    3. From the Select cluster drop-down menu, select the cluster of interest.
    4. Copy the values from the User, Host, and Port fields.

    PyStarburst API reference

    After you have established a connection with a cluster, use Python to construct DataFrames and query tables. PyStarburst has a number of methods to perform DataFrame operations on your data.

    View technical documentation for PyStarburst’s API methods at: https://pystarburst.eng.starburstdata.net/.
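
    As a rough illustration of constructing DataFrames and querying tables, the sketch below wraps two common operations in helper functions. It assumes that the session object exposes table and sql methods and that the resulting DataFrames expose limit and collect, in the Snowpark-style API that PyStarburst follows. Confirm the exact method names in the API reference before relying on them. The helper names preview_table and run_query are hypothetical:

```python
def preview_table(session, table_name, row_limit=5):
    """Return up to row_limit rows from table_name."""
    # Assumed API: session.table() returns a DataFrame bound to the table.
    df = session.table(table_name)
    # Assumed API: limit() restricts the row count, collect() fetches the rows.
    return df.limit(row_limit).collect()


def run_query(session, statement):
    """Run a SQL statement on the cluster and return the resulting rows."""
    # Assumed API: session.sql() returns a DataFrame for the statement.
    return session.sql(statement).collect()
```

    With a session created as shown earlier, run_query(session, "SELECT 1 AS b") runs on the cluster and fetches only the result rows, while preview_table(session, "<table>") fetches the first few rows of a table in the session's catalog and schema.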

    Example Jupyter notebook

    Try PyStarburst with the example Jupyter notebook in the starburstdata/pystarburst-demo GitHub repository.