Starburst Galaxy


AI functions #

    AI functions let you invoke large language models (LLMs) to perform a variety of text-based tasks or construct retrieval-augmented generation (RAG) workflows in your data applications. You can perform vector searches using embeddings generated from data stored in your data lake, and then use those search results as input to prompt an LLM.

    Starburst Galaxy supports a number of task-, prompt-, and embedding-specific functions. To use the AI functions, ensure that you have configured an AI model and that the role you assume has been granted the required privileges to use the model and to execute each specific function.

    The AI functions are available in the default starburst catalog. This catalog is created automatically when a cluster starts, for both new and existing clusters, and is enabled each time the cluster starts thereafter. You do not need to add the catalog to a cluster. The catalog is visible in the query editor and catalog explorer provided that the role you assume has either the Manage security privilege or any privilege on any entity within the catalog, such as the Execute function privilege on any of the AI functions, or a SELECT grant on any table, column, or wildcard in the schema.

    The functions use the starburst.ai catalog and schema prefix. In the ai schema, there are two tables that list the available embedding and language models, embedding_models and language_models, respectively.
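    For example, you can inspect the configured models directly from the query editor. This is a minimal sketch; the exact columns returned depend on your Galaxy deployment:

    ```sql
    -- List the language models available to your role
    SELECT * FROM starburst.ai.language_models;

    -- List the embedding models available to your role
    SELECT * FROM starburst.ai.embedding_models;
    ```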

    Embedding function #

    The following function generates embeddings for input text using the given embedding model. Embeddings are numerical vector representations of text that can be used for semantic similarity and clustering tasks.

    To use this function, ensure that the role you assume has the Execute generate_embedding table function privilege at the starburst catalog level.

    • generate_embedding(text, model)

    Generate the embedding for the given text.

    SELECT generate_embedding('Which chapter should I read to understand how to
    balance the weight of a Boeing 747?', 'bedrock_titan');
    -- [ 0.0061195422895252705, 0.013783863745629787, ...]
    

    Prompt function #

    The following function generates text or responses based on the input prompt using the given language model.

    To use this function, ensure that the role you assume has the Execute prompt table function privilege at the starburst catalog level.

    • prompt(text, model) → varchar

    Generates text based on the input prompt.

    SELECT prompt('What is the capital of the USA? Only provide the name of the
    capital.', 'bedrock_claude35');
    -- Washington, D.C.
    

    Task functions #

    The following functions perform specific tasks such as sentiment analysis, classification, grammar correction, masking, and translation using the given language model.

    To use these functions, ensure that the role you assume has the corresponding privilege at the starburst catalog level: Execute analyze_sentiment table function, Execute classify table function, Execute fix_grammar table function, Execute mask table function, or Execute translate table function.

    • analyze_sentiment(text, model) → varchar

    Analyzes the sentiment of the input text.

    The sentiment result is positive, negative, neutral, or mixed.

    SELECT analyze_sentiment('I love Starburst', 'bedrock_claude35');
    -- positive
    
    • classify(text, labels, model) → varchar

    Classifies the input text according to the provided labels.

    SELECT classify('Buy now!', ARRAY['spam', 'not spam'], 'bedrock_claude35');
    -- spam
    
    • fix_grammar(text, model) → varchar

    Corrects grammatical errors in the input text.

    SELECT fix_grammar('I are happy? What you doing.', 'bedrock_llama32-3b');
    -- I am happy. What are you doing?
    
    • mask(text, labels, model) → varchar

    Masks the values for the provided labels in the input text by replacing them with the text [MASKED].

    SELECT mask(
        'Contact me at 555-1234 or visit us at 123 Main St.',
        ARRAY['phone', 'address'], 'openai_gpt-4o-mini');
    -- Contact me at [MASKED] or visit us at [MASKED].
    
    • translate(text, language, model) → varchar

    Translates the input text to the specified language.

    SELECT translate('I like coffee', 'es', 'openai_gpt-4o-mini');
    -- Me gusta el café
    
    SELECT translate('I like coffee', 'zh-TW', 'bedrock_claude35');
    -- 我喜歡咖啡
    

    Use cases #

    You can combine vector search functions, Common Table Expressions (CTEs), and AI functions to build a RAG workflow that enhances your results.

    The following example defines a CTE named vector_search that uses the vector search function cosine_similarity() to retrieve the five most similar documents, then asks the LLM to select the best chapter.

    The map_agg function aggregates the results from the vector_search CTE into a key-value map, which is explicitly cast to JSON. The prompt function then sends the question and the JSON-formatted results to the LLM.

    -- Retrieve the top 5 most relevant chapters based on semantic similarity
    WITH vector_search AS(
        SELECT
            book_title,
            chapter_number,
            chapter_title,
            chapter_intro,
            cosine_similarity(
                generate_embedding('Which chapter should I read to understand how to
                balance the weight of a Boeing 747?', 'bedrock_titan'),
                chapter_intro_embeddings) AS similarity_score
    FROM iceberg."example-schema".faa_book_chapters
        ORDER BY similarity_score DESC
        LIMIT 5
    ),
    
    -- Augment the results by converting them into a JSON object
    json_results AS (
    SELECT CAST(map_agg(chapter_number, json_object(
        KEY 'book title' VALUE book_title,
        KEY 'chapter title' VALUE chapter_title,
        KEY 'chapter intro' VALUE chapter_intro)) AS JSON) AS json_data
    FROM
        vector_search
    )
    
    -- Generate an augmented response using the LLM
    SELECT
        prompt(concat(
            'Which chapter should I read to understand how to balance the weight of
            a Boeing 747? Explain why. The list of chapters is provided in JSON:',
            json_format(json_data)), 'bedrock_claude35')
    FROM json_results;
    

    The output of the query:

    Based on the provided JSON, I recommend reading Chapter 9: "Weight and Balance
    Control - Commuter Category and Large Aircraft" for understanding how to balance
    the weight of a Boeing 747.
    
    Here's why:
    
    1. The Boeing 747 is a large aircraft that exceeds 12,500 pounds in takeoff
       weight, which is specifically addressed in Chapter 9's introduction.
    
    ...
    

    Vector search functions are often used for comparing the similarity or distance between embedding vectors, which is common for search and retrieval tasks. These functions can assist in determining how closely related two pieces of data are.

    Read more about the following vector search functions:

    • Cosine Similarity
    • Dot Product
    • Euclidean Distance
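    As a minimal sketch of how these compose with the embedding function, assuming the bedrock_titan embedding model from the earlier examples is configured, you can compare two short texts directly:

    ```sql
    -- Score ranges from -1 to 1; higher values mean closer semantic similarity
    SELECT cosine_similarity(
        generate_embedding('I like coffee', 'bedrock_titan'),
        generate_embedding('I enjoy espresso', 'bedrock_titan')) AS similarity_score;
    ```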

    Prompt overrides #

    Prompt overrides let you modify the system or function prompt for task-based functions, giving you control over how the language model behaves and responds. Starburst recommends that you define the desired behavior in a system prompt, also known as a developer prompt, rather than in the user message.

    Example #

    Suppose the default prompt for a classification function is verbose, and you want to classify an email as either "spam" or "not spam". In this example, the input value is the email content, such as "Buy now and save 50%!", and the classifiers refer to the available classifications, which are ["spam", "not spam"].

    If the model outputs both a classification and an explanation, you can use a prompt override to refine the response.

    Classify "Buy now and save 50%!" as one of these: ["spam", "not spam"]
    
    You are a classification expert. Respond with exactly one of the provided
    labels and nothing else. The user will input the text to be classified and
    you will return only the matching label.
    

    Add to new model #

    You can add a prompt override for a supported function in the Advanced configuration options section of the Connect to an external model configuration dialog.

    Edit existing model #

    Follow these steps to edit an existing model and add a prompt override:

    1. In Galaxy, click the AI models icon in the navigation menu. The available AI models appear in a list or grid view.
    2. Click the options ︙ menu of the desired AI model.
    3. Click Edit connection.
    4. In the Advanced configuration options section, click Add prompt override.
    5. Select the function you would like to add the override to, and specify the parameters. You can add overrides to multiple functions at the same time.
    6. Click Save.

    General system prompts #

    General system prompts let you customize and improve the output of your language model by providing instructions or messages that define its behavior, role, tone, or context before any interaction takes place. These prompts act as a guiding framework for the system, shaping how it responds to your queries and conversations.

    Example #

    You can define multiple system prompts, each tailored to specific organizational requirements. For example, different departments can establish their own guardrails:

    • HR could add a prompt to ensure the model does not use or generate offensive language.
    • Security could add a prompt to prevent the model from revealing any personally identifiable information (PII).
    • Legal might define prompts to avoid generating content that could be interpreted as legal advice.

    HR prompt:
    Never use or generate offensive or discriminatory language.

    Security prompt:
    Do not reveal any personally identifiable information (PII), such as names, email addresses, or account numbers.

    Legal prompt:
    Do not provide legal advice or interpret laws.
    

    Add to new model #

    You can add a general system prompt to a supported or compatible LLM in the Advanced configuration options section of the Connect to an external model configuration dialog.

    Edit existing model #

    Follow these steps to edit an existing model and add a general system prompt:

    1. In Galaxy, click the AI models icon in the navigation menu. The available AI models appear in a list or grid view.
    2. Click the options ︙ menu of the desired AI model.
    3. Click Edit connection.
    4. In the Advanced configuration options section, click Add general system prompts.
    5. Enter the general system prompt. You can add multiple system prompts at the same time.
    6. Click Save.