Starburst Internal Reference

    Personas used for documentation and design #

    There are three primary audiences we write and design for:

    • Starburst platform administrators
    • Data engineers
    • Data consumers: analysts & scientists

    There is a single, secondary audience: “Data leaders.” This persona is largely a check writer for our purposes (VP, CDO, CIO), and does not actually use SEP any differently than the primary personas. They are included here for completeness, and should be kept in mind when creating marketing content such as case studies, white papers and ROI-focused materials.

    This document describes these audiences as personas - fictional amalgamations - that you can empathize with and solve problems on behalf of. In practice, product personas are given names, faces and backgrounds to aid in discussing their needs as if they were real people, representative of our customers.

    Primary personas #

    Data consumers - analysts and scientists #

    Data analysts and scientists will approach Starburst in very similar ways. However, their backgrounds and skill sets are different, so we will separate those out. Their pain points will be treated together.

    Chris Consumer
    Early career
    BSc in physics from University of Toledo, currently working on online MBA

    Chris is a business analyst. He's responsible for delivering visualizations and reports to ensure that his leadership is making well-informed, data-driven decisions. Chris cares very deeply that not only are the right questions being asked (and answered), but that the right data is being used to answer them. With the wealth of data available, it can be easy to overlook or misuse data. The quality of Chris's work ultimately rests on the quality and reliability of the data he uses, so Chris keeps good working relationships with his data engineering team and often communicates discrepancies and SLA issues to them. Chris has some solid SQL chops and is often able to prototype a new data source to be productionized by data engineers. Chris feels that he has just the right combination of technical skills and business acumen.

    When you write, design and build for Chris, here are some of the skill sets you can expect him to have:

    • A reasonable level of skill with SQL, with some knowledge of more advanced queries
    • Limited programming skills and methodologies
    • Tells stories with data
    • Expert with data visualization tools
    • Excellent spreadsheet skills, including some modeling
    • A good ability to detect and articulate issues with data, even if he cannot remedy them or trace the cause
    Cameron Consumer
    Early career
    PhD in statistics, Stanford

    Cameron is a data scientist. She's responsible for creating data models that forecast and describe the business. Cameron worries about the impact of seasonality on sales, and feels compelled to deliver models that reflect that impact with a high degree of accuracy. Cameron feels like she brings the answers to "Why?" and "How?" to the table. Her machine learning models help her find the levers that the business can pull (the "how"), and her models account for why the business behaved as it did, or will. She feels more like an academic than an engineer, and is very proud of her scientific approach to business. Her digital sales data knowledge is formidable, and her reputation as an SME ensures that she has a robust stream of opportunities in her field.

    When you write, design and build for Cameron, here are some of the skill sets you can expect her to have:

    • A reasonable level of SQL skills, with some knowledge of more advanced queries
    • Reasonable programming skills
    • Expert in statistical methods and/or machine learning
    • Some understanding of code repositories
    • Competence with data visualization tools
    • A good ability to detect and articulate issues with data, even if she cannot remedy them or trace the cause

    Cameron’s and Chris’s pain points include, in no particular order:

    • Having to retrofit tools onto multiple data sources
    • Long waits for ETL to deliver usable data
    • Being unable to dig into data quality issues
    • Having their queries killed by data engineers because of resource contention
    • Complex, periodic reports and models are often delayed past due dates

    Data engineers #

    Donna Data Engineer
    BSc in computer engineering, University of Illinois at Chicago

    Donna Data Engineer is responsible for designing performant data sources that can answer a broad range of business questions at XYZ, Inc. Donna found her way to data engineering through internships in college; it felt like a good blend of the technical chops required for programming jobs and the big-picture, organizational nature of data that she is naturally drawn to. As part of her job, she must understand what data is currently available from what sources, and what new data is needed to fill in any gaps. Donna has to work with stakeholders to source that new data, be it from third parties or through new log entries, message streams or product endpoints. Donna works pretty closely with data analysts and scientists, and tries to anticipate their needs in order to keep up with burgeoning data demands.

    Daniel Data Engineer
    BSc in computer science, University of New Hampshire

    Daniel Data Engineer is responsible for delivering data to data analysts and data scientists at Acme Corp. Up until a few years ago, this mostly entailed writing complex ETL in tools such as Informatica and Alteryx. Over the last few years, he's worked mostly in Python-based frameworks such as Airflow and Bonobo, as well as diving into Apache Spark. Daniel really cares about data landing times, because he and his coworkers hear from PagerDuty way more than they would like to.

    When you write, design and build for Donna and Daniel, here are some of the skill sets you can expect them to have:

    • Creating and monitoring pipeline health metrics to ensure SLAs are met
    • Enabling automated self-service pipelines using Infrastructure as Code (IaC)
    • Designing schemas, data lake and data warehouse solutions in collaboration with stakeholders
    • Building and managing Kafka-based streaming data pipelines
    • Building and managing Airflow- and Spark-based ETLs
    • Creating and updating data models & data schemas that reduce system complexity and cost, and increase efficiency
    • Preparing and cleaning data for prescriptive and predictive modeling and descriptive analytics
    • Identifying, designing, and implementing internal process improvements, such as automating manual processes and optimizing data delivery for greater scalability
    • Creating data tools for analysts and data scientists
    • Building data integrations between various 3rd party systems such as Salesforce and Workday

    Donna’s and Daniel’s pain points, in no particular order:

    • Keeping up with the changing landscape of data delivery technology
    • Managing SLAs for data pipelines, and for pipeline and platform performance, in environments where data growth rate and complexity constantly increase
    • Aligning and negotiating with upstream data sources and infrastructure SLA owners
    • Sussing out detailed data requirements from folks with a wide range of data knowledge
    • Long, brittle pipelines
    • Productionalizing non-performant analyst queries
    • Constantly responding to resource constraint issues
    • Designing ETL around siloed data
    • Data cleansing

    Platform administrators #

    Art Administrator
    Late career
    BSc in computer science, BYU

    Art Administrator is responsible for XYZ, Inc.'s Starburst cluster. He was an SRE for the data team for years, and switched roles to platform engineering after leading the SREs for a bit. Art really cares about scalability and reliability, especially since XYZ has super aggressive SLAs on both data landing times and, of course, availability. Art works closely with his colleagues in IT to ensure that his systems adhere to XYZ's strict access policies and support audit requirements.

    Ada Administrator
    Late career
    BSc in computer science, University of Washington

    Ada Administrator is responsible for both Acme Corp's Starburst and Postgres clusters. Ada was a DBA from early to mid-career, and it fell to her at Acme to figure out the HDFS ecosystem when it came along. Now she builds and maintains big data clusters for a living. Ada cares a lot about using the right data platform for the data.

    When you write, design and build for Art and Ada, here are some of the skill sets you can expect them to have:

    • Building and maintaining scalable data platform architectures to support the ingest, storage and querying of large heterogeneous datasets
    • Creating and monitoring cluster health metrics to ensure optimal performance and reduce any downtime
    • Writing clean, production-ready code (in Java, Go etc.) with a strong focus on quality, scalability and high performance
    • Using and building scalable asynchronous REST APIs
    • Working with cloud providers like AWS, Azure and Google Cloud
    • Implementing and working with persistence technologies like AWS S3, HDFS, Kafka and Elasticsearch
    • Designing for data integrity and security through all environments as well as the data lifecycle
    • Partnering with data engineers to enable automated self-service pipelines using Infrastructure as Code (IaC)
    • Partnering with data engineers to design and improve schemas, data lake and data warehouse solutions in collaboration with stakeholders

    Art’s & Ada’s pain points, in no particular order:

    • Sorting through an overload of information to master complex data platforms
    • Ensuring data platforms can scale to demand and with growth
    • Architecting solutions that can provide disaster recovery and business continuity for complex, critical data systems, in conjunction with IT stakeholders
    • Assisting in managing budgets and licensing cycles for massive enterprise-scale software vendors, bandwidth and hardware leases
    • Constantly tackling inherently complex and highly visible tasks
    • Delivering against stringent infrastructure SLAs
    • Doing more with less, or at least the same team size
    • Implementing data governance requirements for all data systems

    Secondary persona - data leader #

    Lauren Leader
    Mid-to-late career
    MBA, Haas School of Business

    Lauren is CIO at the newly IPO'ed Clouds 'R Us. She's responsible for data infrastructure, data governance and delivery, as well as enabling SOX, GDPR and CCPA compliance. Prior to stepping into her current role, Lauren was a VP of IT at Acme Corp., where she owned the budget for all data infrastructure. She calls this her "real-life MBA," because she learned the hard way from being caught off-guard by explosive growth in under-specified legacy systems in multiple budget cycles. Lauren is also sensitive to scaling, platform lock-in, and staffing around particular technologies.

    When you write for Lauren, here are some of her pain points to keep in mind, in no particular order:

    • Constantly fighting Shadow IT, up to and including small, narrow-scope one-off data warehouse solutions which she inevitably must absorb
    • Architecting around legacy systems, particularly monolithic services
    • Changing regulatory climate
    • Staffing for innovation while keeping legacy systems running and trying to automate
    • Managing, defending and demanding a budget amid rapid growth
    • Balancing buy-vs-build, including for contracting services
    • Balancing private cloud vs hosted cloud solutions for cost-effectiveness, regulatory compliance and security