We describe how CrystalDB uses AI techniques to turn PostgreSQL into a self-managing serverless database
PostgreSQL is the world’s most advanced open source relational database, a title it has earned over many years thanks to the tireless efforts of its dedicated community. Keeping its lead in the age of AI will require accelerated innovation, some of it from outside the community’s traditional areas of expertise. This presents a challenge, but PostgreSQL’s users will benefit tremendously from its success.
There are two avenues for interplay between AI and system software such as PostgreSQL. “Systems for AI” describes traditional computer systems built to run AI applications, while “AI for systems” describes systems that work better because they use AI. We will describe both briefly, then focus on the latter—CrystalDB’s specialty.
Examples of systems for AI include vector indexing, which makes it efficient to search the high-dimensional spaces used by AI applications. In this case, the vector index is implemented using traditional programming techniques but built to enable AI applications. The world of systems for AI is broad. It includes machine learning frameworks such as PyTorch, platforms such as the Robot Operating System, and, of course, the GPUs providing the underlying computing power.
Among the well-publicized successes of AI for systems is Google’s 40% reduction in data center cooling costs, which was achieved using deep neural networks to save energy while protecting critical infrastructure from overheating. An even more impressive mechanical example is the self-driving car, with services like Waymo now operating in complex, safety-critical environments such as the streets of San Francisco. Extensive work has explored employing AI to improve operating systems, networking, cybersecurity, and, naturally, databases.
At CrystalDB, we are dedicated to developing self-managing capabilities to turn PostgreSQL into the world’s smartest relational database. This means it should run production workloads more reliably than any human database administrator ever could while providing good efficiency. Most importantly, a self-managing database (i.e., one that deploys AI for database system management) will eliminate burdensome and mundane administrative tasks, giving people time back for more productive endeavors.
Many organizations spend a lot of time worrying about their database infrastructure, digging to diagnose performance issues, and responding to critical situations. Despite these efforts, they rarely arrive at configurations they have full confidence in. They struggle to predict scalability and must always be ready for surprises. This is a sad state of affairs with significant consequences: slow or unavailable applications, harried teams that spend too much time fighting fires and too little time improving their products, and wasted capacity that drives up costs.
These problems have been with us for decades, and the candidate solutions have also. Look at what’s happening under the hood of PostgreSQL, and you will find numerous background processes, recovery mechanisms, and other algorithms designed to keep the system running smoothly without human intervention. Some of the tasks that commonly vex humans running PostgreSQL are at least partially automated on other database platforms. For example, tools are designed to advise the administrators of Microsoft SQL Server or Oracle database systems about adding or removing indexes to ensure optimal performance and efficiency. These tools run without customer supervision as part of the respective cloud offerings.
Still, efforts to automate the role of the database administrator have fallen short time and again. Ask experienced Oracle DBAs about automation features, and they will likely tell you that some are useful and that it’s best to turn others off in the interest of reliability and stability. For example, in the PostgreSQL ecosystem, the autovacuum feature can be a useful tool for automating necessary housekeeping. Still, it comes with 13 knobs that are difficult to configure, which is a consequence of it being based on heuristics rather than AI techniques.
Without a doubt, operating a mission-critical relational database today requires the attention of a skilled operator. Relational databases have hundreds of knobs that affect performance and reliability—over 300 in PostgreSQL— making it hard to know where to start. Even the most sophisticated DBAs engage in a game of guess-and-check to adjust many of these settings. Projecting performance is particularly challenging because problems such as lock contention can crop up intermittently and with little warning, yet with devastating effects. When choosing the right hardware, there are few guides save for intuition.
There are tools and templates for configuring systems to ensure availability. However, these do not provide ways to quantify the expected availability or help determine whether improvements are worth the cost.
We refer to the automated agent responsible for taking over system management responsibilities from people as an AutoDBA. In contrast to past systems optimized for performance measures such as throughput or latency, a modern AutoDBA must put reliability first. It is the key to making full autonomy possible. Previous attempts at automating databases, which have existed for years, have likely reached their limits—achieving full autonomy requires a fundamentally different approach.
The most important news in AI is rarely the subject of the breathless predictions blasted by pundits, nor is it the latest demonstration of generative AI. Much more important is the quiet success of automation in everyday life. For example, in San Francisco, it’s hard to go more than a few blocks without encountering one of Waymo’s robotaxis. These vehicles transport thousands of riders daily from place to place within the city, operating in a complex urban environment. With pedestrians everywhere—and often not on the sidewalks—the stakes are tremendous. We do not know whether robotaxis are safer than human drivers in all scenarios, but the data shows they are very safe.
Autonomous vehicles combine sophisticated AI systems, such as deep neural networks for perception, physics-based models of the real world, and catalogs of road rules and driving behavior into integrated systems. They also contain subsystems dedicated to ensuring safety. For example, their architectures may pair a complex AI-based system with a simple model-based checker system to enforce guardrails on behavior. Additionally, rigorous testing programs and safety engineering methodologies such as failure mode and effects analysis (FMEA) are critical to ensuring safety.
At CrystalDB, we combine modern database optimization techniques with the control systems technology and engineering methodology proven in autonomous vehicles and robotics. While developing practical self-driving vehicles has cost billions of dollars, we can develop robust self-managing databases much more cheaply. In contrast to the cost of crashing a car, CrystalDB can crash a database for the mere price of a system restart, making it cheap to generate data on failure scenarios and to do so near the limits of stable operation. We also do not need to develop novel laser sensors, for example, because we can measure most of what we need from PostgreSQL’s internal metrics. Other factors also benefit us, such as PostgreSQL’s code providing a system specification that is more precise than any model of the physical world.
As companies have migrated from on-prem hosting to the cloud, most have adopted managed database services. These services have automated some basic PostgreSQL administration tasks, such as installing the database software and underlying operating system, applying security patches, creating backups, and replacing faulty hardware. Put simply, these tasks are generic PostgreSQL administration and maintenance, and one can write a straightforward program that automates these processes for any PostgreSQL database used in any application.
Managed database services do not provide the workload-optimized configuration required to make PostgreSQL run reliably and efficiently under demanding circumstances. They do not help administrators get the 300+ configuration settings right or assist customers with understanding the trade-offs among performance, reliability, and cost. They do not describe the next scalability breaking point, nor do they help ensure that every query is running efficiently. Managed database services leave all of these responsibilities to human database administrators.
Self-managing databases pick up where managed services leave off, adding application-specific and workload-optimized automation to the generic administrative functionality. By leveraging safety and reliability techniques—in CrystalDB’s case, those borrowed from autonomous vehicles—self-managing databases can fully automate database administration.
Managed services provide limited functionality because the techniques developed for optimizing database configuration to the workload are imperfect. These techniques exist in tools such as pgAnalyze, which help suggest database configurations but require humans to review the configurations before they can be applied. Self-managing databases overcome these limitations because they are engineered to ensure safety and reliability, not merely performance.
We occasionally receive feedback from software engineers or DevOps engineers who fear introducing AutoDBA to their environments could cost their DBA colleagues their jobs. By contrast, the DBAs we have spoken with are enthusiastic and unfazed, most seeing an endless stream of projects coming their way. They would love to spend more time supporting the development of new products or improving data governance and less time troubleshooting operational issues.
We believe that CrystalDB and other self-managing databases lead us toward a “DBA-optional” world where most applications can run reliably without specialized database operator skills. Today, these applications are generally run by part-time DBAs: non-experts responsible for database administration and other tasks such as DevOps or application development. With self-managing databases, these applications will have rock-solid reliability and performance. Human experts can still squeeze out even better performance and perhaps cost savings by considering some riskier options that the AutoDBA cannot safely try without review. In the most sophisticated environments—those with DBAs today—the benefits from such optimizations will be the greatest.
There is a lot of hype to cut through regarding AI. For example, we are as enthusiastic as anyone about the potential of generative AI, but we do not believe it offers an overarching design principle for self-managing database systems. Instead, we are exploring targeted uses of the technology; for example, it can help reason about the intended purpose of application queries, facilitate communication between AutoDBA and humans, or create adversarial test cases. We are also committed to the targeted deployment of deep neural networks, using approaches similar to those already available in database tuning technologies such as OtterTune or Tencent’s DBbrain.
Cloud-native and serverless architectures strongly complement self-managing databases. One of the benefits of serverless databases is pay-as-you-go pricing, which, in principle, makes it cost about the same to run several independent small databases as a single large database. Such database fleets can be used, e.g., to support microservices or multi-tenant SaaS applications. The problem is that even if you have a system architecture that scales to database fleets, the benefit is limited unless administrative capabilities also scale. Unless all the databases are identical, some form of AutoDBA is needed to achieve the full benefits of serverless architecture.
Serverless computing also promises to make it easy to take advantage of the scalability and elasticity of the cloud. The need to tune knobs when the workload changes limits the ability of serverless databases to deliver on this promise. Our experience at CrystalDB suggests significant advantages to developing serverless and self-managing database technologies in concert because they benefit one another. For example, a self-managing database can deliver better quality of service when it can request additional resources on demand, and it is possible to develop a more appropriate scaling strategy with workload understanding—for example, choosing whether to scale up by using larger hardware or to scale out by partitioning or replication.
CrystalDB provides self-managing serverless PostgreSQL. Our approach is unique because we adapt safety and reliability architecture and design methodology from self-driving vehicles and robotics. Whereas other approaches to tuning and managing databases focus on optimizing performance metrics such as throughput or latency, we chose reliability as the primary objective for CrystalDB’s AutoDBA. Secondarily, our approach optimizes performance and cost, using both time-tested database optimization techniques and recent advances in AI.
The manual configuration required to run a production PostgreSQL database well remains unacceptable. It leads to tremendous waste in terms of lost revenue due to service slowdowns or interruptions, spending on excess resources, time spent troubleshooting and debugging, and time not invested in development tasks that add business value. The database field has pursued aspirations of full automation for decades, and we finally have the tools and technology we need to reach that goal.
Due to AI's widespread impacts, the database space will evolve greatly in the coming years. That evolution will bring great benefits to database users, but it also threatens incumbent leaders. There are already indications that open-source PostgreSQL can maintain its leadership in a world of systems for AI. At CrystalDB, we are committed to ensuring that it also leads in AI for systems.
Explaining the difference between self-managing databases and managed database services
Postgres promises simplicity and power that is perhaps too good to be true. To get the most out of your database you must manage its 300+ configuration knobs carefully.