Welcome the Senior Vice President of AWS Utility Computing, Peter DeSantis.
Good evening. Welcome to Monday Night Live. As many of you might know, we like to do things a little differently on Monday night. But for those of you who don't know, just remember the five Ls - Lagers, Laughter, Launches, Learning, and Lagers. Let's do it.
Ok. Last couple of years, we've been talking about some of the largest serverless services that we build on top of AWS' massive infrastructure - S3, Lambda. If you include DynamoDB, these are the canonical examples of delivering on the promise of serverless computing. What is that promise? Serverless promises to remove the muck of caring for servers. No need to upgrade software, patch operating systems, retire old hosts, qualify new hosts. With serverless, it all goes away.
But this is just the beginning. Every year, I like to show a version of this slide. And these are the six most important attributes of a cloud computing service. These are the things that we spend a ton of time putting into all of our services. But serverless services let us take these things a step further.
For example, serverless is more elastic because we run over our vast infrastructure and share our capacity across a large number of customers. Each customer workload represents a small fraction of the capacity we serve. It's also more cost effective. Unlike a service where you have to pay for what you provision, with a serverless capability you only pay for what you use.
And serverless computing allows us to run our infrastructure more efficiently. And that means better sustainability because the most efficient power is the power you don't use. And because these services are built from the ground up to run on AWS infrastructure, they deliver better security and better availability, taking advantage of native capabilities like AWS Nitro and our Availability Zone architecture.
So why isn't everything we do serverless? Well, I think there's a couple of reasons for this. The first is familiarity and legacy code. One of the things we know is that over the long term change is inevitable, but change over the short term is hard. If you doubt this, consider the mainframe - it's not that developers love writing and maintaining applications on the mainframe, or that mainframe innovation keeps people there. It's because it's hard and expensive to move things off these systems. And even in less extreme examples, developers need time to learn new systems and approaches.
The second reason is richness of capability. To deliver on the promise of serverless capabilities, we deliberately introduced more targeted product offerings initially. For example, at launch, DynamoDB offered high performance reads and writes with minimal query semantics, useful for a broad range of applications, but a far cry from what a traditional SQL database could do at the time.
Now, we started off pretty simple, literally with Simple Storage Service and Simple Queue Service in 2006. But we've been working really hard over the years to add serverless capabilities. In fact, we've accelerated our pace of innovation and some of these features and capabilities change the way these services can be used entirely.
For example, with DynamoDB in 2018, we added the ability to have transactions, and this made DynamoDB a much better replacement for a traditional relational database. And of course, we launched completely new capabilities including EFS, our Elastic File System; Lambda, which pioneered serverless compute; and Fargate, which allows you to run serverless containers.
Now, there are amazing stories of technical innovations underpinning all of these capabilities. But tonight, I want to focus on a different story. One of the things that I love about working at Amazon is we reject the need to only pursue a single path. When you push us in different directions, our normal response is "Let's go!"
And while customers love our service offerings, they also love when we innovate around the tools and software that they use today. And that's why we support the broadest range of managed databases, file systems, operating systems and open source software. We're committed to making sure that AWS is the best place to run any of the software that you need. And this commitment is why we've been making a large investment in delivering the value of serverless computing to the serverful software that you love. And that's the journey that I want to go on tonight.
And what better place to start this journey than with the relational database? Historically, working with a database involved picking an instance configuration, attaching storage to the instance, installing a database, setting up the database, probably configuring complex replication schemes. And don't forget the daily joy of patching and maintaining the database or the occasional thrill of upgrading the database to a new version.
And that's why in 2009, we announced Amazon Relational Database Service. As we described in the launch post, RDS's goal was to make it easier to set up, operate, and scale a relational database in the same way that EC2 removes the muck of managing an instance. RDS removes the muck of running a database.
But how do you go from a managed database offering to something serverless? With years of innovation. There are a lot of innovations underpinning Aurora, which provides fully PostgreSQL compatible and MySQL compatible databases. But the biggest innovation of Aurora is its internal, database optimized distributed storage system - something we internally refer to as Grover.
Grover allows us to disaggregate our database from the storage itself. Now at first blush, this might not sound all that impressive. RDS uses EBS and EBS is disaggregated storage, right? True. But EBS just provides you with the ability to configure a better instance for your database. It's nice but Grover does a lot more.
This diagram represents how most modern relational databases are built. Aurora and Grover are built in much the same way. But the focus of their architecture is around one component and that component is the log.
The database log is an essential element to the capabilities that we expect from a relational database and perhaps less obviously to the performance of the database itself. The log is a meticulous record of everything that's happened inside the database. Everything else is just a manifestation of that log that allows you to run queries quickly and transactions efficiently.
Rather than assure that every modified memory page is immediately synced to durable media, the database engine carefully logs every step it takes using a technique called write ahead logging. And this log liberates the rest of the database engine to focus on performance, not needing to worry about maintaining consistency and durability, things that we treasure in our database.
The log can also be used to restore the database at any point in time. If you have the log, you have the database. And with Aurora, Grover has the log. Rather than logging locally, Aurora databases send each of their log entries to Grover and Grover immediately assures the durability and availability of those log entries by replicating them to multiple Availability Zones.
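To make that idea concrete, here is a minimal write-ahead logging sketch in Python. It is not Aurora or Grover code; the record format, the in-memory "pages", and the recovery path are simplified stand-ins, but it shows the property the talk relies on: log first, update memory second, and the whole state can be rebuilt from the log alone.

```python
# Minimal write-ahead logging sketch (illustrative only, not Aurora/Grover code).
# Every change is appended to a durable log before the in-memory page is touched,
# so the full state can always be rebuilt by replaying the log.

import json

class SimpleWALDatabase:
    def __init__(self):
        self.log = []        # stand-in for a durable, replicated log
        self.pages = {}      # in-memory key/value "pages", a manifestation of the log

    def put(self, key, value):
        record = {"lsn": len(self.log), "op": "put", "key": key, "value": value}
        self.log.append(json.dumps(record))   # 1. log first (write-ahead)
        self.pages[key] = value               # 2. then update memory

    @classmethod
    def recover(cls, log):
        # Rebuild the database purely from the log:
        # "if you have the log, you have the database."
        db = cls()
        db.log = list(log)
        for line in log:
            record = json.loads(line)
            if record["op"] == "put":
                db.pages[record["key"]] = record["value"]
        return db

db = SimpleWALDatabase()
db.put("account:1", 100)
db.put("account:2", 250)
restored = SimpleWALDatabase.recover(db.log)
assert restored.pages == db.pages
```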
But this is only part of what makes Grover so powerful. It doesn't just store the log. It actually processes the log and it creates an identical copy of the database's internal memory structures on the remote system. And these data structures can be sent back to the Aurora database anytime they need it so they can be loaded into the database's memory.
Now, the primary benefit of this is that it significantly reduces the I/O on the main database. Unlike a traditional database, Aurora no longer needs to write its dirty memory pages out to durable storage. It only needs to log to Grover, and writing a log involves a relatively small amount of sequential I/O, something that can be done quite efficiently. In fact, Grover can reduce the I/O demands of the Aurora database storage system by 80%. And that's why with Aurora, you get 3 to 5 times the price performance of the equivalent open source managed databases.
Grover also provides the durability of multiple Availability Zones without needing to set up database replication for your application. If your Aurora database or even a whole Availability Zone goes down, you can relaunch your Aurora database in any of the other Availability Zones.
Aurora also allows you to easily and efficiently scale out by adding read replicas. And of course you get serverless scaling of your database storage because each Aurora database has access to Grover's multi-tenant distributed storage service. It can scale seamlessly and efficiently from a single table to a massive database. And when the database gets smaller, that's taken care of too - drop a large index, stop paying for the index. That's how Grover works.
With our launch of Aurora, we took a big step forward on our journey to making the relational database less serverful and more serverless. But it's still a far cry from a real serverless service. For example, while Aurora helps you easily scale out the read capacity of your database by adding read replicas, you still need to update the primary database if you need more write capacity. An upgrade or downgrade of the server size requires you to fail over your database. And that is not very serverless.
So that's what led us to launch Aurora Serverless - Aurora scales up and down seamlessly as your database load changes without needing to resize or fail over the database. Now, how do we make an elastic relational database that can grow and shrink without a failover?
One obvious way to do this is to run the database on a very large physical server and let it grow as needed. And databases are good at this, they're used to asking the operating system for more memory when they need it. In our example here, our database is running on a physical host with 256 gigabytes of memory. When it needs more memory, it simply asks the operating system. And this is great, except with all that extra memory sitting around, there's going to be a lot of waste. So that's not gonna work.
So we could try running multiple databases on that same large server allowing each database to grow and shrink as necessary. And this will make the shared pool of resources more efficient. But there's a problem - this involves sharing. And as we've discussed in the past, sharing is complicated at AWS. We believe the only way to share server resources securely is with a hypervisor. Others may have different views on this. But for us, processes simply aren't an adequate security boundary. We also don't consider containers, which are really processes under the covers, to be adequate ways to isolate workloads. Our only abstraction for isolating customer workload is to use a hypervisor.
Our Nitro hypervisor provides purpose-built capabilities to deliver consistent performance on our EC2 instances. Let's look at what happens when you create an 8 gigabyte instance with a Nitro hypervisor. The hypervisor allocates 8 gigabytes of physical memory to the instance and the guest thinks it has 8 gigabytes of memory. When you launch an EC2 instance, you can rest assured that the Nitro hypervisor is reserving the memory and CPU resources that the instance comes with. This is why EC2 instances provide such consistent performance regardless of what other instances are doing. And this is great for a database. Databases love consistency.
But what happens when a database requests more memory? Well, the OS simply doesn't have anything to give. Even though there's available resources on this host, the guest is already configured statically with the smaller memory allotment. So our only option to grow this database would be to reboot and that's not much better than getting a whole new instance and failing over.
So Nitro isn't going to help us. We need an entirely different approach, and that is why we built Caspian. Caspian is a combination of innovations that span a new hypervisor, a heat management planning system, and a few changes to the database engine itself. Together, these innovations enable Aurora Serverless databases to resize in milliseconds in response to changing database load. And we use an approach that we call cooperative oversubscription.
A Caspian instance is always set up to support the maximum amount of memory available on the host that it's running on. In our example here, that would be 256 gigabytes. But unlike Nitro, these resources are not all allocated by the hypervisor on the physical host up front. Instead, physical memory is allocated based on the actual needs of the database running on the instance. And this process is controlled by the Caspian heat management system.
So let's see what happens when we add a database to our instance here. We see a single Caspian instance running on our host with 256 gigabytes of memory. The instance only needs 16 gigabytes of memory to run its database. So it's asked and been granted that memory by the heat management system.
The important thing to take away here is that the database is running on an OS that believes it has 256 gigabytes of memory. But under the covers, we're only using 16 gigabytes of memory. Just like on our original process based example, Caspian can run multiple databases and allow them to efficiently share the resources of the underlying host. But unlike our original example, with Caspian, we get all the security and isolation of a hypervisor.
So this seems to work great. But what happens when our databases need more memory? The Caspian heat management system is responsible for managing the resources of the underlying physical host. When a database wants to grow, it must first ask the heat management system for resources. And when additional resources are available, the heat management system can simply say yes and the database can instantly scale.
But what happens when we run out of memory? In this case, the Caspian heat management system replies with "please wait" and then it proceeds to migrate one of the Caspian instances to another physical host with available capacity. Doing this migration quickly is enabled by high bandwidth, low jitter networking provided by the EC2 instance. And it results in almost no performance impact to the database while it is being migrated. And after the migration is complete, our database can scale again.
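Here is a toy sketch of how cooperative oversubscription might be modeled. The HeatManager and Host classes are invented for illustration and ignore everything a real heat management system has to handle (prediction, live migration mechanics, failures), but they show the grant, "please wait", and migrate-then-grant paths just described.

```python
# Toy model of cooperative oversubscription (an illustration, not the Caspian
# implementation; class and method names are made up). Instances are configured for
# the host maximum, but physical memory is only granted on demand. When a host is
# full, the manager migrates the instance to a host with room before granting.

class Host:
    def __init__(self, name, capacity_gib):
        self.name = name
        self.capacity_gib = capacity_gib
        self.instances = {}                      # instance name -> granted GiB

    def free_gib(self):
        return self.capacity_gib - sum(self.instances.values())

class HeatManager:
    def __init__(self, hosts):
        self.hosts = hosts
        self.placement = {}                      # instance name -> Host

    def place(self, instance, gib):
        # Initial placement: first host with enough free physical memory.
        for host in self.hosts:
            if host.free_gib() >= gib:
                host.instances[instance] = gib
                self.placement[instance] = host
                return
        raise RuntimeError("no capacity anywhere in the fleet")

    def request_grow(self, instance, extra_gib):
        host = self.placement[instance]
        if host.free_gib() >= extra_gib:
            host.instances[instance] += extra_gib
            return "granted"
        # "Please wait": move the instance to the emptiest host, then grant.
        target = max(self.hosts, key=lambda h: h.free_gib())
        needed = host.instances[instance] + extra_gib
        if target is not host and target.free_gib() >= needed:
            target.instances[instance] = host.instances.pop(instance) + extra_gib
            self.placement[instance] = target
            return "granted after migration"
        return "please wait"

manager = HeatManager([Host("host-a", 256), Host("host-b", 256)])
manager.place("db-1", 200)                       # lands on host-a
manager.place("db-2", 40)                        # also fits on host-a
print(manager.request_grow("db-2", 128))         # host-a is full, so db-2 migrates
```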
Now, of course, the best time to have resources available is when you need them. And so the Caspian heat management system is actually constantly predicting which databases are going to need memory, optimizing the fleet.
"Here. You can see a portion of our production Caspian fleet scaling up and scaling down as load changes. And you can see things are changing all the time but the heat stays balanced across the fleet. So Caspian allows us to provide a scalable database and run our infrastructure efficiently.
Now, we're getting pretty close to serverless, but are we there yet? We haven't quite reached our destination. What happens when the resources we require extend beyond the limits of the physical host that we're running on? Well, there's nothing we can do. We're still limited by the size of the physical server, and that's not a serverless database.
Sharding is a well known technique for improving a database's performance beyond the limits of a single server. It involves horizontally partitioning your data into subsets and distributing it to a bunch of physically separated database servers called shards. To shard a database effectively, the goal is to identify a way to partition your data such that all the data needed for frequent accesses resides on one shard. In this way, the shard is able to execute the transaction locally, as efficiently as a monolithic database.
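As a rough illustration of the routing idea, here is a minimal hash-based shard map in Python. The shard names and the choice of customer_id as the shard key are assumptions for the example, not Aurora Limitless internals.

```python
# Minimal hash-based shard routing sketch (illustrative only).
# All rows that share a shard key (here, customer_id) land on the same shard, so
# frequent transactions touching one customer stay local to one database server.

import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(shard_key: str) -> str:
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# "customers" and "orders" rows for the same customer co-locate on one shard,
# so a transaction touching both tables never needs to cross shards.
for customer_id in ["c-17", "c-18", "c-19"]:
    print(customer_id, "->", shard_for(customer_id))
```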
With a little thought in your data schema and application design, database sharding offers a powerful tool to eliminate the scaling limitations of a scale up database and remove the limits of a server. However, there's a bunch of operational complexity involved in managing a sharded database.
First, you need to write your own routing and orchestration layer. Next, you need to set up and manage all those shards. And if you need massive scalability, you might be managing dozens or even hundreds of shards. And with most applications load is not uniform across all the shards. So we need to worry about scaling each of these shards up and down based on load. And at some point, you're likely to need to repartition the shards and moving this data around while the database is operating is a complicated operational task.
Finally, things get really complicated when you need to make transactional changes across multiple shards. So we asked ourselves, what would database sharding look like in a serverless world? And that's why tonight, I'm excited to announce Amazon Aurora Limitless Database.
Aurora Limitless Database makes it easy for customers to scale their database beyond the write throughput of a single server. With Limitless Database, there's no need to worry about routing your queries to the correct database shard. Your application just connects to a single endpoint and gets the scalability of a sharded database.
Aurora Limitless Database automatically distributes data across multiple shards. And you can configure it to co-locate rows from different tables in the same shard to minimize having to query multiple shards and maximize your performance.
But unlike common sharding approaches, Aurora Limitless Database provides transactional consistency across all your shards. For peak performance, you still want to localize transactions on shards as much as possible. But as I'll show you shortly, Aurora Limitless Database uses a unique approach to making these cross-shard transactions perform very well.
This probably sounds too good to be true. So let's have a quick look at how it works. As I mentioned earlier, to distribute each of your queries to a sharded database, you need a routing and orchestration layer. So of course, we built one of those, and with Aurora Limitless Database, we made a couple of important design decisions.
First, we designed our routing layer to require as little database state as possible. Our routers only need a small bit of slowly changing data to understand the schema of the database and the shard partition scheme. And keeping this layer lightweight means that we can scale quickly and it allows us to run across multiple availability zones efficiently providing high availability without the need for customers to manage complex replication.
Second, each of the routers is actually an Aurora database. So we can orchestrate complex queries across multiple database shards and combine the results, allowing you to run distributed transactions across your entire sharded database.
Now, the second big challenge of operating a sharded database is managing all the shards because load varies across the shards. In traditional shard databases, this requires considerable operational work to optimize performance and cost.
Fortunately, every one of the Limitless Database shards runs on Caspian, and this allows each shard to scale up and down as needed, to a point. What happens when we get to the largest database that we can support on a Caspian server? We've been here before. Well, fortunately, we have a better option than we do with a non-sharded database. We can split our shard into two new shards, and this is easy to do because Grover makes it easy for us to clone our database and repartition. And once the new shards are created, we can use our router fleet to easily and transparently update the routing layer without the database clients seeing any change at all.
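Here is a toy sketch of what a shard split looks like from the router's point of view. The range-based routing table and in-memory shard data are invented for illustration; the real repartitioning, which leans on Grover's cloning, is far more involved, but the idea that clients only ever see the routing table is the same.

```python
# Toy shard-split sketch (illustrative only). A hot shard's key range is divided
# in two, data is copied to the new shards, and the routing table is swapped so
# clients that only talk to the router never notice the change.

routing_table = {               # key range (lo, hi) -> shard name
    (0, 499): "shard-a",
    (500, 999): "shard-b",
}
shard_data = {"shard-a": {}, "shard-b": {}}

def route(key: int) -> str:
    for (lo, hi), shard in routing_table.items():
        if lo <= key <= hi:
            return shard
    raise KeyError(key)

def split_shard(lo: int, hi: int):
    old = routing_table.pop((lo, hi))
    mid = (lo + hi) // 2
    left, right = old + "-1", old + "-2"
    shard_data[left] = {k: v for k, v in shard_data[old].items() if k <= mid}
    shard_data[right] = {k: v for k, v in shard_data[old].items() if k > mid}
    routing_table[(lo, mid)] = left
    routing_table[(mid + 1, hi)] = right
    del shard_data[old]

shard_data["shard-a"][123] = "row"
split_shard(0, 499)                              # shard-a is hot, split it
print(route(123), shard_data[route(123)][123])   # clients still find key 123
```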
Now, things are getting really interesting. But we have one more thing to think about. We started off our discussion of relational databases this evening talking about how important an ordered log is to building a high performance relational database. But how can we do this on a sharded, distributed database?
On a single server, it's pretty easy and efficient to maintain a sequence number and use it to order everything that's happening on the database. But how can you accomplish this on a distributed database? Well, it turns out you have a few options.
The first option is to have a single server maintain a sequence number and have all the databases coordinate with this server. But this is going to slow things down quite a bit and it's definitely not going to scale.
So a second option is to use a logical clock. A logical clock is basically a counter that gets passed around and incremented every time two servers interact. Logical clocks avoid the scaling limitations of a serialization server. But these distributed logical clocks are very different than a simple sequence number. And to implement a traditional relational database on top of a logical clock would be quite an undertaking.
So fortunately, there's a third option and that's to use wall clock time. If we could have a synchronized clock across all the servers in our sharded database, then it would be easy to establish the order of events by simply comparing time stamps.
Now, this sounds like a promising solution, unless you've spent time working with clocks. The average clock in a server drifts by about one second per month. Some will gain time, some will lose time, some will drift a little less, some will drift a little more.
Now, a second in a month may not sound like that much, but it's more than enough to make the clock pretty much worthless as a sequence number. And of course the solution to this is to sync your clocks. And that's why five years ago, we launched Amazon Time Sync Service to help AWS users sync clocks.
Time Sync provides an easy way to keep clocks accurate to within a millisecond. So how would this sort of accuracy, one millisecond, help with our database ordering problem in a distributed system?
If you want to use a clock to order actions, you're constrained by the accuracy of your clock. You actually have to wait until you're certain that your local clock is ahead of all the other clocks in the system. And it turns out that this actually requires you to wait for twice the amount of time that your clock could be inaccurate because you have to account for some clocks being faster and some clocks being slower.
So with our one millisecond clock sync, we can only order 500 things per second. And that's not a very high number when you're trying to build a high performance database.
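A quick sketch of the arithmetic behind that number, under the simple commit-wait assumption described above: to order events with a clock that may be off by some error bound, you wait out twice that bound before exposing each event.

```python
# Why clock accuracy caps ordering throughput (a sketch, not the Aurora implementation).
# To be sure a timestamp is in the past on every server, you wait out twice the clock
# error bound before exposing the event, so strictly ordered events are capped at
# 1 / (2 * error_bound).

import time

def globally_ordered_timestamp(clock_error_seconds):
    ts = time.time()
    time.sleep(2 * clock_error_seconds)   # wait until every clock, fast or slow, has passed ts
    return ts

for error_bound in (1e-3, 1e-6):          # 1 millisecond sync vs 1 microsecond sync
    print(f"clock error {error_bound * 1e6:>9.0f} us -> "
          f"at most {1 / (2 * error_bound):>9,.0f} ordered events per second")
```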
So a few years ago, we asked ourselves if we could find a better solution to synchronize clocks. And this is the third innovation underpinning Aurora Limitless Database. We've changed the database to use wall clock time to create a distributed database log that achieves very high performance. And it's made possible by a very novel approach to synchronizing clocks.
Syncing clocks sounds like it should be as simple as one server telling another server what time it is. But of course, it's not that simple because the time it takes to send a message from one server to another server varies. And without knowing this propagation time, it's impossible for those clocks to be synced with great precision.
So time sync protocols calculate this propagation time by sending round trip messages and subtracting the time spent on the remote server from the total round trip time. And this sounds easy enough, but there are caveats.
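Here is the classic round-trip calculation those protocols rely on, written out as a small Python example with made-up timestamps. Note the baked-in assumption that both directions take the same time, which is exactly where the caveats below come in.

```python
# Classic round-trip clock sync arithmetic (NTP-style; shown for illustration).
# t1: client send, t2: server receive, t3: server send, t4: client receive.
# The offset estimate assumes the forward and return paths take equal time -- the
# very assumption that network and hardware variability undermines.

def estimate_offset_and_delay(t1, t2, t3, t4):
    round_trip_delay = (t4 - t1) - (t3 - t2)      # time actually spent on the wire
    clock_offset = ((t2 - t1) + (t3 - t4)) / 2    # how far the server clock is ahead
    return clock_offset, round_trip_delay

# Example: the server clock is really 5 ms ahead, each direction takes 2 ms,
# and the server holds the packet for 1 ms before replying.
offset, delay = estimate_offset_and_delay(t1=100.000, t2=100.007, t3=100.008, t4=100.005)
print(f"estimated offset: {offset * 1000:.1f} ms, round trip on the wire: {delay * 1000:.1f} ms")
```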
The first thing to understand about these clock sync protocols is that they work over the same network that you're sending your data on. They're running on the same operating system, using the same network cards, traversing the same network devices, running over the same network fibers. All of these things can create variability in the propagation time, and while these are small variations, they impact how closely you can synchronize your clocks.
Additionally, these protocols rely on being able to update the timestamps in network packets at the very instant they leave a server or switch and most hardware is not optimized to do this. So this too introduces variability.
Now, while the downsides I just discussed make things hard, the reality is you can do a pretty good job of syncing your clock on a small network if you're willing to devote a significant amount of your network to doing so. But it gets really hard to run these protocols at regional or global scale.
So we decided to do something a bit more custom and we were inspired by how clocks are synced in some of the most demanding environments like particle accelerators where having clocks synced as closely as possible all the time is a must.
It might not surprise you to hear this all begins with Nitro. After all, many of my favorite stories begin with Nitro. Nitro is the reason that AWS got started building custom chips and it remains one of the most important reasons why AWS is leading the way with respect to performance and security in the cloud.
One of the things that Nitro enables that would be really hard and expensive to do without Nitro is that we can add very specialized capabilities to instances at low cost. And that's exactly what we did here. Our latest generation Nitro chips have custom hardware to support accurately syncing their local clock based on a time pulse delivered by a custom designed time synchronization network.
Here you can see a picture of one of our time synchronization racks. At the top of the rack is a specialized reference clock that receives a very precise timing signal from a satellite based atomic clock. Now, these reference clocks provide incredible accuracy. They can provide a synchronized clock anywhere in the world to within a few nanoseconds - that's a clock accurate to a billionth of a second anywhere in the world.
Each of these racks also has a local atomic clock to keep time in case the satellite signal is momentarily unavailable. And each of our Availability Zones has multiple of these time distribution racks.
Now at the bottom of the rack, you can see the specialized time synchronization network that distributes this timing pulse. Let's have a look at that. This is one of those time sync appliances and in the middle, you see a Nitro chip and on the right, you see an FPGA. Together these are used to implement our time synchronization network.
But what you don't see on the slide is a network stack and that's because these devices don't route packets. Instead they do one thing and they do one thing only. And that is synchronize clocks.
In combination with the specialized Nitro cards in our EC2 hosts, this network distributes a timing pulse directly to every EC2 server and every step of this distribution is done in hardware. There's no drivers or operating systems or network buffers to add variability.
How cool is this - a custom designed network for synchronizing clocks? Well, it's cool enough that we couldn't keep it to ourselves. And so a couple of weeks ago, we announced a new version of Amazon Time Sync.
Amazon Time Sync now gives you a way to synchronize to within microseconds of UTC on supported EC2 instances. That's a clock that's accurate to a millionth of a second anywhere in the world. These accurate clocks can be used to more easily order application events, measure one way network latency, and increase distributed application transaction speed.
And of course, with Aurora Limitless Database, this means we can support hundreds of thousands of ordered events per second. And this is why we're able to run those distributed transactions so efficiently.
But our journey is far from over. Relational databases are not the only serverful things that we're investing in reinventing. Caches are another powerful tool; they're used for improving latency and are also critical to cost effectively scaling your services.
Amazon ElastiCache is a managed caching service, much like RDS. ElastiCache minimizes the muck of managing popular caching applications like Redis or Memcached. And while ElastiCache greatly reduces the work in managing a cache, it's not very serverless. In fact, the first thing you do with ElastiCache is you select an instance because unsurprisingly caches are very tied to the servers that host them. And that's because the performance of the cache relies on the memory of the server.
If your cache is too small, you're going to evict data that is going to be useful. And if your cache is too large, you're wasting money on memory you don't need. Like Goldilocks searching for the perfect porridge, finding the perfect cache is an elusive goal.
Typically you end up provisioning for peak to assure that you have enough memory when you need it most. And this means that most of the time you're likely running on too large a cache and wasting money. But with a serverless cache, you wouldn't need to worry about this at all.
Well, today, I'm happy to tell you, you have a serverless cache. Amazon ElastiCache is now serverless with Amazon ElastiCache Serverless. With ElastiCache Serverless, there's no infrastructure to manage and no capacity planning to do. We get it right on your behalf.
A critical feature of any cache is speed, and we've got it covered there as well. The median latency of an ElastiCache Serverless lookup is about half a millisecond, and there's great outlier latency as well. An ElastiCache Serverless cache can scale well beyond what any single server could possibly do - five terabytes of memory.
So I bet you think I'm gonna tell you how this works. I expect that you already know. One of the things that I find exciting about building infrastructure at AWS is that when we solve an interesting problem, we can often use the solution to solve problems in other places. And Caspian's ability to right size in place is exactly what we need.
Now hopefully you recognize this diagram, and if you don't, it's probably time to lay off the extra latte you took on the way in. But what we're looking at here isn't Aurora Serverless v2 databases.
It's ElastiCache Serverless, with cache shards running on Caspian rather than database shards. But everything else here works the same. The shards can grow and shrink, and the Caspian heat management service works its magic to keep the fleet well utilized.
And of course, there's one difference here, which is the request routing layer for ElastiCache. The key to this routing layer is ensuring it's really, really fast, because we can't afford to add latency to any of the cache requests. And that's exactly what we did.
Customers are excited about the performance and agility of ElastiCache Serverless. We know that removing the undifferentiated heavy lifting and muck is critical to enabling our customers to innovate. And one of my favorite parts of Monday Night Live is hearing from these customers.
One of those customers is Riot Games who are making use of several AWS services to power their new experiences for gamers globally. To tell us more about their journey on AWS, please welcome Brent Rich, Head of Global Infrastructure and Operations at Riot Games.
Brent: Thanks Peter. Hello everyone. As Peter said, my name is Brent Rich with Riot Games. I was just in South Korea at a League of Legends World Championship that you saw at the end of the video and it still takes my breath away. It is my honor to be up here today to talk about our journey behind what you just saw and how with AWS, Riot was able to supercharge our purpose to make it better to be a player.
Now at times, the story may seem like it's jumping back and forth and things were happening all at once and that's because they were. But with the help of AWS, we ended up doing it all, even when we had to change timelines or pivot priorities. And spoiler alert - it happened a lot.
For me though, it all started 5.5 years ago when I joined Riot Games. The thing is back then, we were a single game company with all our eggs in one basket - League of Legends. And back in 2009, when we launched, it was completely reliant on co-located data centers, which we managed because let's be honest, we didn't trust that anyone else could meet our bar for making live games great. And that worked for about a decade.
But by 2017, it was taking forever to get things done. And we knew that if over 100 million players worldwide didn't see Riot investing in and loving our own game, they would leave even if it is free to play. So we started looking at options and we quickly settled on going all in on cloud.
That meant we would migrate to AWS, which had the broad set of services that we needed to run Riot. We also decided that all new things would be born in the cloud and for us, that meant new games.
Now let's pause for a second because if you didn't notice, my intro stated that I work at a place called Riot Games, plural, right? Well, Valorant was the next big opportunity in early 2020 to collaborate on with AWS - the global launch of Valorant, a tactical shooter game with very specific design goals. And one in particular presented a unique challenge which we were able to address thanks to AWS global cloud infrastructure - peeker's advantage.
Peeker's advantage occurs when a player in motion can push a corner knowing there is a split second delay for the defender on the other side, which allows the peeker to see the defender first. And one way to address this in first person shooter games is by adding more places for the defenders to hide. But that increases the luck and chance of the game versus it being on the player's skill alone. And to us, that just feels bad.
So we made it a priority to mitigate this in Valorant, and we determined that at highly competitive levels of the game, if server tick rate is at least 128 per second, which requires a ton of compute, and network latency is under 35 milliseconds to players, which requires all those AWS locations - if those two things happen, peekers have no advantage.
And if you're not familiar with how this technically works in the game, player actions and movements are sent to the server, the server then updates and sends the simulation state of the 10 players back to all of them. And it does this 128 times per second across the internet. It's this speed of updates between all 10 players to the server and back again that effectively mitigates peeker's advantage and makes playing much more fair, which we hope is more fun as a result.
And by using AWS, Riot was able to launch Valorant globally with low risk. We were able to provision a bunch of cloud capacity all over the world with no long term commits, leveraging AWS regions, local zones and outposts. This meant that if the launch didn't work out, we would simply shut it down and fail fast.
And so after all of that, 2020 was great, right? Not quite. Unfortunately, while we did have massive success in making Riot Games plural, the pandemic was causing havoc in the world of esports. All our competitions like Worlds had ground to a halt, and while games were doing great, esports had no new content for them. So we had a new problem - how could we reinvent remote broadcast and eliminate the need for onsite staff sitting in cramped trucks?
Of course, AWS had a solution for that. We ended up enabling video encoding and production to happen in the cloud with our folks accessing it through AWS WorkSpaces from home. It was pretty wild - from the proposal to rollout, it ended up taking all of 11 days.
And today we take it even further. We leverage AWS to remotely produce events all over the world, all from our remote broadcasting centers in Dublin, Ireland and Seattle, Washington. This allows producers, editors and casters to be in one place while all our events are at another.
Ok, so now that we had esports back producing world class competitions and Valorant was successfully launched and doing amazingly, it was time to monetize League and migrate it into the cloud. But there were some challenges.
Like first, we had to figure out how to safely deploy, configure and test 30+ microservices in a world where every service team was used to managing their services independently and how they wanted.
And second, using Amazon EKS out of the box just didn't meet our needs because some of our games can run on average 35 minutes and we couldn't just pause or take a container out of service within 15 minutes if required for AWS maintenance.
So we made a product ask and AWS delivered a short term solve and then followed through with a long term solution for us. But we also knew there would be some additional expected benefits along the way.
And one was uptime - in the old days when everything in a Riot data center failed in an unexpected way, it was often a large outage that lasted 1 to 3 hours. But once we got up and running on AWS, those outages instead turned to hiccups that players barely notice.
And the other one was visibility. Anyone here ever spent an unreasonable amount of time trying to figure out what you have, how it's configured and who consumed what for who? So did we. But with AWS, retrieving this data is pretty much now all an API call away.
And so with all these challenges and benefits, what did we get for it? Well, we did migrate 14 data centers. We modernized the decade old game for hundreds of millions of very loyal players around the world. We basically rebuilt the plane in flight. We modernized broadcasting and we launched several global games in the cloud over 36 months. It was a lot.
Ok, as I wrap this up, I'd like you to remember that Riot didn't actually do anything groundbreaking here. When there were problems, we looked for solutions and it just so happened that AWS had them for us.
And so a quick tip - make the ask. It doesn't matter what size company you are. If it makes sense for the broader customer base, AWS just might do it. Or like us, there might be a short term solve and one final takeaway - when you're looking at cloud and whether it serves your needs, don't assume what was true 6 months ago is true now. Cloud moves very fast and it's always changing.
I know for a fact that we would have saved quite a few headaches if we were willing to re-evaluate more often versus pursuing our own solutions. So please keep an open mind. And with that, thank you very much for listening to our journey and I can't wait to see what AWS comes out with next.
Peter: Thank you, Brent. It's exciting to see the growth of Riot and how we work together to enable amazing experiences for gamers. But we're coming to the end of our journey, we have one more stop and it's a big one. Let's look at something that has a reputation for needing the biggest servers around - the data warehouse.
Data warehouses are specialized databases that are designed to work over massive data sets. They serve a large number of users and process millions of queries that range from routine dashboard requests to ETL processes to complex ad hoc queries.
And traditionally, they have some of the biggest servers available to ensure they have the resources they need for this business critical workload. And if you're taking care of a data warehouse, it's not just picking the right server and storage that you need to worry about.
They come with a ton of knobs and options that can be used to optimize the performance of your data warehouse and achieve your cost objectives. And all of these knobs need to be manually tuned and retuned to achieve optimal results, which is why we challenged ourselves to remove this undifferentiated heavy lifting for you so you can focus on getting business value out of your data warehouse and not on getting a PhD in database systems management.
That's why we launched Redshift Serverless in 2021. It automatically scales based on workload, optimizes the data layout of the data warehouse and automates common data management operations. And while we're happy with our early success, our most demanding Redshift customers have told us that for some workloads, they still need to intervene.
Let me show you why. A day in the life of a large production data warehouse might look something like this:
Most of the time, the data warehouse is running small well tuned queries to help with routine business processes and serve reports and dashboards. It's important to make sure those queries happen quickly and with predictable latency - nothing is worse than having the boss's favorite dashboard load slowly.
And there's usually large periodic workloads like ETL jobs that happen hourly or maybe daily. And these need to be carefully managed to avoid interfering with those smaller workloads and to make things even more interesting, every once in a while, some really smart data scientist runs a massive unexpected query that takes forever to finish and bogs everything else down while it's running.
So let's look at how Redshift Serverless manages these challenges today:
Redshift Serverless scales based on query volume, and in a vacuum, when the queries are similar, this works really well. Here, you can see a number of new queries coming into our data warehouse. Each query is a circle, bigger circles are more complex queries, and the container at the bottom is the preconfigured Redshift capacity unit called an RPU.
The base RPU is set by the database administrator and is the unit of capacity that Redshift Serverless will use to scale your data warehouse. Currently, we see a bunch of small and medium sized queries coming in and being run successfully. But what happens when our load increases?
Redshift Serverless uses reactive scaling - when a number of concurrent queries gets above a certain threshold, additional capacity is provisioned. However, it takes time and this adds additional latency while the database is setting up additional capacity. And while this barely matters to long running queries, it can make a big difference to those short running queries. Let's hope none of those queries are the boss's dashboard!
Once the new cluster capacity comes online, everything's back to healthy. But as I mentioned, queries aren't uniform and there's no magic baseline capacity that will optimize for all the work that a data warehouse will see.
Here's our Redshift Serverless cluster again and everything's running quite well. But what happens when that big ETL job starts? It ends up in the same capacity as the boss's dashboard and that's not great because it's a big query and everything's going to slow down a bit while it runs.
And even though everything's going slowly, the data warehouse doesn't scale up because we haven't passed our query threshold. How do customers deal with this today? Well, with Redshift and other data warehouses, many customers create a second data warehouse to separate their jobs. This works, but it's expensive, inefficient and complex to manage.
Ok, let's get back to a happy place. Everything's running great again, but not so fast. Your favorite data scientist, Dave, just fired off the biggest, baddest query you've ever seen. And well, let's just say everyone is probably calling you right about now because everything is running really slowly.
How do we make sure that Dave can't ruin your weekend? Well, today, I'm excited to announce a massive set of improvements to Amazon Redshift Serverless: next generation AI driven scaling and optimization. This is going to be exciting.
These innovations represent monumental improvements to Redshift Serverless internals to address each of the challenges we just spoke about and deliver up to 10x better price performance.
Our first innovation is building a detailed model of the expected load of the data warehouse to proactively scale our capacity. By building a machine learning based forecasting model trained on the historical data of the data warehouse, we can automatically and proactively adjust capacity based on anticipated query load.
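As a toy illustration of proactive scaling, here is a sketch where a simple hour-of-day average stands in for the machine learning forecast; the history, headroom factor, and RPU numbers are all made up for the example.

```python
# Toy proactive-scaling sketch (illustrative only, not Redshift Serverless internals).
# A real system would train an ML forecasting model on the warehouse's history;
# here a simple hour-of-day average stands in for that forecast.

from collections import defaultdict

history = [                      # (hour_of_day, observed load in RPUs) from past days
    (8, 60), (8, 72), (9, 140), (9, 150), (17, 40), (17, 35),
]

def forecast_load(hour: int) -> float:
    by_hour = defaultdict(list)
    for h, load in history:
        by_hour[h].append(load)
    samples = by_hour.get(hour)
    return sum(samples) / len(samples) if samples else 0.0

def proactive_capacity(hour: int, headroom: float = 1.2, minimum: int = 32) -> int:
    # Provision ahead of the predicted peak instead of reacting after queries queue up.
    return max(minimum, round(forecast_load(hour) * headroom))

for hour in (8, 9, 17):
    print(f"{hour:02d}:00 -> provision {proactive_capacity(hour)} RPUs ahead of time")
```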
But of course, no matter how good we are at predicting what will happen next, we're always going to be surprised. So we need to improve our ability to react and make good decisions in the moment.
For example, when Dave sends his next complex masterpiece, we want to avoid bogging down the production cluster and instead schedule that big query on its own capacity. And because Dave's query is long running, the extra time probably won't be noticed by him. And actually, he might get better results because by optimizing the infrastructure for Dave's query, he might get his results sooner.
But how do we understand each query in real time as it arrives? Like before, we use AI. However, this time, our machine learning models help us understand the resource demands and performance of each query.
And to do this, the query analyzer creates a feature embedding of each query, which takes into account over 50 unique features - things like query structure, types of joins used, and dataset statistics. And these embeddings enable us to better identify the complexity of each query compared to a more naive approach like just looking at the SQL text because as datasets change, what was simple yesterday might not be so simple today.
Now, it turns out that 80% of the queries that run on our data warehouse have been seen before. And for these, we're able to use our encoding to quickly look up information about the query and know with high confidence how it will perform and how it might be optimized to run faster.
But what if we haven't seen a query before? In this case, we have a second small model that's trained on all the queries that have ever been run on your data warehouse. And because this model is small, it can generate a reasonable prediction about how quickly a query will execute. And speed on this part of the process is important because if it's a small query like a dashboard request, we don't want to delay it for long while we do more complex analysis, we want to immediately execute it.
But what if the query is complex? What if Dave is sending us another masterpiece? If our small model estimates that the query is complex, we perform another analysis with a much larger model that we've trained on all the data warehouse queries that we've ever run, enabling us to understand the likely performance of a query in more detail.
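Here is a rough sketch of that tiered flow. The feature embedding, the "small model", and the "large model" are crude placeholders invented for the example, not Redshift's actual models, but the control flow mirrors the description: cache hit first, fast estimate next, expensive analysis only for queries flagged as complex.

```python
# Tiered query analysis sketch (illustrative only; feature names and models are made up).
# Known queries are looked up by their embedding; unknown ones get a fast estimate,
# and only queries the fast path flags as complex pay for the expensive analysis.

def embed(query: str) -> tuple:
    # Stand-in for the real feature embedding (structure, join types, table stats, ...).
    return (len(query), query.upper().count("JOIN"), "GROUP BY" in query.upper())

seen_queries = {}                      # embedding -> known decision

def quick_estimate(features: tuple) -> str:
    # Small, fast model stand-in: keeps short dashboard queries moving immediately.
    return "complex" if features[1] >= 2 or features[0] > 200 else "simple"

def deep_analysis(features: tuple) -> str:
    # Large model stand-in: detailed resource/scaling prediction, only when worth it.
    return f"dedicated capacity recommended for features {features}"

def plan(query: str) -> str:
    features = embed(query)
    if features in seen_queries:                      # most queries: seen before
        return seen_queries[features]
    verdict = quick_estimate(features)
    decision = "run immediately" if verdict == "simple" else deep_analysis(features)
    seen_queries[features] = decision
    return decision

print(plan("SELECT region, SUM(sales) FROM orders GROUP BY region"))
print(plan("SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id JOIN d ON c.id=d.id"))
```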
And one of the things we can use this large model to predict is how a particular query will respond to different cluster resource levels. And it's worth doing this additional work because how a query scales is actually fairly hard to predict. If everything scaled linearly, like this line, life would be easy. But sometimes adding more resources doesn't result in significantly better performance. We call this sublinear scaling. And with enough resources, every query eventually exhibits this kind of scaling.
But other times, adding resources can actually yield greater than linear performance benefits, or superlinear scaling. This can happen, for example, if a query is memory constrained and constantly swapping memory in and out. When you add enough memory to prevent the swapping, you can see a big performance gain.
So when we're analyzing a query, we need to decide how much resource to allocate to the query on behalf of our customer. And if the query is exhibiting superlinear scaling, or even linear scaling, we want to add more resources, because the query will run faster and there will be no additional cost. Win-win. But at a certain point with every query, adding more resources has diminishing returns. You can still get performance improvements, but you do so at additional cost.
So this is a harder question for AWS to answer. When should we add resources, and when should we stop? You need to be in charge of that decision. And that's why you can now tell Redshift Serverless how to optimize your data warehouse queries in situations like this. You specify your cost and performance sweet spot. On the far left, the data warehouse will run cost optimized, and we will only add resources if it's cost neutral. You move the slider to the right to tell us to scale more aggressively, even if adding additional capacity to go faster adds additional cost. And you pick the spot on this slider that makes sense for your business.
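To illustrate how such a slider could interact with a predicted scaling curve, here is a toy decision rule with made-up runtime numbers; it is not the Redshift Serverless algorithm, just the cost-neutral-versus-performance trade-off expressed in code.

```python
# Toy resource-allocation sketch for one query (illustrative only).
# Given a predicted runtime at each capacity level, keep adding capacity while it is
# roughly cost neutral; a "slider" value loosens that rule to favor speed over cost.

# Predicted runtime (seconds) at each RPU level for one query -- made-up numbers.
predicted_runtime = {32: 400, 64: 190, 128: 110, 256: 90}

def cost(rpus: int, seconds: float) -> float:
    return rpus * seconds                     # cost is capacity multiplied by time

def choose_capacity(slider: float) -> int:
    # slider = 0.0 -> only scale when cost neutral;
    # slider = 1.0 -> accept up to 2x the baseline cost to go faster.
    levels = sorted(predicted_runtime)
    chosen = levels[0]
    baseline_cost = cost(chosen, predicted_runtime[chosen])
    for level in levels[1:]:
        if cost(level, predicted_runtime[level]) <= baseline_cost * (1.0 + slider):
            chosen = level                    # faster and still within the cost budget
    return chosen

print("cost optimized:", choose_capacity(0.0), "RPUs")
print("performance optimized:", choose_capacity(1.0), "RPUs")
```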
Now that we understand what's happening, let's look at how all these pieces come together. Here's our data warehouse, perfectly sized for a steady state workload of short running queries. Things are quiet. Almost too quiet. Ah, we can see that Redshift Serverless is actually adjusting its base capacity down. We must be going into the weekend, and I guess the boss doesn't work on the weekend. So Redshift Serverless is saving us some money by reducing its base capacity.
Uh oh, looks like Dave does work on the weekend. That's a real doozy of a query. But you can see it's not being run yet. It's actually being analyzed by Redshift Serverless to determine how to best run it. And it looks like we're gonna have to spin up a really big cluster. So Dave had to wait while we figured that out. But now Dave's query is running much faster than it did last time, and there was no impact to the rest of the data warehouse.
Dave's happy, the boss is happy, you're happy. We're excited by the results customers are seeing with the new Redshift Serverless. This is truly the promise of serverless computing, and it's getting smarter all the time.
Well, this wraps up our journey from serverful to serverless, at least for now. Something tells me that there will be more destinations down the road. But for now, let's shift gears a little and talk about something else.
Normally, we spend a few moments on Monday night diving into our latest AWS chip innovations like Graviton and Trainium. But this year, I want to talk to you about an entirely different type of chip - the type of chip that runs a quantum computer. What is a quantum computer? That's complicated.
It might be tempting to conceptualize a quantum computer as a really, really fast supercomputer. And while it would be great to have a super supercomputer, that model is not useful. A quantum computer is really something else entirely.
We all know the binary bit. We build computers from transistors that store a bit, a zero or a one, as the presence or absence of a charge inside of a transistor. And we compose these bits into structures and manipulate them with gates to do things like produce floating point numbers and integers. And then we compose those into things like programming languages and databases and operating systems.
On the other hand, quantum computers are constructed out of quantum bits, or qubits, and a qubit is not a transistor but rather a quantum object like an electron or a photon. We often use the image of a sphere to represent a qubit, with the states one and zero at the poles of the sphere. In this way, it's similar to a classical bit, but you can also have combinations of the zero and one, what's called superposition, which means the state of the qubit can be at any point on the surface of the sphere. And qubits interact with each other in strange ways, which is called entanglement.
And it's the combination of these two characteristics, superposition and entanglement, that gives quantum computers the ability to solve some very hard problems unbelievably quickly. So while we can't compose faster general purpose computers from qubits, we can use them to solve some really interesting problems in areas like chemistry, cryptography, and process optimization. And this in turn could help us in fields like agriculture, renewable energy, and drug discovery.
That's why in 2019, we established the AWS Center for Quantum Computing on a campus at Caltech. And tonight, I want to give you a little peek into what we've been working on.
Now, Caltech is a good place to start this discussion because it's where quantum computing started. Forty years ago, Richard Feynman first proposed the idea of building a quantum computer, and he did this because he knew that a classical computer would never be powerful enough to simulate the interactions of particles in the quantum world. So he postulated that the only way to do this would be to use quantum particles themselves.
Now, Feynman knew it would take a number of scientific breakthroughs before we could actually realize a quantum computer. About 10 years after this, Peter Shor, a mathematician at Bell Labs, surprised everyone with the discovery of a factoring algorithm, a quantum algorithm that could provide exponential speedup over the best classical number factoring algorithms. If you've heard that quantum computers are going to destroy the internet, it's because of this algorithm.
Now, personally, I'm not losing sleep quite yet. And I'll show you why in a minute. But Shor's algorithm is a seminal moment in quantum computing because it showed that quantum computers could be useful for solving problems beyond just simulating the quantum world.
Several years later, physicists first began experimenting with small quantum systems consisting of two qubits that could interact and operate with computational gates. This was done in a laboratory setting. Ten years later, scientists figured out how to produce qubits on the same kind of electrical circuits that we use for classical computing. And this marked the beginning of the engineering race to build a useful quantum computer.
So how many qubits would we need to do something useful? We can probably start doing some interesting things in chemistry and physics with a few hundred high quality qubits, but breaking something like RSA is going to require many thousands of qubits.
You may have seen that quantum computers with hundreds or even thousands of qubits are being produced today. So it's a reasonable question to ask: why haven't quantum computers started to change the world? And like many things, you have to read the fine print, and the fine print with quantum computers says they're noisy and prone to error. In all the computers we use today, we do occasionally experience errors, bit flips: a zero turns into a one or a one turns into a zero. And we use error correction to protect ourselves from these sorts of errors.
For example, we use ECC memory, which automatically protects against bit flips in our memory system. And the overhead of error correction in a classical computing system is pretty low, because bit flips are very rare and because bit flips only happen in one dimension, between zero and one. So with memory, for example, we can typically protect it with a single parity bit for every eight bits of data - a very small overhead.
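For example, a single parity bit per byte looks like this (detection only; real ECC memory uses a stronger code that can also correct the error, but the small-overhead point is the same):

```python
# One parity bit protecting eight data bits (detection only, shown for illustration;
# real ECC memory uses a stronger code that can also correct the flipped bit).

def parity(byte: int) -> int:
    return bin(byte).count("1") % 2          # 1 extra bit protecting 8 data bits

stored_byte, stored_parity = 0b10110010, parity(0b10110010)
corrupted = stored_byte ^ 0b00000100         # a single bit flip in memory
print("flip detected:", parity(corrupted) != stored_parity)   # True
```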
But in the quantum world, noise is a much harder problem. Quantum objects are far more sensitive to noise from their environment, and qubits store more than a simple zero and one. As we've seen, qubits can actually experience errors in two dimensions. They can have bit flips between one and zero, but they can also have phase flips.
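To see the two error dimensions side by side, here is a tiny sketch that treats a qubit as a pair of amplitudes (real numbers here, for simplicity): a bit flip swaps them, while a phase flip negates the second one, an error a classical parity check has no notion of.

```python
# A qubit as two amplitudes (alpha|0> + beta|1>), and the two error types described
# above: a bit flip (X) swaps the amplitudes, a phase flip (Z) negates the |1>
# amplitude. Purely illustrative linear algebra, no quantum library needed.

import math

plus = (1 / math.sqrt(2), 1 / math.sqrt(2))        # superposition state |+>

def bit_flip(state):                                # X error: |0> <-> |1>
    alpha, beta = state
    return (beta, alpha)

def phase_flip(state):                              # Z error: |1> picks up a minus sign
    alpha, beta = state
    return (alpha, -beta)

print("after bit flip:  ", bit_flip(plus))          # unchanged for |+>
print("after phase flip:", phase_flip(plus))        # becomes |->: the second error dimension
```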
So where are we in the quest to minimize qubit error rates? Fifteen years ago, the state of the art was a qubit that had one error every 10 quantum operations. Five years later, we could achieve one error in 100 operations. And today the state of the art is probably about one error per 1,000 quantum operations. So we've improved 100x in 15 years. This is pretty good news.
The problem is qubits are still far too noisy to be useful. The quantum algorithms that we get excited about require billions of operations without an error. We'd need about 22 million more of these screens to show this animation. We looked into it for tonight's presentation, but it didn't work out.
So what about using error correction? We can do qubit error correction by encoding a block of physical qubits into what we call a logical qubit. But because the underlying qubits still have a pretty high error rate, we need a lot of physical qubits to create one single logical qubit. With our current 0.1% error rate, each logical qubit requires a few thousand physical qubits.
Here, we see the same chart that we looked at earlier, but now we're showing the number of physical qubits that we would need to solve a problem with today's 0.1% error rate. And now you can see why I'm not losing sleep at night. Shor's algorithm would require a very large number of qubits. So today's quantum computers aren't close to where they need to be to start solving these big, hard, interesting problems.
But the good news is that with error correction, things can improve quite quickly. With a further improvement in the physical error rate, we can reduce the overhead of error correction significantly. Here, I've adjusted the graph for another 100x improvement in physical qubit error rate, and you can see that this starts to get these qubit counts down to something you can get your head around. And this is maybe something we can do in the next 10 to 15 years.
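For a rough feel of why the error rate matters so much, here is a generic back-of-the-envelope estimate using a surface-code-style scaling law with illustrative constants; these are not AWS's numbers, but they show how a 100x better physical error rate collapses the per-logical-qubit overhead.

```python
# Generic error-correction overhead estimate (illustrative constants, not AWS's numbers).
# With a surface-code-like scheme, logical error falls exponentially with code distance d,
# and physical qubits per logical qubit grow roughly as 2*d^2 -- so a better physical
# error rate shrinks the required distance, and the overhead, dramatically.

def physical_qubits_needed(physical_error, target_logical_error, threshold=1e-2):
    d = 3
    while True:
        logical_error = 0.1 * (physical_error / threshold) ** ((d + 1) // 2)
        if logical_error <= target_logical_error:
            return d, 2 * d * d
        d += 2                                   # code distance grows in odd steps

for p in (1e-3, 1e-5):                           # today's ~0.1% vs a 100x better error rate
    d, n = physical_qubits_needed(p, target_logical_error=1e-15)
    print(f"physical error {p:.0e}: distance {d}, ~{n} physical qubits per logical qubit")
```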
But how do we go faster? Well, another way we can speed up the quest for a reliable qubit is to implement quantum error correction more efficiently. And that's what our team at the AWS Center for Quantum Computing has been hard at work doing. Today, I'm happy to give you a sneak peek. Take a look.
This is a quantum device. It's a custom designed chip that's totally fabricated in house by our AWS quantum team. And the unique thing about this chip is how it approaches error correction, by separating the bit flips from the phase flips. With this prototype device, we've been able to suppress bit flip errors by 100x using a passive error correction approach. And this allows us to focus our active error correction on just those phase flips that we looked at.
And by combining both of these approaches, we've shown that we can theoretically achieve quantum error correction six times more efficiently than with standard error correction approaches. Now, while we should be mindful that we're still in the early days of this journey to the error corrected quantum computer, this step is an important part of developing the hardware efficient and scalable quantum error correction that we need to solve interesting problems on a quantum computer.
We're going to be sharing more details about these experimental results soon. So if you're interested, stay tuned.
Now, I originally asked my team to bring some really cool equipment here to show us how these chips are built. And then the team showed me how much it would cost if I broke some of that equipment. So instead, I have some really cool pictures of that equipment. With a strong emphasis on reducing noise in our systems, quantum chips need to be developed very carefully.
And like most chips, quantum chips begin their life on a silicon wafer. One of the challenges of building a quantum computer is it needs to operate inside of a refrigerated environment near absolute zero. But we need to access the qubits and connect them to a classical computer outside of the refrigerator. So we have to start by bonding multiple chips together. One chip contains the sensitive qubits and the other chip contains the wiring used to read the qubits.
From here, the chip is bonded to the external printed circuit board, and the bonded chip and package is then mounted into a gold plated copper mount - the thing I had in my hand two seconds ago. It provides thermal anchoring to the refrigerator, but it also provides the first level of electromagnetic shielding, which in turn protects the chip from bit flips.
The assembly is then carefully mounted into a dilution refrigerator, where it's cooled to a few thousandths of a degree above absolute zero. Now, many of you have probably assembled your own PC, so this is a familiar process. This video here is a time lapse of Cody from the AWS Center for Quantum Computing, and it was shot over 2.5 hours. And finally, when he's done with this process, we can try out the new quantum chip.
Of course, there's a lot of core engineering that goes into any chip before you ever lay hands on the silicon. And having good design tools for laying the chip out is also a key part of the development process and an important area of innovation for us. For example, the team developed a full scale electromagnetic simulation of the chip, and that helps us bring down the environmental noise and produce a higher quality qubit.
They've even open sourced their tool kit to run these electromagnetic simulations. And I'm told it works best on Graviton.
Now, we're excited to share our quantum computing milestone, and we believe the industry is at the beginning of an exciting new period of quantum innovation - the period of the error corrected qubit. It will be an exciting journey, and we will be sure to continue to update you as we go.
So whether it's quantum computing, custom hardware, reinventing databases, or new hypervisors, at AWS it's always day one, and we'll continue to innovate on your behalf, because we know that our innovations allow you to build better experiences for your customers.
So that about seals the deal for another Monday Night Live. Enjoy re:Invent, and thank you for coming.