"I'm gonna guess that I'm not the only person in this room who has been paged in the middle of the night because the disk failed on a database server. And I will bet that I'm not the only person here who has been faced with the situation of frantically trying to add resources to my database because I got a surge of traffic and it's about to fall over.
And I'm pretty sure that there are others in this room who have had the experience of having to tell a customer, "You know, I'm sorry, we're gonna have to take down time because I've got to patch the database, right?"
These experiences have taught us that you can't build a highly available application if you don't also have a highly available database.
Hi, I'm Jeff Duffy. I'm a product manager for Amazon DynamoDB. I'm gonna be joined today by Tom and Rich from Amazon Ads. And we're gonna talk to you about how DynamoDB helps you build highly resilient applications.
So I'm gonna talk a little bit about what resilience means and how we think about measuring it. And I'm gonna talk about how DynamoDB's features help you build for resilience.
Tom's gonna come up and explain why Amazon Ads chose to move a critical workload to Dynamo to increase their resilience.
And then Rich is gonna come up and do a deep dive on the choices he made in order to achieve their resilience goals.
And so when I talk about resilience, what I'm talking about is the ability to adjust to change. And that change comes in several forms for an application.
Usually we first start thinking about it in terms of infrastructure failure, like a disk failing or a server dying. And certainly we need to be resilient to that, but we also need to be resilient to variance in demand, like a surge of new traffic that we have to be able to handle.
And by the way, resilient applications should also be able to handle demand changes in the other direction, right? When things get a little quieter. And also system modifications, the most common example of which is patching your software. Some sort of bug hits and you need to apply a patch to your system, and therefore your database needs to be resilient enough to let you keep operating when you need to do that.
And so when we think about the components of resilience and how we'll measure it, these are the two primary types. Disaster recovery focuses on an entire workload. We usually think about DR after something bad happens and the whole thing falls over: how am I gonna get back to where I was before that happened? We also think about this in terms of high availability, or HA, and HA focuses on the individual components of our application workload, not the thing as a whole.
And with HA, our goal is usually to handle the failure of a component and keep operating, without having to respond after the event has happened.
So when we think about measuring resilience, the industry typically thinks about it in these two terms. The first is our recovery point objective, or RPO. Our RPO basically dictates how much data I'm willing to lose if something bad happens to my database or application, and we usually measure it in time.
So for example, if doing backups on a regular schedule is my resilience strategy and I'm doing them every two hours, then I should expect my RPO to be two hours: I could lose up to two hours of data. A recovery time objective, or RTO, is about how quickly I can get back to normal operations, to where I was before.
And so these two measures, RPO and RTO, are the goals that we set for ourselves when we build for resilience, so we know whether or not we're achieving what we set out to do.
And so of course, you can ask yourself, well, why don't I just make my RPO and RTO zero at all times, right? I never wanna lose data. I never wanna go down.
And the answer is cost, right? Because the more resilient we are, the more that's gonna cost to do. And not every workload needs to be super highly resilient. We can have dev/test workloads where it's OK if we throw the data away; we're probably gonna create it again anyway.
So at AWS, we talk about resilience strategies in our Well-Architected program, and we have four primary ones that we usually talk about.
The first of these is a simple backup and restore, and this is gonna have resilience measured in hours, an RPO and RTO of hours, because it takes time to take a backup, restore it to a new table in DynamoDB's case, and then point your application at the new table so you can get back to doing what you're doing. But the advantage here, of course, is that it's very low cost to adopt backup and restore as your primary resilience strategy.
The next one we talk about is called pilot light, and this dramatically lowers our RPO and RTO to tens of minutes. This is the one where we cross the threshold between DR and HA a little bit, in the sense that we're going to commit to copying our data to multiple places, which in AWS usually means Regions, in order to achieve that low RPO and RTO.
But you're only going to have one copy of your infrastructure in one place; you'll just be ready to deploy that infrastructure next to the second copy of data should something happen, typically using infrastructure-as-code tools like CloudFormation should the need arise. And this keeps the cost of pilot light lower, because we only have one copy of the infrastructure running.
With the warm standby strategy, now we're going down to single-digit minutes, and the way we achieve this is that we typically have copies of our infrastructure deployed in multiple Regions.
But we don't scale them both all the way up to handle production traffic. We usually have a primary region where most of our production traffic is flowing, and then we might have a secondary failover target region with a trickle of traffic flowing through, just so we can make sure everything's working.
But should we need to shift all the production traffic over, of course we can do so. And then there's active-active. This is the scenario where we want zero data loss, we want zero RPO, and we're willing to bear the higher cost of running significant amounts of traffic in multiple places at the same time, in order to ensure that I can shift my traffic wherever I need at any time.
So the underpinning of how Dynamo does what it does, in order to deliver those higher resiliency features to you, starts with AWS's global infrastructure.
And I'm sure you're familiar that we have 32 Regions, which are the geographic areas of the world where we have resources, and those are divided into Availability Zones, which are groups of resources in a region that are aligned along failure domains like flood plains, power, and network availability, things like that.
And then the services that we provide, like DynamoDB, offer regional API endpoints that are themselves highly redundant. So the idea being, of course, that if a sub-regional failure happens within AWS, we're gonna be able to handle it transparently and you can still get to the service that you need to talk to.
So when you're planning for resilience, we need to start by understanding that, just as with security in other parts of AWS, there's a shared responsibility model for resilience as well.
AWS is responsible for resilience of the cloud, and that means that we're implementing our infrastructure and services to deliver the levels of guarantee that we provide: the redundancy in our networking and our infrastructure that makes sure our services are as available and resilient as we promise they are.
And it's your responsibility as the customer to have resilience in the cloud. What that means is that you're following our best practices for DynamoDB, for instance turning on deletion protection so you don't accidentally delete production tables, and testing the backups that you're taking to make sure they're doing what you expect, so that the resilience goals you're setting are actually going to be met should you ever need them.
So, Dynamo has some foundational resilience features that just come out of the box, where you don't have to configure anything. It's just the way Dynamo works, and that starts with Dynamo being serverless.
So Dynamo is a serverless service, which means that you don't have to worry about provisioning resources in order to have your database handle the level of traffic you need. We take care of that for you.
Dynamo stores copies of your data across three AZs. When you write a piece of data to a DynamoDB table, we guarantee that it's stored in at least two Availability Zones before we acknowledge the write, and then we eventually consistently copy it to a third Availability Zone.
So we provide you an extremely high level of durability within a region. And then Dynamo provides zero-downtime updates, which is a very big deal for a lot of our customers. No longer do they have to worry about scheduling maintenance outages in order to patch the software; we handle that transparently behind the scenes for you.
So part of resilience is managing capacity and the need for throughput, and Dynamo offers a number of ways for you to manage this. We offer two different capacity modes, the first of which is called provisioned capacity.
You can think of this as two knobs on your DynamoDB table that you can turn in order to increase or decrease the amount of read and write throughput that table can handle.
Now, of course, most customers don't want to manually manage these knobs themselves. And so we offer integration with the AWS Auto Scaling service, which allows you to set limits (floors and ceilings) and a target for your capacity, and it will automatically scale up and down depending on the traffic throughput that we observe.
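To make that concrete, here's a minimal sketch in Python with boto3 of what that setup can look like; the table name, floor, ceiling, and target utilization are illustrative values, not anything shown in the talk.

# Minimal sketch: put a provisioned table's read capacity under
# Application Auto Scaling with a target-tracking policy.
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target (floor and ceiling).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/MyTable",                      # hypothetical table name
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=4000,
)

# Attach a target-tracking policy that aims for roughly 70% utilization.
autoscaling.put_scaling_policy(
    PolicyName="MyTableReadScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/MyTable",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)

The write-capacity dimension ("dynamodb:table:WriteCapacityUnits") would be configured the same way.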
We also offer on-demand capacity mode, and this is sort of the easy button for managing capacity for Dynamo; it scales all the way to zero. We automatically look at the level of traffic throughput that's coming in, and if we notice that you need more throughput capacity than your table currently has, we'll just go ahead behind the scenes and automatically add that for you.
And so of course you may be asking, well, why don't I just use on-demand all the time? The answer is that on-demand and provisioned capacity have different pricing structures. Provisioned capacity is a really good choice if you have very predictable throughput needs that only change every so often; it's gonna be very cost efficient. On-demand is excellent for workloads that tend to have super spiky patterns and that also go to zero at relatively frequent intervals.
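If you want to see what switching modes looks like in practice, here's a minimal boto3 sketch; the table name and capacity numbers are illustrative, and DynamoDB limits how often a table can switch billing modes.

# Minimal sketch: move a table to on-demand, or back to provisioned capacity.
import boto3

dynamodb = boto3.client("dynamodb")

# Easy button: pay-per-request, no capacity planning.
dynamodb.update_table(
    TableName="MyTable",                 # hypothetical table name
    BillingMode="PAY_PER_REQUEST",
)

# Or return to provisioned mode with explicit read/write throughput.
dynamodb.update_table(
    TableName="MyTable",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits": 500,
        "WriteCapacityUnits": 200,
    },
)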
So another set of features that Dynamo offers are recovery features, and these start with point-in-time recovery.
Point-in-time recovery, or PITR, is a continuous, 35-day log of all of the changes that are made to your DynamoDB table. We're always keeping track of those changes for you when you turn PITR on.
And you can, at your choice, restore to any second within that 35-day period, to a new DynamoDB table. This protects you against things like accidental overwrites of data in your database, where maybe you didn't delete the table, but a bad code push went and wrote over a bunch of data that you didn't want it to.
Well, you can go back in time to before that code push, do a PITR restore to a new table, and get back to business as you were before.
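As a concrete illustration, here's a minimal boto3 sketch of turning PITR on and restoring to a chosen second; the table names and timestamp are hypothetical.

# Minimal sketch: enable point-in-time recovery, then restore to a new table
# at a specific second inside the 35-day window.
from datetime import datetime, timezone
import boto3

dynamodb = boto3.client("dynamodb")

# Turn PITR on for the source table.
dynamodb.update_continuous_backups(
    TableName="Orders",                                  # hypothetical table
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Restore to the moment just before the bad code push, into a new table.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="Orders",
    TargetTableName="Orders-restored",
    RestoreDateTime=datetime(2024, 11, 30, 8, 15, 0, tzinfo=timezone.utc),
)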
And of course, DynamoDB offers backup and restore as well. And we offer integration with AWS Backup in order to let you do things like schedule your backups and copy them across regions and so forth as well.
Dynamo has a feature called Global Tables that offers multi-active, multi-Region replication of your data. And this is nice because there is no primary, and thus there's no concept of failover to implement in your resilience strategies. You can write to any replica in any Region in a DynamoDB global table at any time, and that change will be eventually consistently replicated to all of the other replicas within that table.
This also allows you to increase your availability guarantees from four nines in a single Region to five nines across multiple Regions for DynamoDB. And again, you don't need to think about failover; if you want to shift traffic, you just start sending that traffic to a different Region's replica than you had previously.
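To show how little there is to configure, here's a minimal boto3 sketch of adding a replica Region to an existing table, which is what makes it a global table; the table name and Regions are illustrative, and the source table needs DynamoDB Streams (new and old images) enabled first.

# Minimal sketch: add a replica Region to an existing table
# (Global Tables version 2019.11.21).
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="Orders",                          # hypothetical table name
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}}  # new replica Region
    ],
)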
And so if we take the strategies that I talked about before and the DynamoDB features that I just described and put those together, this is what it looks like to implement some of those resilience strategies with DynamoDB.
When we start with backup and restore, that's gonna be the PITR feature that I described before: PITR is running, tracking your changes continuously over that 35-day period.
And of course you can schedule backups with DynamoDB and the AWS Backup service to make sure that you're going to have backups at regular intervals, if that's the strategy you want to adopt.
Now, when we talk about the pilot light strategy, again, remember we're talking about copying that data across multiple regions. And the easiest way to do that in DynamoDB is to use Global Tables of course.
And so what you'll do in a pilot light strategy, and remember this is the one where I'm gonna have all of my infrastructure deployed in one region, is have just a copy of my data in that other region, or maybe a very minimal set of resources.
What you can do is have your global table replicas replicate your data to that second region, but scale that replica down to a minimum level if you want to, in order to conserve costs, or just leave the table in on-demand mode if you're not gonna use it, and we'll handle scaling it down for you.
The warm standby approach can then also be implemented with Global Tables.
And remember, the only change here is that you're deploying more of your infrastructure in that second region. Global tables is already doing that replication for you; it's already handling changes in whichever region you want to make them. And what that means is that you can have that trickle of traffic always flowing through that second region, to guarantee that your infrastructure is working the way you want, and still be confident that your data is gonna be consistent with your primary production region.
Finally, when we talk about active-active: global tables replicate in an eventually consistent fashion, so they cannot deliver an RPO-zero guarantee, because it takes some time to replicate that data to another region. However, I talk to a lot of customers who say, you know, I don't need RPO zero, I need RPO 30 seconds or RPO one minute, and global tables generally replicate your data in much less than a minute, especially if you're staying within a given continent. So you can implement an active-active approach with global tables, just not quite zero RPO. If you need zero RPO, then the approach you're gonna have to take is to have multiple tables in multiple regions and write to both of them as your application works. And of course, for active-active you're going to manage that capacity by fully provisioning the replicas to the level of traffic that they're going to need to process, or, as I said, simply use on-demand and we'll handle that for you.
And now I'd like to turn it over to Tom Skinner, who's going to talk about how he and Amazon Ads have decided to use DynamoDB.
Tom: Thanks, Jeff. Well, for the folks who made it, this is the last session on the last day of re:Invent. I'm Tom Skinner, a director of software development for Amazon Advertising measurement. I'll be presenting a quick overview of Amazon Ads, a discussion of one of our most critical workloads that we've moved to DynamoDB, and the results that we have achieved.
Amazon Ads' mission is to help build brands and businesses of all sizes while creating ad experiences that are useful for our customers. We serve advertising on and off Amazon properties across connected video devices, web, and applications. At its most basic, advertising drives discovery. Advertisers use our advertising programs to build awareness or increase shoppers' consideration of a certain product or service. These advertisements can be about a product sold on or off Amazon, and we serve advertisers and agencies in all industries, including financial services, travel, entertainment, and more. Advertisers use our analytics and measurement to plan marketing strategies, serve and optimize ad delivery, and measure advertising outcomes.
Because advertisers are using our measurement to make investment decisions, it must be accurate and near real time, especially during peak events like Prime Day, where they're optimizing several times throughout the day. Given the importance of ads measurement to our customers, our infrastructure must be resilient to both service disruption and infrastructure failure. In addition to the calculated metrics themselves, we also calculate a data age, which is the total time it takes the pipeline to calculate a metric and make it available to our customers.
As I said, one of the most critical workloads we operate is our attribution algorithm pipelines. Attribution is the way advertisers determine how marketing tactics and the subsequent customer interactions contribute to sales conversions or other goals. Attribution is one of the ways customers answer the question: what was the return on my advertising spend? An example might be a shopper looking for a TV. They'll come to Amazon, search and browse, and look at various detail pages comparing models. Eventually they may click on a sponsored ad listing and purchase the TV. Our algorithm must consider all of those interactions, views and clicks, and determine which one receives credit for that purchase. Then the advertiser will use our reporting systems to view that performance and gather metrics such as return on ad spend, the ratio of sales to ad spend.
Some of our traffic sources have predictable bell-shaped curves, roughly matching browsing behavior, increasing throughout the day and slowing in the evening, and they may also increase many times over on high-volume days such as Prime Day. We also serve video advertising such as Thursday Night Football. These events have unpredictable spikes where we can deliver many millions of impressions in the span of 90 seconds or less, followed by random periods of quiet. To make things interesting for the team, this year we had both a football game and Black Friday on the same day.
If we connect these concepts back to Jeff's introduction, our infrastructure must be resilient to constantly changing demand, and it should ideally be completely elastic and react automatically to these peak demands. DynamoDB on-demand capacity is one of those great features that lets us hit the easy button.
To give you an idea of the scale of the data we process and the workloads we have within ads measurement: we have four petabytes of data in DynamoDB, we process more than 100 billion events daily, and our peak DynamoDB throughput is more than 90 million RCUs and more than 5 million WCUs. To compute attribution, the algorithms require a lookup of all relevant traffic for the previous 30 days. This lookup requires us to do high-scale range scans filtered on date and other identifiers, and trillions of those ad identifiers must be searched with sub-second query response times.
When these algorithms were originally built, we chose HBase as the storage technology, and in order to scale query throughput over time, we replicated these HBase clusters on AWS EC2. As you could expect, managing these hundreds of nodes, even with automation, was very painful for our engineers. To support those use cases, the team had to add custom modifications to HBase code, which also made our software upgrades very time consuming and expensive. The clusters ran on EC2 nodes in a single region on older-generation hardware, so we would have at least one unplanned cluster outage every sprint due to hardware retirement or other hardware failure. And because the cluster was in a single region, we were exposed to regional incidents as well.
Because we managed our own cluster, we had to scale in large steps well ahead of peak, accruing unnecessary costs. HBase has this lovely feature where you can add nodes, but in order to remove them, you have to rebuild the cluster. So in late 2022 we set out on a journey to migrate our storage layer to a new technology. The team had four high-level requirements.
First of all, we had to do this migration with zero downtime for our customers. As I've stated, attribution is critical to our customer experience and they expect near-real-time performance. So we had to do parallel runs and data comparisons to make sure the numbers weren't changing for our customers. Second, the new system had to have the same or lower query latency to enable this cutover.
We implemented a service layer in our data pipelines so we could change the storage layer without having to change the pipelines themselves; the service layer added latency, so we had to make it up in other ways throughout the data pipeline. Third, we wanted to use a managed solution if one was available. The team currently had about a three-month ramp-up period for an engineer to become, essentially, an HBase expert, and we wanted to eliminate that. We also introduced additional availability risk every time we had to do HBase server upgrades. And finally, we had an aggressive migration schedule: we had budgeted only about 30 days to do the full migration, due to the cost of running parallel systems as well as the operational burden on the team.
So what did we do? We first refactored our data pipelines to eliminate the heavy HBase client and implemented a service layer built on ECS. This new service layer gave us a layer of indirection that allowed us to do A/B testing, latency evaluation, and data quality comparisons through CloudWatch, and it gave us rollback capability. Next, we pointed our ingestion process to dual-write mode and loaded both DynamoDB and HBase at the same time. However, this started accruing migration costs, so the countdown clock was started. We completed the cutover within our one-month budget in early April, and at the end of that month we tore down our legacy HBase clusters and deleted the code and alarms. We actually had a huge party for all of the team members, including past team members and managers, with what we call our "dead" cake, which we always have whenever we kill or deprecate an old piece of technology.
Since the April cutover, we've seen the following results. Our availability has increased from four nines to five nines when we enabled global tables. Developer ramp-up time is reduced from three months to two weeks; engineers no longer have to become HBase experts and can just focus on their business logic. Since the migration, the team's ticket load is reduced by 40%, which allowed the team to take on two additional charters with no additional headcount. Finally, we did this in a completely cost-neutral manner: our cost for the AWS managed solution is less than the cost of our hand-managed HBase clusters.
Now I'd like to turn it over to Richard Edwards, the principal engineer on my team who designed this migration, and he'll give you a deeper dive on the DynamoDB design.
Rich: Hi everyone. I'm gonna tempt fate here, but I do want to highlight one key thing before I dive into this. This system has operated through two worldwide global events with zero operational pages for its engineers. We're talking about Prime Day, the Turkey Five, and then a local event, Diwali, with zero pages and zero operational load on the team. So let's go into it.
To highlight the problem Tom was talking about first: this is what we call the internet-scale join. Essentially we're taking purchase events, or anything else we call conversion events, and joining them with ad interactions, and that can be quite a large join. So let's take an example and put some numbers to it.
If we look at 100 billion events a day and we join to the last 14 days of data, we're talking about 1.4 trillion events. If you were loading that into memory or onto an EMR cluster and running a Spark job, and those events are one kilobyte each, that's 1.4 petabytes of data. And if you're doing that on an hourly cadence, that's not practical. So what we did way back is we inserted HBase: we decided to pre-index the data and make those fetches efficient. This allowed us to scale out and continue to grow as Ads grew. So this made it efficient, but we were hand-managing HBase, which was creating some operational pain.
So as we looked to replace it, our functional requirements were lowering the operational load and obtaining higher availability. Basically, every time we went down, customers weren't getting metrics, so we needed to either never go down or be able to recover very quickly. The more critical requirement is that we can't afford data loss. Our customers depend on this data and make critical decisions off of it, and the data has to be right; losing any data means we're reducing the quality of the results for the customer and diminishing their ability to make decisions. And, more importantly, we had to support all existing workloads.
We didn't have a single workload that we believed we could turn off; they were all critical and all important to a customer. And so our solution was to take HBase and replace it with DynamoDB.
Now, that seems like a very simple conclusion, but we actually worked with HBase for quite a long time and made serious engineering investments before we reached this decision. So I wanna walk through some of those to give you an idea of what we were operating.
These were our HBase operational issues, if we had to pick our top four: availability and redundancy, disaster recovery, query load balancing, and just basic cluster maintenance. You can imagine that taking down a cluster to update it was actually not very straightforward; it was generally about an eight-hour process.
So let's look at availability zones. As Tom mentioned, we replicated our data, so we had hand-managed replication configurations across our clusters, and each cluster was striped into its own availability zone.
Now, you're thinking, well, why aren't they all in three AZs each? The problem is that if one AZ goes out and every cluster is striped across that same AZ, you just take 33% of the capacity out of all your clusters.
So we followed the EMR model of every cluster failing independently, so that when we lose one cluster, we retain two thirds of all of our capacity. However, fun thing about managing your own application: if the cluster in the middle of the replication chain goes out at two in the morning, your engineer has to wake up and manually edit replication strings to recover replication between availability zones one and three. Not fun.
And then there's query load balancing. The thick client that Tom mentioned previously allowed a query to execute against any of the clusters, gathering results and managing the load. However, you can imagine that in this configuration, if a cluster goes down, the thick client will retry, but it could still go back to the bad cluster.
So you're getting woken up at 3 a.m. to take a cluster out of service because it's failed. That's silly. So we added an automatic cluster removal capability using Lambda and cluster health metrics: when a cluster went unhealthy, we removed it from service automatically.
And so now an engineer is getting paged at three in the morning to fix a cluster, but the customer is not impacted. Slightly better, but not great. What's really at work here is that the cluster ownership overhead is what causes a lot of the headaches.
The custom HBase builds, ramping up engineers on HBase and its internals, and, more importantly, SME blast radius, or what we like to call subject-matter-expert blast radius. This picture on the right is of me on vacation halfway across the world; an hour after this was taken, I was paged. A routine maintenance went south and they couldn't get a cluster back online.
I don't know how many of you have been on vacation and just gotten a reach-out page saying, hey, I need help, but that's what happened with HBase for us. It takes very specialized work for our engineers to understand and operate it, because they have to know the internals deeply just to do basic cluster maintenance.
But the second real pain point is low elasticity. As Tom mentioned, we're dealing with multiple variable workloads and we always have to be prepared for the worst. So we were consistently over-scaling and over-provisioning, which is just not financially efficient. DynamoDB solved these pain points for us.
There's no OS, there's no security patching, there's no redundancy management, and it's fully elastic, so we're able to control our cost by scaling up and scaling down accordingly. And selfishly, I can go on vacation and not get paged, which is a win for me and definitely for our engineers as we continue to grow the team.
But to get these results, we're gonna go through the design categories that we looked at: table structure, throughput, and table management. Now, these are all interconnected; they all impact resiliency together. You can't have one without the other.
Your table structure might be good, great, you can get good access, but does that mean you can get good throughput? And as you start looking at table structure and throughput, how do you manage your tables? Do you need one table or multiple tables, GSIs, global tables? All of these things matter.
When you put these together, they really come back to Jeff's point on the shared responsibility model. We, as customers of AWS, are responsible for the brown portion of this box: we have to build applications that can work within AWS's supported limits and make sure that our applications stay available.
Let's look at what this looks like from an application standpoint. As a naive toy example, we have a predictable load, an application, and DynamoDB. As we change that load, the application may keep working, but if we have a bad design, it could fail. DynamoDB won't fail; it guarantees you the resiliency and the capability to stay available, but your application may fail.
And so, under the shared responsibility model, it's up to us to design to avoid those failures, whether from timeouts, retries, et cetera. So as we walk through table structure, let's look at our workload and how we access data in our application.
20% of our workload is querying all interaction data regardless, no filtering, no criteria; 80% of the workload is targeted filtering and selection of data. So let's visualize this to make it a little more concrete. If you look at this block, we have a partition of data with a partition key and a sort key.
Within this data, everything is related to that partition, you have nicely grouped blocks by attribute, and everything under that attribute is time-sorted. This is the way we actually looked at the data in HBase and how our data was laid out, just expressed in DynamoDB terms.
So what we really looked at when we came to migrate is that we had three time-series schemes we could consider: what we call fully sorted, partially sorted, which we just saw, and scatter sorted. Let's see what these look like as a table structure.
And don't worry if these are too much to take in on this slide; we're gonna walk through each one with a detailed data example. We had to decide which one of these we wanted to implement and which was gonna be best for our application and our customers.
So the partially sorted scheme we looked at would be a lift and shift for us. You can see that we have the ability to access and filter targeted data perfectly. If we wanna be able to support the other 20% of the workload, the client has to read the data and re-sort it, or, in DynamoDB terms, add a GSI to get a fully sorted view.
Now, this is what it looks like from a query view. You can specify a single piece of data and you'll get one query result. If you want to do multiple queries, you also start encountering a high throughput risk: if I need to select 20, 100, or 1,000 items, I have to execute a query for each one against the same partition key.
So I'm increasing my risk of hot keys, and that's why we're calling this a throughput risk, and you have to execute all of those queries in order to get every piece of relevant information.
Then there's the fully sorted version, which is a little nicer: all of your data is just time-sorted. You throw away the ability to filter and select, and you just say, I'll read everything and then I can do client-side filtering if I need to.
What this looks like when you query is pretty straightforward: if I wanna query 10 items, 20 items, 1,000 items, it's just one query, one connection. However, there's still a throughput risk attached to it. If we start ramping up a lot of queries externally and we get a lot of queries for, say, ID one in this example, we could still get a very unfortunate hot-key incident.
And so that's still a throughput risk from our perspective. Then there's the scatter sorted scheme, which tries to distribute the keys by the partition and the attribute. Now we can still sub-select data, and we scatter it so that we reduce our throughput risk.
Let's see what it looks like when we query. We're essentially saying that for each key, no matter which piece you sub-select, you're going to a random partition underneath. So you're not repeatedly hitting the same key, and this is gonna reduce your throughput risk.
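To make the three schemes easier to compare, here's a hypothetical sketch of how the keys could be built for each one; the attribute names, key formats, and scatter factor are illustrative, not the actual Ads schema.

# Hypothetical key layouts for the three time-series schemes discussed above.
import random

SCATTER_BUCKETS = 8  # illustrative scatter factor

def partially_sorted_key(ad_id: str, attribute: str, ts: int) -> dict:
    # One partition per ID, items grouped by attribute and then time-sorted.
    # Targeted filters are easy, but every query for the same ID hits the
    # same partition key, and a fully time-sorted read needs a GSI.
    return {"pk": ad_id, "sk": f"{attribute}#{ts:013d}"}

def fully_sorted_key(ad_id: str, ts: int) -> dict:
    # One partition per ID, everything time-sorted. One query reads it all,
    # but a hot ID concentrates all reads on a single partition key.
    return {"pk": ad_id, "sk": f"{ts:013d}"}

def scatter_sorted_key(ad_id: str, attribute: str, ts: int) -> dict:
    # Spread one ID over several partitions to dilute hot keys; the cost is
    # query amplification, since a full read must hit every scatter bucket.
    bucket = random.randrange(SCATTER_BUCKETS)
    return {"pk": f"{ad_id}#{bucket}", "sk": f"{attribute}#{ts:013d}"}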
So this looks like it'll work; it'll solve 80% of the workload. But what about the other 20%? Well, we liked this, we selected it and we tested it, and the way we solved the other 20% is we added a GSI. DynamoDB supports this out of the box; I don't have to write any additional code, and it can replicate and pivot the data for me.
So with the GSI I have the other 20% of the workload, and I have the 80% on the primary table. So we should be done, right? Well, not quite: query amplification. This isn't a DynamoDB concern, this is an application concern. There's no DynamoDB throughput problem, but each of those scattered pieces is a separate query.
If I have to run 100 queries, that's OK; for a small workload, even 1,000 might be fine. But what if you've got to do 20,000, or 100,000? These are the things we started seeing when we pushed scale.
So what it really looks like from an application point of view is query amplification: for one request, you're running multiple queries, either in parallel or serially, to try and service that request. And at high scale for Amazon, it doesn't work.
And so we were back to the drawing board.
So what we ended up deciding is: let's use the one that has no query amplification but has a throughput risk, and let's find a design that removes the risk. This is a very simple application concept for us from a query amplification standpoint, one request, one query to DynamoDB, and then we just have to make sure that we don't get throttled, based on how we set up the rest of the data.
And so that brings us to our actual selection: fully sorted. It gives us a simpler query paradigm and allows us to service all of the requests, with one risk that we're gonna solve. And that's why we talk about throughput. We've selected a design that has a throughput risk, but can we solve it? The answer is, of course, yes.
So here are your default throughput constraints. DynamoDB will support 3,000 RCUs strongly consistent, or 6,000 eventually consistent, per partition. For our applications, we're working entirely in the eventually consistent space. A very simple way to increase that is to write more copies of your data: just writing three copies of this data would triple your throughput, assuming you can uniformly query across the copies. And because you're writing full copies, you're actually increasing both your strongly consistent and your eventually consistent throughput.
So now, rather than 3,000 and 6,000, we have 9,000 and 18,000. But we can get even fancier: we can use GSIs to create read-only replicas. Now, a read-only GSI replica won't increase your strongly consistent throughput, but it will increase your eventually consistent throughput, assuming you continue to distribute uniformly. So now we have 36,000 RCUs, and we're starting to push how much throughput I can get out of hitting the same key, in our case by just uniformly distributing across my copies.
Then we can go even further and do the same thing with a global table. The global table gives us an extra advantage: not only am I copying my data, I've also gained regional redundancy. If a region were to go down, I'd still have data available in another region and could access it. So you'd have reduced capacity but higher availability.
And we can get even more extreme and go to static replication plus global tables plus GSIs. Now, this is a bit of an odd example, but just to put it in perspective: the eventually consistent throughput on this, if you balance your queries properly, is 72,000 RCUs. That's a really big number, bigger than what we actually needed, but it shows it can be done. If you want to put that in data terms, at four kilobytes per RCU that's about 288 megabytes per second for a single partition key. That's a lot.
And so this is what we ended up using as our production setup: two static replicas and two GSI read replicas. This gives us 36,000 RCUs of eventually consistent throughput, or approximately 144 megabytes per second per partition key. Now, this is all theoretical; you won't actually hit the max because you'll also have other keys competing for the partition. But in theory, you have significantly reduced the likelihood of throttles from repeatedly hitting the same key.
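As a rough sketch of what "balancing your queries" can look like, here's hypothetical read-path code that spreads eventually consistent reads uniformly over the copies and their GSI read replicas; the table names, index name, key attribute, and copy count are assumptions for illustration, not the production configuration.

# Hypothetical sketch: fan eventually consistent reads out across static
# copies of the data and their GSI read replicas to dilute hot keys.
import random
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

COPIES = ["interactions-a", "interactions-b"]    # static replicas (illustrative)
READ_TARGETS = [None, "read-replica-gsi"]        # None = base table, else a GSI

def read_interactions(ad_id: str):
    # Pick a copy, then either its base table or a GSI replica, so repeated
    # reads of one hot ID spread across all available partitions.
    table = dynamodb.Table(random.choice(COPIES))
    index = random.choice(READ_TARGETS)
    kwargs = {"KeyConditionExpression": Key("pk").eq(ad_id)}
    if index is not None:
        kwargs["IndexName"] = index              # GSI reads are eventually consistent
    return table.query(**kwargs)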
So how did we decide this? It seems odd; how do you know this is right? We did A/B testing. We tried different configurations and setups, such as simply deactivating one of the GSIs: because we'd built the ability to uniformly distribute our queries, we could turn one off and test it. If it worked, great; if it didn't, try a different configuration. Trial and error basically showed us that this was the right cost profile and availability profile for us.
Now, there's a catch here. What I described on the previous slide looks kind of fixed, especially the static replication; that's one of those things you have to set up right away. So now we're gonna talk about table management: are there one-way-door decisions in that setup?
At Amazon, what we mean by a one-way door is an architecture that's inflexible, or where it's very hard to go back on a decision. And I think if any of you are developers, or SDMs, software development managers, the last thing you want to hear from an engineer is: well, it works, but I'm gonna have to go redesign it if we want it to work better. That's really not an answer you wanna hear. You want the architecture to be flexible, able to do more.
And so this is where things got interesting. How do we grow throughput? How do we shrink throughput? We showed that static replication gives us high throughput, but do I need it all the time? Do I need it for all the data?
If you look at our configuration, the simple way to add more throughput is that I could add a third replica. OK, how long does that take? Well, let's say you stored 100 days of data; it would take you 100 days of wall-clock time to accrue that additional throughput. That's not great. How else could I get it? I could write a tool that replicates all my data for me. OK, that's gonna cost money, because you've got to read all of your data out of your table and write it back. That's not great either.
So how can we do this through table management? Well, if we look at what our table set looks like now, we basically have a primary table and two GSI replicas. What if I use DynamoDB's existing features to just add a new global table replica on the fly and let it copy the data for me, rather than adding more static replicas? In essence, we're now flexible: we can grow our capacity, or we can add another GSI replica, and we've replicated the data in another direction.
We could also take that capacity away by deleting a replica. It looks great on paper. In this configuration you can have as many AWS regions and as many GSIs as you want, growing and shrinking your replication to grow throughput. Sounds great. But do I want to replicate four petabytes from one AWS region to another? How long will that take? It might be faster to drive a truck from one data center to the other to move four petabytes of data. So it's not really great from an actual service availability standpoint.
So if we really think about it, we sharded our table. What we're actually operating on, rather than a single table, is a set of mini tables. We've split the tables such that I only need to replicate the subset of the data, or the component, that actually needs the additional capacity. And the way we did this is a simple mod: the mod operator applied to your primary key tells you which shard to use.
Now you're thinking, well, how do you know you have enough shards? What we did is we just said: here's our data growth, here's the projected curve, and we're gonna make it large enough that we don't need to resize this thing in our lifetime or the next generation's lifetime. So we have a sufficiently large number of tables that we never need to resize it. That was our design philosophy to avoid that challenge, and I'd love to be wrong there, because that would mean my business is doing awesome, and then we'll go solve it when that problem arises.
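Here's a minimal sketch of that mod-based routing; the hash function, shard count, and table naming convention are illustrative stand-ins for whatever the real table set uses.

# Minimal sketch: route a primary key to one of a fixed set of shard tables.
import zlib

NUM_SHARDS = 256  # chosen large enough that it never needs to be resized

def shard_table_name(primary_key: str, prefix: str = "interactions") -> str:
    # A stable hash of the primary key, mod the shard count, picks the mini
    # table (shard) that holds this key; reads and writes both use it.
    shard = zlib.crc32(primary_key.encode("utf-8")) % NUM_SHARDS
    return f"{prefix}-{shard:03d}"

# Every access for the same key lands on the same shard table.
table_name = shard_table_name("ad-identifier-123")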
But what this has allowed us to do is pivot how we think about tables. We now look at tables as a table set, and we've built a utility called the table set manager. Really, each table is a shard and it can have any state: it could have a GSI, it could have no GSI, it could have global table replicas with no GSIs. Everything can be configured independently, at a smaller grain of data.
So the time it takes to create a replica is significantly reduced, and this gives us the flexibility to control our cost and scale at a finer grain than if we had done everything as a single table. What this has allowed us to do is operate in a pilot light mode: we have sufficient copies of our data that we can operate and sustain big throughput increases or failures.
And if we need to, we could evaluate moving to warm standby or active-active. Those options were never available to us on HBase, because we were hand-managing our clusters and capacity; we were significantly constrained by our own architecture and our understanding of it.
So our application's view has shifted. We have variable load, we have our application, but DynamoDB no longer looks like a single DynamoDB instance to us; it's now an elastic set of DynamoDB tables that we grow and shrink as needed to meet our demand and continue to grow and improve our business.
And this has given us really great results, so let's talk about them. We are hitting 80% capacity utilization, so very low waste, but more importantly, 0.008% throttles per 60 million RCUs. As Tom mentioned, we're using 90-million-plus RCUs at peak, so you can imagine that's a very low rate of throttles, which means very low disruption to our customers and the clients using our service.
But more importantly, better operational recovery. If you look at this slide, one of our customers was having a bad day: a software deployment went bad and our capacity utilization plummeted. If this had been on HBase, we would have been hand-managing and helping them recover through joint operational efforts. With DynamoDB, we didn't have to do anything; we were able to watch them recover and roll back their bad deployment, which is the sharp recovery you see, with no intervention from our side. That's a really good win for us from an operational standpoint.
So in conclusion, we have high availability and high throughput, and we're on a pilot light operating model, which is allowing us to have higher redundancy and better control over availability for our customers, and which also gives us the dynamic throughput we need to control our cost. Cost is pretty important, because at our scale a small change can have a significant financial impact.
So what's next for us? Where are we going? We're gonna evaluate moving from pilot light to warm standby and active-active as we continue to grow. As the importance of our workloads increases, downtime and failure become less tolerable, and we may enter a world where having five or ten minutes of downtime, or a blip in availability, is no longer acceptable. We need to be able to guarantee that we're always on and always available.
But more importantly, we're gonna double down on this design pattern. We've found multiple applications within our own org and across Ads that can use this same viewpoint and lens to reduce the operational load on their teams and better utilize DynamoDB.
And so I'd like to turn it back over to Jeff to give a summary of his talk.
Jeff: Thank you. If you take three things away from our discussion today, they should be these: you can't build a resilient application unless you also have a resilient database; DynamoDB offers you a very rich set of resilience features to let you build applications to whatever level of resilience you need; and finally, AWS underpins DynamoDB to help you build for resilience across your entire application, not just in Dynamo. I'd like to thank everybody for coming today. Your feedback is really important to us.