About this Course
If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought after, and mastering deep learning will give you numerous new career opportunities. Deep learning is also a new "superpower" that will let you build AI systems that just weren't possible a few years ago.
In this course, you will learn the foundations of deep learning. When you finish this class, you will:
- Understand the major technology trends driving Deep Learning
- Be able to build, train and apply fully connected deep neural networks
- Know how to implement efficient (vectorized) neural networks
- Understand the key parameters in a neural network's architecture
This course also teaches you how Deep Learning actually works, rather than presenting only a cursory or surface-level description. So after completing it, you will be able to apply deep learning to your own applications. If you are looking for a job in AI, after this course you will also be able to answer basic interview questions.
This is the first course of the Deep Learning Specialization.
Introduction to deep learning
Be able to explain the major trends driving the rise of deep learning, and understand where and how it is applied today.
Welcome - 5m
0:00
Hello and welcome. As you probably know, deep learning has already transformed traditional internet businesses like web search and advertising. But deep learning is also enabling brand new products and businesses and ways of helping people to be created. Everything ranging from better healthcare, where deep learning is getting really good at reading X-ray images, to delivering personalized education, to precision agriculture, to even self-driving cars and many others. If you want to learn the tools of deep learning and be able to apply them to build these amazing things, I want to help you get there. When you finish the sequence of courses on Coursera, called the specialization, you will be able to put deep learning onto your resume with confidence. Over the next decade, I think all of us have an opportunity to build an amazing world, an amazing society, that is AI-powered, and I hope that you will play a big role in the creation of this AI-powered society. So with that, let's get started.
I think that AI is the new electricity. Starting about 100 years ago, the electrification of our society transformed every major industry, everything ranging from transportation and manufacturing to healthcare, to communications and many more. And today, we see a surprisingly clear path for AI to bring about an equally big transformation. And of course, the part of AI that is rising rapidly and driving a lot of these developments is deep learning. So today, deep learning is one of the most highly sought-after skills in the technology world. And through this course and a few courses after this one, I want to help you to gain and master those skills.
So here's what you'll learn in this sequence of courses, also called a specialization on Coursera. In the first course, you'll learn about the foundations of neural networks; you'll learn about neural networks and deep learning. This video that you're watching is part of this first course, which lasts four weeks in total.
And each of the five courses in the specialization will be about two to four weeks, with most of them actually shorter than four weeks. But in this first course, you'll learn how to build a neural network, including a deep neural network, and how to train it on data. And at the end of this course, you'll be able to build a deep neural network to recognize, guess what? Cats. For some reason, there is a cat meme running around in deep learning. And so, following tradition, in this first course we'll build a cat recognizer.
Then in the second course, you'll learn about the practical aspects of deep learning. So you'll learn, now that you've built a neural network, how to actually get it to perform well. So you'll learn about hyperparameter tuning, regularization, how to diagnose bias and variance, and advanced optimization algorithms like momentum, RMSprop, and the Adam optimization algorithm. Sometimes it seems like there's a lot of tuning, even some black magic, in how you build a neural network. So the second course, which is just three weeks, will demystify some of that black magic.
In the third course, which is just two weeks, you'll learn how to structure your machine learning project. It turns out that the strategy for building a machine learning system has changed in the era of deep learning. So for example, the way you split your data into train, development (or dev, also called holdout cross-validation) sets and test sets has changed in the era of deep learning. So what are the new best practices for doing that? And what if your training set and your test set come from different distributions? That's happening a lot more in the era of deep learning. So how do you deal with that? And if you've heard of end-to-end deep learning, you'll also learn more about that in this third course and see when you should use it and maybe when you shouldn't. The material in this third course is relatively unique.
I'm going to share with you a lot of the hard-won lessons that I've learned building and shipping quite a lot of deep learning products. As far as I know, this is largely material that is not taught in most universities that have deep learning courses. But I really hope it helps you to get your deep learning systems to work well.
In the next course, we'll then talk about convolutional neural networks, often abbreviated CNNs. Convolutional networks, or convolutional neural networks, are often applied to images. So you'll learn how to build these models in course four. Finally, in course five, you'll learn sequence models and how to apply them to natural language processing and other problems. So sequence models include models like recurrent neural networks, abbreviated RNNs, and LSTM models, which stands for long short-term memory models. You'll learn what these terms mean in course five and be able to apply them to natural language processing problems. So you'll learn these models in course five and be able to apply them to sequence data. So for example, natural language is just a sequence of words, and you'll also understand how these models can be applied to speech recognition, or to music generation, and other problems.
So through these courses, you'll learn the tools of deep learning, you'll be able to apply them to build amazing things, and I hope many of you through this will also be able to advance your career. So with that, let's get started. Please go on to the next video, where we'll talk about deep learning applied to supervised learning.
What is a neural network? - 7m
0:01
The term, Deep Learning, refers to training Neural Networks, sometimes very large Neural Networks. So what exactly is a Neural Network? In this video, let's try to give you some of the basic intuitions.
0:12
Let's start with a housing price prediction example. Let's say you have a data set with six houses, so you know the size of the houses in square feet or square meters and you know the price of each house, and you want to fit a function to predict the price of a house as a function of its size. So if you are familiar with linear regression you might say, well, let's fit a straight line to this data, and we get a straight line like that. But to be a bit fancier, you might say, well, we know that prices can never be negative, right? So instead of the straight line fit, which eventually will become negative, let's bend the curve here. So it just ends up zero here. So this thick blue line ends up being your function for predicting the price of the house as a function of its size. It is zero here, and then there's a straight line fit to the right.
1:04
So you can think of this function that you've just fit to housing prices as a very simple neural network. It's almost the simplest possible neural network. Let me draw it here.
1:17
We have as the input to the neural network the size of a house, which we call x. It goes into this node, this little circle, and then it outputs the price, which we call y. So this little circle, which is a single neuron in a neural network, implements this function that we drew on the left.
1:43
And all the neuron does is it inputs the size, computes this linear function, takes a max of zero, and then outputs the estimated price.
1:53
And by the way, in the neural network literature, you see this function a lot. This function which is zero for a while and then takes off as a straight line. This function is called a ReLU function, which stands for rectified linear unit. So R-E-L-U. And rectify just means taking a max of 0, which is why you get a function shape like this.
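As a rough illustration of the single neuron just described (a linear function of the size, followed by a max with zero), here is a minimal Python sketch. The names `relu` and `predict_price` and the weight and bias values are assumptions made up for this example; they are not taken from the lecture.

```python
def relu(z):
    """Rectified linear unit: returns max(0, z)."""
    return max(0.0, z)


def predict_price(size, weight, bias):
    """Single neuron: linear function of the house size, then ReLU."""
    return relu(weight * size + bias)


# Hypothetical parameters, chosen only so the line crosses zero at size 400.
WEIGHT = 0.25
BIAS = -100.0

for size in (250, 400, 1000, 2000):
    # Small houses land on the flat part (price 0.0); larger ones on the line.
    print(size, predict_price(size, WEIGHT, BIAS))
```

With these made-up parameters, sizes below 400 fall on the flat part of the curve and larger sizes fall on the straight-line part, which is the bent shape drawn in the video.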
2:23
You don't need to worry about ReLU units for now, but it's just something you'll see again later in this course. So if this is a single-neuron neural network, really a tiny little neural network, a larger neural network is then formed by taking many of the single neurons and stacking them together. So, if you think of this neuron as being like a single Lego brick, you then get a bigger neural network by stacking together many of these Lego bricks. Let's see an example.
2:57
Let's say that instead of predicting the price of a house just from the size, you now have other features. You know other things about the house, such as the number of bedrooms, I should have written [INAUDIBLE] bedrooms, and you might think that one of the things that really affects the price of a house is family size, right? So can this house fit your family of three, or family of four, or family of five? And it's really the size in square feet or square meters and the number of bedrooms that determine whether or not a house can fit your family's size. And then maybe you know the zip code; in different countries it's called the postal code of a house. And the zip code maybe, as a feature, tells you walkability. So is this neighborhood highly walkable? Can you just walk to the grocery store? Walk to school? Do you need to drive? And some people prefer highly walkable neighborhoods. And then the zip code as well as the wealth maybe tell you, certainly in the United States but in some other countries as well, how good the school quality is. So each of these little circles I'm drawing can be one of those ReLUs, rectified linear units, or some other slightly nonlinear function. So that based on the size and number of bedrooms, you can estimate the family size; based on the zip code, you can estimate walkability; and based on zip code and wealth, you can estimate the school quality. And then finally you might think that the way people decide how much they're willing to pay for a house is they look at the things that really matter to them. In this case family size, walkability, and school quality, and that helps you predict the price.
4:46
So in the example x is all of these four inputs.
4:53
And y is the price you're trying to predict.
4:57
And so by stacking together a few of the single neurons, or the simple predictors we have from the previous slide, we now have a slightly larger neural network. Part of the magic of a neural network is that when you implement it, you need to give it just the input x and the output y for a number of examples in your training set, and all these things in the middle, it will figure out by itself.
5:25
So what you actually implement is this. Here, you have a neural network with four inputs. So the input features might be the size, number of bedrooms, the zip code or postal code, and the wealth of the neighborhood. And so given these input features, the job of the neural network will be to predict the price y. And notice also that each of these circles, these are called hidden units in the neural network, takes as its inputs all four input features. So for example, rather than saying this first node represents family size and family size depends only on the features X1 and X2, instead we're going to say, well, neural network, you decide whatever you want this node to be. And we'll give you all four of the features to compute whatever you want. So we say that this layer, the input layer, and this layer in the middle of the neural network are densely connected, because every input feature is connected to every one of these circles in the middle. And the remarkable thing about neural networks is that, given enough data about x and y, given enough training examples with both x and y, neural networks are remarkably good at figuring out functions that accurately map from x to y.
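To make the idea of densely connected layers concrete, here is a minimal Python sketch of a forward pass through a network with four inputs, three hidden units, and one output. Everything specific in it is an assumption for illustration: the function names `dense_layer` and `predict_price`, the use of ReLU on the output, the numeric encoding of the features, and all weight and bias values. In practice the weights would be learned from training examples rather than written by hand.

```python
def relu(z):
    """Rectified linear unit: returns max(0, z)."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation):
    """Densely connected layer: every unit receives every input."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def predict_price(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: four inputs -> three hidden units -> one price."""
    hidden = dense_layer(x, hidden_w, hidden_b, relu)
    return dense_layer(hidden, [out_w], [out_b], relu)[0]


# Hypothetical, already-numeric features: size, bedrooms, zip code, wealth.
x = [2.0, 3.0, 1.0, 4.0]
# Made-up weights: one row per hidden unit, one column per input feature.
hidden_w = [
    [1.0, 0.5, -1.0, 0.0],
    [0.0, 1.0, 1.0, -0.5],
    [-1.0, 0.0, 0.5, 1.0],
]
hidden_b = [0.0, -1.0, 0.5]
out_w = [1.0, 2.0, 0.5]
out_b = 1.0

print(dense_layer(x, hidden_w, hidden_b, relu))
print(predict_price(x, hidden_w, hidden_b, out_w, out_b))
```

Note that no hidden unit is told to mean "family size" or "walkability". Each one simply receives all four features, which mirrors the point made in the video.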
6:48
So, that's a basic neural network. It turns out that as you build out your own neural networks, you'll probably find them to be most useful, most powerful, in supervised learning settings, meaning that you're trying to take an input x and map it to some output y, like we just saw in the housing price prediction example. In the next video, let's go over some more examples of supervised learning and some examples of where you might find neural networks to be incredibly helpful for your applications as well.
Supervised Learning with Neural Networks - 8m
0:03
There's been a lot of hype about neural networks. And perhaps some of that hype is justified, given how well they're working. But it turns out that so far, almost all the economic value created by neural networks has been through one type of machine learning, called supervised learning. Let's see what that means, and let's go over some examples. In supervised learning, you have some input x, and you want to learn a function mapping to some output y. So for example, just now we saw the housing price prediction application where you input some features of a home and try to output or estimate the price y. Here are some other examples that neural networks have been applied to very effectively. Possibly the single most lucrative application of deep learning today is online advertising, maybe not the most inspiring, but certainly very lucrative, in which, by inputting information about an ad that the website is thinking of showing you, and some information about the user, neural networks have gotten very good at predicting whether or not you will click on an ad. And by showing users the ads that they are most likely to click on, this has been an incredibly lucrative application of neural networks at multiple companies. Because the ability to show you ads that you're more likely to click on has a direct impact on the bottom line of some of the very large online advertising companies.
1:30
Computer vision has also made huge strides in the last several years, mostly due to deep learning. So you might input an image and want to output an index, say from 1 to 1,000, trying to tell you which one of, say, 1,000 different image categories this picture might be. So, you might use that for photo tagging. I think the recent progress in speech recognition has also been very exciting, where you can now input an audio clip to a neural network, and have it output a text transcript. Machine translation has also made huge strides thanks to deep learning, where now you can have a neural network input an English sentence and directly output, say, a Chinese sentence. And in autonomous driving, you might input an image, say a picture of what's in front of your car, as well as some information from a radar, and based on that, maybe a neural network can be trained to tell you the position of the other cars on the road. So this becomes a key component in autonomous driving systems. So a lot of the value creation through neural networks has been through cleverly selecting what should be x and what should be y for your particular problem, and then fitting this supervised learning component into often a bigger system such as an autonomous vehicle. It turns out that slightly different types of neural networks are useful for different applications. For example, in the real estate application that we saw in the previous video, we used a universally standard neural network architecture, right? Maybe for real estate and online advertising, it might be a relatively standard neural network, like the one that we saw.
3:13
For image applications we'll often use convolutional neural networks, often abbreviated CNN.
3:21
And for sequence data. So for example, audio has a temporal component, right? Audio is played out over time, so audio is most naturally represented as a one-dimensional time series or as a one-dimensional temporal sequence. And so for sequence data, you often use an RNN, a recurrent neural network. In language, English and Chinese, the letters or the words come one at a time. So language is also most naturally represented as sequence data. And so more complex versions of RNNs are often used for these applications. And then, for more complex applications, like autonomous driving, where you have an image, that might suggest more of a CNN, convolutional neural network, structure, and radar info, which is something quite different, you might end up with a more custom, or some more complex, hybrid neural network architecture.
4:20
So, just to be a bit more concrete about what the standard CNN and RNN architectures are. So in the literature you might have seen pictures like this. So that's a standard neural net. You might have seen pictures like this. Well, this is an example of a Convolutional Neural Network, and we'll see in a later course exactly what this picture means and how you can implement this. But convolutional networks are often used for image data. And you might also have seen pictures like this. And you'll learn how to implement this in a later course. Recurrent neural networks are very good for this type of one-dimensional sequence data that has maybe a temporal component. You might also have heard about applications of machine learning to both Structured Data and Unstructured Data. Here's what the terms mean. Structured Data means basically databases of data.
5:19
So, for example, in housing price prediction, you might have a database with columns that tell you the size and the number of bedrooms. So, this is structured data. Or in predicting whether or not a user will click on an ad, you might have information about the user, such as the age, some information about the ad, and then the label y that you're trying to predict. So that's structured data, meaning that each of the features, such as size of the house, the number of bedrooms, or the age of a user, has a very well defined meaning. In contrast, unstructured data refers to things like audio, raw audio, or images where you might want to recognize what's in the image, or text. Here the features might be the pixel values in an image or the individual words in a piece of text. Historically, it has been much harder for computers to make sense of unstructured data compared to structured data. In fact, the human race has evolved to be very good at understanding audio cues as well as images. And then text was a more recent invention, but people are just really good at interpreting unstructured data. And so one of the most exciting things about the rise of neural networks is that, thanks to deep learning, thanks to neural networks, computers are now much better at interpreting unstructured data as well, compared to just a few years ago. And this creates opportunities for many new exciting applications that use speech recognition, image recognition, natural language processing on text,
6:56
much more than was possible even just two or three years ago. I think because people have a natural empathy for understanding unstructured data, you might hear about neural network successes on unstructured data more in the media, because it's just cool when the neural network recognizes a cat. We all like that, and we all know what that means. But it turns out that a lot of the short-term economic value that neural networks are creating has also been on structured data, such as much better advertising systems, much better product recommendations, and just a much better ability to process the giant databases that many companies have to make accurate predictions from them. So in this course, a lot of the techniques we'll go over will apply to both structured data and to unstructured data. For the purposes of explaining the algorithms, we will draw a little bit more on examples that use unstructured data. But as you think through applications of neural networks within your own team, I hope you find uses for them in both structured and unstructured data.
8:02
So neural networks have transformed supervised learning and are creating tremendous economic value. It turns out though, that the basic technical ideas behind neural networks have mostly been around, sometimes for many decades. So why is it, then, that they're only just now taking off and working so well? In the next video, we'll talk about why it's only quite recently that neural networks have become this incredibly powerful tool that you can use.
Why is Deep Learning taking off? - 10m
0:00
If the basic technical ideas behind deep learning, behind neural networks, have been around for decades, why are they only just now taking off? In this video, let's go over some of the main drivers behind the rise of deep learning, because I think this will help you to spot the best opportunities within your own organization to apply these tools. Over the last few years, a lot of people have asked me, "Andrew, why is deep learning suddenly working so well?" And when I'm asked that question, this is usually the picture I draw for them.
Let's say we plot a figure where on the horizontal axis we plot the amount of data we have for a task, and on the vertical axis we plot the performance of our learning algorithm, such as the accuracy of our spam classifier or our ad click predictor, or the accuracy of our neural net for figuring out the position of other cars for our self-driving car. It turns out that if you plot the performance of a traditional learning algorithm, like a support vector machine or logistic regression, as a function of the amount of data you have, you might get a curve that looks like this, where the performance improves for a while as you add more data, but after a while the performance pretty much plateaus, right? It's supposed to be a horizontal line; I didn't draw that very well. It was as if they didn't know what to do with huge amounts of data.
And what happened in our society over the last 10 years maybe is that for a lot of problems we went from having a relatively small amount of data to having often a fairly large amount of data. And all of this was thanks to the digitization of society, where so much human activity is now in the digital realm. We spend so much time on computers, on websites, on mobile apps, and activities on digital devices create data. And thanks to the rise of inexpensive cameras built into our cell phones, accelerometers, and all sorts of sensors in the Internet of Things, we have also just been collecting more and more data. So over the last 20 years, for a lot of applications, we just accumulated a lot more data, more than traditional learning algorithms were able to effectively take advantage of.
And with neural networks, it turns out that if you train a small neural net, then its performance maybe looks like that. If you train a somewhat larger neural net, let's call it a medium-sized neural net, the performance is often a little bit better. And if you train a very large neural net, then its performance often just keeps getting better and better.
So a couple of observations. One is, if you want to hit this very high level of performance, then you need two things. First, often you need to be able to train a big enough neural network in order to take advantage of the huge amount of data. And second, you need to be out here on the x-axis; you do need a lot of data. So we often say that scale has been driving deep learning progress, and by scale I mean both the size of the neural network, meaning just a neural network with a lot of hidden units, a lot of parameters, a lot of connections, as well as the scale of the data. In fact, today one of the most reliable ways to get better performance in a neural network is often to either train a bigger network or throw more data at it. And that only works up to a point, because eventually you run out of data, or eventually your network is so big that it takes too long to train. But just improving scale has actually taken us a long way in the world of deep learning.
In order to make this diagram a bit more technically precise, let me just add a few more things. I wrote the amount of data on the x-axis. Technically, this is the amount of labeled data, where by labeled data I mean training examples for which we have both the input x and the label y. I want to introduce a little bit of notation that we'll use later in this course. We're going to use lowercase m to denote the size of the training set, or the number of training examples, this lowercase m. So that's the horizontal axis.
A couple of other details about this figure. In this regime of smaller training sets, the relative ordering of the algorithms is actually not very well defined. So if you don't have a lot of training data, it is often your skill at hand-engineering features that determines the performance. So it's quite possible that if someone training an SVM is more motivated to hand-engineer features than someone training an even larger neural net, then maybe in this small training set regime the SVM could do better. So in this region to the left of the figure, the relative ordering between the algorithms is not that well defined, and performance depends much more on your skill at hand-engineering features and other lower-level details of the algorithms. And it is only in this big data regime, the very large training set, very large m regime on the right, that we more consistently see large neural nets dominating the other approaches. And so if any of your friends ask you why neural nets are taking off, I would encourage you to draw this picture for them as well.
So I will say that in the early days of the modern rise of deep learning, it was scale of data and scale of computation, just our ability to train very large neural networks either on a CPU or GPU, that enabled us to make a lot of progress. But increasingly, especially in the last several years, we've seen tremendous algorithmic innovation as well, so I also don't want to understate that. Interestingly, many of the algorithmic innovations have been about trying to make neural networks run much faster. So as a concrete example, one of the huge breakthroughs in neural networks has been switching from a sigmoid function, which looks like this, to a ReLU function, which we talked about briefly in an earlier video, that looks like this. If you don't understand the details of what I'm about to say, don't worry about it. But it turns out that one of the problems of using sigmoid functions in machine learning is that there are these regions here where the slope of the function, the gradient, is nearly zero, and so learning becomes really slow, because when you implement gradient descent and the gradient is close to zero, the parameters just change very slowly. Whereas by changing what's called the activation function of the neural network to use this function called the ReLU function, or the rectified linear unit, R-E-L-U, the gradient is equal to one for all positive values of the input, right? And so the gradient is much less likely to gradually shrink to zero. And the gradient here, the slope of this line, is zero on the left. But it turns out that just switching from the sigmoid function to the ReLU function has made an algorithm called gradient descent work much faster. And so this is an example of a maybe relatively simple algorithmic innovation, but ultimately the impact of this algorithmic innovation was that it really helped computation. So there are actually quite a lot of examples like this where we change the algorithm because it allows the code to run much faster, and this allows us to train bigger neural networks, or to do so within a reasonable amount of time, even when we have a large network and a lot of data.
The other reason that fast computation is important is that it turns out the process of training a neural network is very iterative. Often you have an idea for a neural network architecture, and so you implement your idea in code. Implementing your idea then lets you run an experiment, which tells you how well your neural network does, and then by looking at it you go back to change the details of your neural network, and then you go around this circle over and over. And when your neural network takes a long time to train, it just takes a long time to go around this cycle. And there's a huge difference in your productivity building effective neural networks when you can have an idea and try it and see if it works in ten minutes, or maybe at most a day, versus if you have to train your neural network for a month, which sometimes does happen. Because when you get a result back in ten minutes or maybe in a day, you can just try a lot more ideas and be much more likely to discover a neural network that works well for your application. And so faster computation has really helped in terms of speeding up the rate at which you can get an experimental result back, and this has really helped both practitioners of neural networks as well as researchers working in deep learning iterate much faster and improve their ideas much faster. And so all this has also been a huge boon to the entire deep learning research community, which has been incredible at just inventing new algorithms and making nonstop progress on that front.
So these are some of the forces powering the rise of deep learning, but the good news is that these forces are still working powerfully to make deep learning even better. Take data: society is still throwing off more and more digital data. Or take computation: with the rise of specialized hardware like GPUs, faster networking, and many types of hardware, I'm actually quite confident that our ability to build very large neural networks from a computation point of view will keep on getting better. And take algorithms: the deep learning research community has been continuously phenomenal at innovating on the algorithms front. So because of this, I think that we can be optimistic, and I am optimistic, that deep learning will keep on getting better for many years to come. So with that, let's go on to the last video of the section, where we'll talk a little bit more about what you'll learn from this course.
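As a footnote to the sigmoid versus ReLU comparison in this video, the following Python sketch compares the slope (gradient) of the two activation functions at a few input values. The function names and the sample inputs are assumptions chosen for illustration; the formulas themselves are the standard definitions of the sigmoid and ReLU derivatives.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_gradient(z):
    """Slope of the sigmoid at z: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_gradient(z):
    """Slope of ReLU at z: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Hypothetical sample inputs, from far left to far right on the curve.
for z in (-10.0, -2.0, 0.5, 2.0, 10.0):
    print(z, sigmoid_gradient(z), relu_gradient(z))
```

The sigmoid slope is largest at zero (0.25) and becomes tiny for inputs far from zero in either direction, while the ReLU slope stays at one for every positive input. That is the property the video credits with making gradient descent run faster.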
About this Course - 2m
0:00
So you're just about to reach the end of the first week of material on the first course in this specialization. Let me give you a quick sense of what you'll learn in the next few weeks as well. As I said in the first video, this specialization comprises five courses. And right now, we're in the first of these five courses, which teaches you the most important foundations, really the most important building blocks of deep learning. So by the end of this first course, you'll know how to build and get to work a deep neural network. So here are the details of what is in this first course. This course is four weeks of material. And you're just coming up to the end of the first week, when you saw an introduction to deep learning. At the end of each week, there will also be 10 multiple-choice questions that you can use to double check your understanding of the material. So when you're done watching this video, I hope you're going to take a look at those questions. In the second week, you'll then learn about the Basics of Neural Network Programming. You'll learn the structure of what we call the forward propagation and the back propagation steps of the algorithm and how to implement neural networks efficiently. Starting from the second week, you'll also get to do a programming exercise that lets you practice the material you've just learned, implement the algorithms yourself and see it work for yourself. I find it really satisfying when I learn about an algorithm and get it coded up and see it work for myself. So I hope you enjoy that too. Having learned the framework for neural network programming, in the third week you'll code up a single hidden layer neural network. All right. So you'll learn about all the key concepts needed to implement and get to work a neural network. And then finally in week four, you'll build a deep neural network, a neural network with many layers, and see it work for yourself. So, congratulations on finishing the videos up to this one.
I hope that you now have a good high-level sense of what's happening in deep learning. And perhaps some of you also have some ideas of where you might want to apply deep learning yourself. So, I hope that after this video, you go on to take a look at the 10 multiple choice questions that follow this video on the course website and just use them to check your understanding. And don't worry if you don't get all the answers right the first time; you can try again and again until you get them all right. I found them useful to make sure that I'm understanding all the concepts, and I hope you find them that way too. So with that, congrats again for getting up to here and I look forward to seeing you in the week two videos.
Course Resources - 1m
0:00
I hope you enjoy this course, and to help you complete it I want to make sure that there are a few course resources that you know about. First, if you have any questions, or you want to discuss anything with your classmates or the teaching staff, including me, or if you want to file a bug report, the best place to do that is the discussion forum. The teaching staff and I will be monitoring that regularly, and this is also a good place for you to get answers to your questions from your classmates or, if you wish, to try to answer your classmates' questions. To get to the discussion forum from this course home page, look at the menu bar on the left. Yours might look a bit different than mine, but there'll be this discussion forum tab which, if you click on it, takes you to the discussion forum. The best way to ask questions is on the discussion forum, but if for some reason you need to contact us directly or let us know about some problem, feel free also to email us at this email address. I promise we will read every email and we'll try to address commonly occurring issues, although depending on the email volume I can't guarantee that we'll be able to reply promptly to every email. But we will read every email that you send. Next, I know that there are companies that wish to train maybe large numbers of employees in deep learning. If you're responsible for employee training in your company and would like to train a hundred or more employees with deep learning expertise, please feel free to get in touch at this email and we'll see if we can help you. We're just in the early phases of developing the university academic program, but if you're a university instructor or a university administrator interested in offering a deep learning course at your university, please feel free to contact us at this email address. So I hope that gives you more resources to complete the course. Maybe I'll see some of you in the discussion forums, and best of luck.
(Optional) Heroes of Deep Learning - Geoffrey Hinton interview - 40m
0:00
As part of this course by deeplearning.ai, I hope to not just teach you the technical ideas in deep learning, but also introduce you to some of the people, some of the heroes in deep learning. The people that invented so many of these ideas that you learn about in this course or in this specialization. In these videos, I hope to also ask these leaders of deep learning to give you career advice for how you can break into deep learning, for how you can do research or find a job in deep learning. As the first of this interview series, I am delighted to present to you an interview with Geoffrey Hinton.
0:38
Welcome Geoff, and thank you for doing this interview with deeplearning.ai. >> Thank you for inviting me. >> I think that at this point you more than anyone else on this planet have invented so many of the ideas behind deep learning. And a lot of people have been calling you the godfather of deep learning. Although it wasn't until we were chatting a few minutes ago that I realized you think I'm the first one to call you that, which I'm quite happy to have done.
1:06
But what I want to ask is, many people know you as a legend, I want to ask about your personal story behind the legend. So how did you get involved in, going way back, how did you get involved in AI and machine learning and neural networks?
1:22
So when I was at high school, I had a classmate who was always better than me at everything, he was a brilliant mathematician. And he came into school one day and said, did you know the brain uses holograms?
1:38
And I guess that was about 1966, and I said, sort of what's a hologram? And he explained that in a hologram you can chop off half of it, and you still get the whole picture. And that memories in the brain might be distributed over the whole brain. And so I guess he'd read about Lashley's experiments, where you chop off bits of a rat's brain and discover that it's very hard to find one bit where it stores one particular memory.
2:04
So that's what first got me interested in how does the brain store memories.
2:10
And then when I went to university, I started off studying physiology and physics.
2:16
I think when I was at Cambridge, I was the only undergraduate doing physiology and physics.
2:21
And then I gave up on that and tried to do philosophy, because I thought that might give me more insight. But that seemed to me actually lacking in ways of distinguishing when they said something false. And so then I switched to psychology.
2:41
And in psychology they had very, very simple theories, and it seemed to me it was sort of hopelessly inadequate to explaining what the brain was doing. So then I took some time off and became a carpenter. And then I decided that I'd try AI, and went off to Edinburgh, to study AI with Longuet-Higgins. And he had done very nice work on neural networks, and he'd just given up on neural networks, and been very impressed by Winograd's thesis. So when I arrived he thought I was kind of doing this old fashioned stuff, and I ought to start on symbolic AI. And we had a lot of fights about that, but I just kept on doing what I believed in. >> And then what? >> I eventually got a PhD in AI, and then I couldn't get a job in Britain. But I saw this very nice advertisement for Sloan Fellowships in California, and I managed to get one of those. And I went to California, and everything was different there. So in Britain, neural nets were regarded as kind of silly, and in California, Don Norman and David Rumelhart were very open to ideas about neural nets. It was the first time I'd been somewhere where thinking about how the brain works, and thinking about how that might relate to psychology, was seen as a very positive thing. And it was a lot of fun there, in particular collaborating with David Rumelhart was great. >> I see, great. So this was when you were at UCSD, and you and Rumelhart around what, 1982, wound up writing the seminal backprop paper, right? >> Actually, it was more complicated than that. >> What happened? >> In, I think, early 1982, David Rumelhart and me, and Ron Williams, between us developed the backprop algorithm, it was mainly David Rumelhart's idea. We discovered later that many other people had invented it. David Parker had invented it, probably after us, but before we'd published. Paul Werbos had published it already quite a few years earlier, but nobody paid it much attention.
And there were other people who'd developed very similar algorithms, it's not clear what's meant by backprop. But using the chain rule to get derivatives was not a novel idea. >> I see, why do you think it was your paper that helped so much the community latch on to backprop? It feels like your paper marked an inflection in the acceptance of this algorithm, whoever accepted it. >> So we managed to get a paper into Nature in 1986. And I did quite a lot of political work to get the paper accepted. I figured out that one of the referees was probably going to be Stuart Sutherland, who was a well known psychologist in Britain. And I went to talk to him for a long time, and explained to him exactly what was going on. And he was very impressed by the fact that we showed that backprop could learn representations for words. And you could look at those representations, which are little vectors, and you could understand the meaning of the individual features. So we actually trained it on little triples of words about family trees, like Mary has mother Victoria. And you'd give it the first two words, and it would have to predict the last word. And after you trained it, you could see all sorts of features in the representations of the individual words. Like the nationality of the person there, what generation they were, which branch of the family tree they were in, and so on. That was what made Stuart Sutherland really impressed with it, and I think that's why the paper got accepted. >> Very early word embeddings, and you're already seeing learned features of semantic meanings emerge from the training algorithm. >> Yes, so from a psychologist's point of view, what was interesting was it unified two completely different strands of ideas about what knowledge was like. So there was the old psychologist's view that a concept is just a big bundle of features, and there's lots of evidence for that. And then there was the AI view of the time, which is a formal structuralist view.
Which was that a concept is how it relates to other concepts. And to capture a concept, you'd have to do something like a graph structure or maybe a semantic net. And what this back propagation example showed was, you could give it the information that would go into a graph structure, or in this case a family tree.
7:22
And it could convert that information into features in such a way that it could then use the features to derive new consistent information, i.e. generalize. But the crucial thing was this to and fro between the graphical representation or the tree structured representation of the family tree, and a representation of the people as big feature vectors. And the fact that from the graph-like representation you could get feature vectors. And from the feature vectors, you could get more of the graph-like representation. >> So this is 1986? >> In the early 90s, Bengio showed that you can actually take real data, you could take English text, and apply the same techniques there, and get embeddings for real words from English text, and that impressed people a lot. >> I guess recently we've been talking a lot about how fast computers like GPUs and supercomputers are driving deep learning. I didn't realize that back between 1986 and the early 90's, it sounds like between you and Bengio there was already the beginnings of this trend.
8:30
Yes, it was a huge advance. In 1986, I was using a Lisp machine which was less than a tenth of a megaflop. And by about 1993 or thereabouts, people were seeing ten megaflops. >> I see. >> So there was a factor of 100, and that's the point at which it was easy to use, because computers were just getting faster. >> Over the past several decades, you've invented so many pieces of neural networks and deep learning. I'm actually curious, of all of the things you've invented, which are the ones you're still most excited about today?
9:06
So I think the most beautiful one is the work I did with Terry Sejnowski on Boltzmann machines. So we discovered there was this really, really simple learning algorithm that applied to great big densely connected nets where you could only see a few of the nodes. So it would learn hidden representations and it was a very simple algorithm. And it looked like the kind of thing you should be able to get in a brain because each synapse only needed to know about the behavior of the two neurons it was directly connected to.
9:37
And the information that was propagated was the same. There were two different phases, which we called wake and sleep. But in the two different phases, you're propagating information in just the same way. Where as in something like back propagation, there's a forward pass and a backward pass, and they work differently. They're sending different kinds of signals.
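The locality described here can be made concrete. What follows is a minimal sketch, assuming the textbook form of the Boltzmann machine rule (the interview does not spell it out): the change to a weight is proportional to the difference between how often its two units are on together in one phase and in the other. The function name and the toy states are hypothetical, and sampling the states themselves is not shown.

```python
import numpy as np

def boltzmann_weight_update(wake_states, sleep_states, learning_rate=0.1):
    """Return a weight change from binary unit states sampled in the two phases.

    Each argument has shape (num_samples, num_units). Entry (i, j) of the
    result depends only on how often units i and j are on together, so each
    synapse needs only the activity of the two neurons it connects.
    """
    wake_states = np.asarray(wake_states, dtype=float)
    sleep_states = np.asarray(sleep_states, dtype=float)
    wake_corr = wake_states.T @ wake_states / len(wake_states)
    sleep_corr = sleep_states.T @ sleep_states / len(sleep_states)
    delta = learning_rate * (wake_corr - sleep_corr)
    np.fill_diagonal(delta, 0.0)  # no self-connections
    return delta

# Toy statistics (made up): units 0 and 1 fire together in the wake phase only.
wake = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]])
sleep = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]])
delta_w = boltzmann_weight_update(wake, sleep)
```

In this toy case the weight between units 0 and 1 is increased, because they are on together in the wake samples and never in the sleep samples.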
9:58
So I think that's the most beautiful thing. And for many years it looked just like a curiosity, because it looked like it was much too slow.
10:06
But then later on, I got rid of a little bit of the beauty, and instead of letting it settle down, just used one iteration, in a somewhat simpler net. And that gave restricted Boltzmann machines, which actually worked effectively in practice. So in the Netflix competition, for example, restricted Boltzmann machines were one of the ingredients of the winning entry. >> And in fact, a lot of the recent resurgence of neural nets and deep learning, starting about 2007, was the restricted Boltzmann machine, and deep restricted Boltzmann machine work that you and your lab did.
10:38
Yes, so that's another of the pieces of work I'm very happy with, the idea that you could train your restricted Boltzmann machine, which just had one layer of hidden features, and you could learn one layer of features. And then you could treat those features as data and do it again, and then you could treat the new features you learned as data and do it again, as many times as you liked. So that was nice, it worked in practice. And then Yee Whye Teh realized that the whole thing could be treated as a single model, but it was a weird kind of model. It was a model where at the top you had a restricted Boltzmann machine, but below that you had a sigmoid belief net, which was something that was invented many years earlier. So it was a directed model, and what we'd managed to come up with by training these restricted Boltzmann machines was an efficient way of doing inference in sigmoid belief nets.
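The layer-by-layer procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the original work: it trains a binary restricted Boltzmann machine with a single reconstruction step (often called CD-1), then treats the hidden activation probabilities as data for a second machine. The function names, layer sizes, learning rate, and toy data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=200, lr=0.1, seed=0):
    """Train a binary RBM using one reconstruction step per update."""
    rng = np.random.default_rng(seed)
    num_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    b_vis = np.zeros(num_visible)
    b_hid = np.zeros(num_hidden)
    n = len(data)
    for _ in range(epochs):
        # Hidden units driven by the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # One reconstruction step instead of letting the net settle.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # Difference of pairwise statistics: data-driven minus reconstruction.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid

def hidden_features(data, W, b_hid):
    """Hidden activation probabilities, used as 'data' for the next layer."""
    return sigmoid(data @ W + b_hid)

# Toy binary data: two patterns repeated ten times each.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

# Layer 1 learns features of the data; layer 2 treats those features as data.
W1, b_vis1, b_hid1 = train_rbm(data, num_hidden=4)
features1 = hidden_features(data, W1, b_hid1)
W2, b_vis2, b_hid2 = train_rbm(features1, num_hidden=2, seed=1)
features2 = hidden_features(features1, W2, b_hid2)
```

Feeding probabilities rather than sampled binary states into the second layer is a simplifying choice made for this sketch.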
11:33
So, around that time, there were people doing neural nets, who would use densely connected nets, but didn't have any good ways of doing probabilistic inference in them. And you had people doing graphical models, like Mike Jordan, who could do inference properly, but only in sparsely connected nets. And what we managed to show was the way of learning these deep belief nets so that there's an approximate form of inference that's very fast, it just happens in a single forward pass, and that was a very beautiful result. And you could guarantee that each time you learn that extra layer of features
12:16
there was a bound, each time you learned a new layer, you got a new bound, and the new bound was always better than the old bound. >> The variational bound, showing as you add layers. Yes, I remember that video. >> So that was the second thing that I was really excited about. And I guess the third thing was the work I did with [INAUDIBLE] on variational methods. It turns out people in statistics had done similar work earlier, but we didn't know about that.
12:44
So we managed to make EM work a whole lot better by showing you didn't need to do a perfect E step. You could do an approximate E step. And EM was a big algorithm in statistics. And we'd showed a big generalization of it. And in particular, in 1993, I guess, with Van Camp, I did a paper, I think the first variational Bayes paper, where we showed that you could actually do a version of Bayesian learning that was far more tractable, by approximating the true posterior with a [INAUDIBLE]. And you could do that in a neural net. And I was very excited by that. >> I see. Wow, right. Yep, I think I remember all of these papers. The Neal and Hinton approximate EM paper, I spent many hours reading over that. And I think some of the algorithms you use today, or some of the algorithms that lots of people use almost every day, are what, things like dropout, or I guess ReLU activations, came from your group? >> Yes and no. So other people have thought about rectified linear units. And we actually did some work with restricted Boltzmann machines showing that a ReLU was almost exactly equivalent to a whole stack of logistic units. And that's one of the things that helped ReLUs catch on. >> I was really curious about that. The ReLU paper had a lot of math showing that this function can be approximated with this really complicated formula. Did you do that math so your paper would get accepted into an academic conference, or did all that math really influence the development of max of 0 and x?
14:26
That was one of the cases where actually the math was important to the development of the idea. So I knew about rectified linear units, obviously, and I knew about logistic units. And because of the work on Boltzmann machines, all of the basic work was done using logistic units. And so the question was, could the learning algorithm work in something with rectified linear units? And by showing the rectified linear units were almost exactly equivalent to a stack of logistic units, we showed that all the math would go through. >> I see. And it provided the inspiration for today, tons of people use ReLU and it just works without- >> Yeah. >> Without necessarily needing to understand the same motivation.
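The equivalence mentioned here can be checked numerically. The sketch below assumes one commonly cited construction, which the interview does not state: many copies of a logistic unit share the same input, with biases offset by 0.5, 1.5, 2.5 and so on, and their outputs are summed. That sum tracks the smooth function log(1 + e^x), which in turn is close to max(0, x) away from zero. The function names and the number of units are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stacked_logistic(x, num_units=50):
    """Sum of logistic units that share an input but have shifted biases."""
    offsets = np.arange(1, num_units + 1) - 0.5   # 0.5, 1.5, 2.5, ...
    return sigmoid(x[:, None] - offsets[None, :]).sum(axis=1)

def softplus(x):
    return np.log1p(np.exp(x))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-5.0, 10.0, 31)
gap_to_softplus = np.max(np.abs(stacked_logistic(x) - softplus(x)))
gap_to_relu = np.max(np.abs(stacked_logistic(x) - relu(x)))
```

On this grid the stack stays very close to the softplus curve, and its largest gap to max(0, x) is near x = 0, where the softplus and the ReLU themselves differ most.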
15:13
Yeah, one thing I noticed later when I went to Google. I guess in 2014, I gave a talk at Google about using ReLUs and initializing with the identity matrix, because the nice thing about ReLUs is that if you keep replicating the hidden layers and you initialize with the identity, it just copies the pattern in the layer below.
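The copying behaviour is easy to verify. This is a small sketch under assumptions of my own (no biases, a non-negative input pattern, and an arbitrary depth and width): with identity weight matrices, each ReLU layer passes the pattern from the layer below through unchanged.

```python
import numpy as np

def forward(x, weights):
    """Apply a stack of bias-free ReLU layers."""
    for W in weights:
        x = np.maximum(0.0, x @ W)
    return x

num_layers, width = 50, 8          # hypothetical sizes
identity_weights = [np.eye(width) for _ in range(num_layers)]

rng = np.random.default_rng(0)
pattern = rng.random((1, width))   # non-negative, like the output of a ReLU layer
output = forward(pattern, identity_weights)
```

Because the pattern survives every layer intact at initialization, the signal neither shrinks nor blows up with depth before training starts.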
15:36
And so I was showing that you could train networks with 300 hidden layers and you could train them really efficiently if you initialize with the identity. But I didn't pursue that any further and I really regret not pursuing that. We published one paper showing you could initialize recurrent nets like that. But I should have pursued it further, because later on these residual networks are really that kind of thing. >> Over the years I've heard you talk a lot about the brain. I've heard you talk about the relationship between backprop and the brain. What are your current thoughts on that? >> I'm actually working on a paper on that right now.
16:18
I guess my main thought is this. If it turns out that backprop is a really good algorithm for doing learning,
16:26
then for sure evolution could've figured out how to implement it.
16:32
I mean you have cells that could turn into either eyeballs or teeth. Now, if cells can do that, they can for sure implement backpropagation, and presumably there's huge selective pressure for it. So I think the neuroscientists' idea that it doesn't look plausible is just silly. There may be some subtle implementation of it. And I think the brain probably has something that may not exactly be backpropagation, but it's quite close to it. And over the years, I've come up with a number of ideas about how this might work. So in 1987, working with Jay McClelland, I came up with the recirculation algorithm, where the idea is you send information round a loop.
17:17
And you try to make it so that things don't change as information goes around this loop. So the simplest version would be you have input units and hidden units, and you send information from the input to the hidden and then back to the input, and then back to the hidden and then back to the input and so on. And what you want, you want to train an autoencoder, but you want to train it without having to do backpropagation. So you just train it to try and get rid of all variation in the activities. So the idea is that the learning rule for a synapse is: change the weight in proportion to the presynaptic input and in proportion to the rate of change of the postsynaptic input. But in recirculation, you're trying to make the old postsynaptic input be good and the new one be bad, so you're changing in that direction.
18:11
We invented this algorithm before neuroscientists came up with spike-timing-dependent plasticity. Spike-timing-dependent plasticity is actually the same algorithm but the other way round, where the new thing is good and the old thing is bad in the learning rule. So you're changing the weight in proportion to the presynaptic activity times the new postsynaptic activity minus the old one.
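The two learning rules contrasted above differ only in sign. The sketch below captures just that sign convention as described in the interview. It is not the full published recirculation algorithm, and the function names, learning rate, and toy activities are hypothetical.

```python
import numpy as np

def recirculation_update(pre, post_old, post_new, lr=0.1):
    """Weight change: input-side activity times (old - new) output-side activity."""
    return lr * np.outer(pre, post_old - post_new)

def stdp_like_update(pre, post_old, post_new, lr=0.1):
    """Same form the other way round: input-side activity times (new - old)."""
    return lr * np.outer(pre, post_new - post_old)

pre = np.array([1.0, 0.0, 0.5])      # activity on the input side of each synapse
post_old = np.array([0.2, 0.8])      # output-side activity on the earlier pass
post_new = np.array([0.6, 0.5])      # output-side activity on the later pass

dw_recirc = recirculation_update(pre, post_old, post_new)
dw_stdp = stdp_like_update(pre, post_old, post_new)
```

Both updates are local: each entry uses only the activity on either side of one connection, and a silent input (the second entry of `pre`) changes nothing.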
18:37
Later on I realized, in 2007, that if you took a stack of restricted Boltzmann machines and you trained it up, then after it was trained, you had exactly the right conditions for implementing backpropagation by just trying to reconstruct. If you looked at the reconstruction error, that reconstruction error would actually tell you the derivative of the discriminative performance. And at the first deep learning workshop in 2007, I gave a talk about that. That was almost completely ignored. Later on, Yoshua Bengio took up the idea and has actually done quite a lot more work on that. And I've been doing more work on it myself. And I think this idea that if you have a stack of autoencoders, then you can get derivatives by sending activity backwards and looking at reconstruction errors, is a really interesting idea and may well be how the brain does it. >> One other topic that I know you think about and that I hear you're still working on is how to deal with multiple time scales in deep learning. So, can you share your thoughts on that? >> Yes, so actually, that goes back to my first years as a graduate student. The first talk I ever gave was about using what I called fast weights. So weights that adapt rapidly, but decay rapidly. And therefore can hold short term memory. And I showed in a very simple system in 1973 that you could do true recursion with those weights. And what I mean by true recursion is that the neurons that are used in representing things get re-used for representing things in the recursive call.
20:30
And the weights that are used for actual knowledge get re-used in the recursive call. And so that leads to the question of, when you pop out of your recursive call, how do you remember what it was you were in the middle of doing? Where's that memory? Because you used the neurons for the recursive call.
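As a rough illustration of fast weights as described in this part of the interview (weights that adapt rapidly but decay rapidly, and so can hold a short-term memory), here is a minimal sketch. It assumes one simple formulation, a decaying outer-product memory over +1/-1 activity patterns. That formulation is not claimed to be the 1973 model or the later published one, and the class name, decay, and rate values are hypothetical.

```python
import numpy as np

class FastWeights:
    """Short-term memory held in a rapidly decaying weight matrix."""

    def __init__(self, size, decay=0.9, rate=0.5):
        self.A = np.zeros((size, size))
        self.decay = decay
        self.rate = rate

    def step(self, activity=None):
        """Decay the fast weights, then optionally store an activity pattern."""
        self.A *= self.decay
        if activity is not None:
            self.A += self.rate * np.outer(activity, activity)

    def recall(self, cue):
        """Recover a stored +1/-1 pattern from a partial cue."""
        return np.sign(self.A @ cue)

memory = FastWeights(size=6)
state = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])

memory.step(state)   # store the pattern before the neurons are reused
memory.step()        # time passes while the neurons do something else
memory.step()

cue = state.copy()
cue[3:] = 0.0        # only half of the original pattern is available
recovered = memory.recall(cue)
```

Here `recovered` matches `state` even though the stored trace has decayed and only half of the pattern was supplied as a cue. With many more decay steps, or several overlapping patterns, recall would degrade.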
20:46
And the answer is you can put that memory into fast weights, and you can recover the activity states of the neurons from those fast weights. And more recently working with Jimmy Ba, we actually got a paper in [INAUDIBLE] by using fast weights for recursion like that. >> I see. >> So that was quite a big gap. The first model was unpublished in 1973 and then Jimmy Ba's model was in 2015, I think, or 2016. So it's about 40 years later. >> And, I guess, one other idea you've been working on for quite a few years now, over five years I think, is capsules. Where are you with that? >> Okay, so I'm back to the state I'm used to being in. Which is I have this idea I really believe in and nobody else believes it. And I submit papers about it and they get rejected. But I really believe in this idea and I'm just going to keep pushing it. So it hinges on, there's a couple of key ideas. One is about how you represent multi-dimensional entities, and you can represent multi-dimensional entities by just a little vector of activities, as long as you know there's only one of them. So the idea is in each region of the image, you'll assume there's at most one of a particular kind of feature.
22:15
And then you'll use a bunch of neurons, and their activities will represent the different aspects to that feature,
22:24
like within that region exactly what are its x and y coordinates? What orientation is it at? How fast is it moving? What color is it? How bright is it? And stuff like that. So you can use a whole bunch of neurons to represent different dimensions of the same thing. Provided there's only one of them.
22:40
That's a very different way of doing representation from what we're normally used to in neural nets. Normally in neural nets, we just have a great big layer, and all the units go off and do whatever they do. But you don't think of bundling them up into little groups that represent different coordinates of the same thing.
22:58
So I think we should build in this extra structure. And then the other idea that goes with that. >> So this means in the distributed representation, you partition the representation. >> Yes. >> To different subsets. >> Yes. >> To represent, right, rather than- >> I call each of those subsets a capsule. >> I see. >> And the idea is a capsule is able to represent an instance of a feature, but only one. And it represents all the different properties of that feature. It's a feature that has a lot of properties, as opposed to a normal neuron in a normal neural net, which has just one scalar property. >> Yeah, I see yep. >> And then what you can do if you've got that, is you can do something that normal neural nets are very bad at, which is you can do what I call routing by agreement. So let's suppose you want to do segmentation and you have something that might be a mouth and something else that might be a nose.
23:57
And you want to know if you should put them together to make one thing. So the idea is you'd have a capsule for a mouth that has the parameters of the mouth. And you have a capsule for a nose that has the parameters of the nose. And then to decide whether to put them together or not, you get each of them to vote for what the parameters should be for a face.
24:19
Now if the mouth and the nose are in the right spatial relationship, they will agree. So when you get two capsules at one level voting for the same set of parameters at the next level up, you can assume they're probably right, because agreement in a high dimensional space is very unlikely.
24:36
And that's a very different way of doing filtering than what we normally use in neural nets. So I think this routing by agreement is going to be crucial for getting neural nets to generalize much better from limited data. I think it'd be very good at dealing with changes in viewpoint, very good at doing segmentation. And I'm hoping it will be much more statistically efficient than what we currently do in neural nets. Which is, if you want to deal with changes in viewpoint, you just give it a whole bunch of changes in viewpoint and train on them all. >> I see, right, so rather than feedforward supervised learning, you can learn this in some different way.
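The voting step in routing by agreement can be illustrated with a toy example. This is a deliberately simplified sketch, not the method being described: each part has only an (x, y) position, each part's vote for the face position is that position plus a fixed offset, and two votes "agree" when they land within a tolerance. The offsets, tolerance, and function name are hypothetical. A fuller version would use learned transformations over richer pose parameters.

```python
import numpy as np

# Hypothetical fixed offsets from each part to the centre of the face (x, y).
MOUTH_TO_FACE = np.array([0.0, 3.0])
NOSE_TO_FACE = np.array([0.0, 1.0])

def votes_agree(mouth_pose, nose_pose, tolerance=0.5):
    """Each part votes for the face position; accept if the votes nearly coincide."""
    mouth_vote = np.asarray(mouth_pose, dtype=float) + MOUTH_TO_FACE
    nose_vote = np.asarray(nose_pose, dtype=float) + NOSE_TO_FACE
    distance = float(np.linalg.norm(mouth_vote - nose_vote))
    return distance <= tolerance, distance

# Mouth directly below the nose at the expected spacing: the votes coincide.
consistent, d_consistent = votes_agree(mouth_pose=[10.0, 4.0], nose_pose=[10.0, 6.0])

# Mouth far off to the side: the votes land in different places.
inconsistent, d_inconsistent = votes_agree(mouth_pose=[20.0, 4.0], nose_pose=[10.0, 6.0])
```

With only two dimensions a chance agreement is not that unlikely. The argument in the interview is that with many pose dimensions, coincidental agreement becomes very improbable, so agreement is strong evidence that the parts belong together.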
25:20
Well, I still plan to do it with supervised learning, but the mechanics of the forward pass are very different. It's not a pure forward pass in the sense that there's little bits of iteration going on, where you think you found a mouth and you think you found a nose. And use a little bit of iteration to decide whether they should really go together to make a face. And you can backprop through that iteration. So you can try and do it a little discriminatively, and we're working on that now at my group in Toronto. So I now have a little Google team in Toronto, part of the Brain team. That's what I'm excited about right now. >> I see, great, yeah. Look forward to that paper when that comes out. >> Yeah, if it comes out [LAUGH]. >> You worked in deep learning for several decades. I'm actually really curious, how has your thinking, your understanding of AI changed over these years?
26:20
So I guess a lot of my intellectual history has been around back propagation, and how to use back propagation, how to make use of its power. So to begin with, in the mid 80s, we were using it for discriminative learning and it was working well. I then decided, by the early 90s, that actually most human learning was going to be unsupervised learning. And I got much more interested in unsupervised learning, and that's when I worked on things like the wake-sleep algorithm. >> And your comments at that time really influenced my thinking as well. So when I was leading Google Brain, our first project spent a lot of work in unsupervised learning because of your influence. >> Right, and I may have misled you. Because in the long run, I think unsupervised learning is going to be absolutely crucial.
27:15
But you have to sort of face reality. And what's worked over the last ten years or so is supervised learning. Discriminative training, where you have labels, or you're trying to predict the next thing in the series, so that acts as the label. And that's worked incredibly well.
27:37
I still believe that unsupervised learning is going to be crucial, and things will work incredibly much better than they do now when we get that working properly, but we haven't yet.
27:49
Yeah, I think many of the senior people in deep learning, including myself, remain very excited about it. It's just none of us really have almost any idea how to do it yet. Maybe you do, I don't feel like I do. >> Variational autoencoders, where you use the reparameterization trick, seemed to me like a really nice idea. And generative adversarial nets also seemed to me to be a really nice idea. I think generative adversarial nets are one of the sort of biggest ideas in deep learning that's really new. I'm hoping I can make capsules that successful, but right now generative adversarial nets, I think, have been a big breakthrough. >> What happened to sparsity and slow features, which were two of the other principles for building unsupervised models?
28:41
I was never as big on sparsity as you were, buddy. But slow features, I think, is a mistake. You shouldn't say slow. The basic idea is right, but you shouldn't go for features that don't change, you should go for features that change in predictable ways.
29:01
So here's a sort of basic principle about how you model anything.
29:08
You take your measurements, and you're applying nonlinear transformations to your measurements until you get to a representation as a state vector in which the action is linear. So you don't just pretend it's linear like you do with Kalman filters. But you actually find a transformation from the observables to the underlying variables where linear operations, like matrix multiplies on the underlying variables, will do the work. So for example, if you want to change viewpoints. If you want to produce the image from another viewpoint, what you should do is go from the pixels to coordinates.
29:47
And once you've got to the coordinate representation, which is the kind of thing I'm hoping capsules will find, you can then do a matrix multiply to change viewpoint, and then you can map it back to pixels. >> Right, that's why you did all that. >> I think that's a very, very general principle. >> That's why you did all that work on face synthesis, right? Where you take a face and compress it to a very low dimensional vector, and so you can fiddle with that and get back other faces. >> I had a student who worked on that, I didn't do much work on that myself.
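The viewpoint principle described above, where a change of viewpoint becomes a single matrix multiply once you are working with coordinates, can be shown in a few lines. This is a minimal 2-D sketch with made-up landmark coordinates. The difficult steps, going from pixels to coordinates and back, are not shown.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix for a viewpoint change by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Coordinates of three landmarks, one per row (hypothetical values).
coords = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

# Changing viewpoint is one matrix multiply on the coordinates.
new_view = coords @ rotation(np.pi / 2).T

# The inverse viewpoint change is another matrix multiply.
back = new_view @ rotation(-np.pi / 2).T
```

A quarter turn sends the landmark at (1, 0) to (0, 1), and applying the inverse rotation recovers the original coordinates.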
30:17
Now I'm sure you still get asked all the time, if someone wants to break into deep learning, what should they do? So what advice would you have? I'm sure you've given a lot of advice to people in one on one settings, but for the global audience of people watching this video, what advice would you have for them to get into deep learning? >> Okay, so my advice is sort of read the literature, but don't read too much of it. So this is advice I got from my advisor, which is very unlike what most people say. Most people say you should spend several years reading the literature and then you should start working on your own ideas. And that may be true for some researchers, but for creative researchers I think what you want to do is read a little bit of the literature. And notice something that you think everybody is doing wrong, I'm contrarian in that sense. You look at it and it just doesn't feel right. And then figure out how to do it right.
31:16
And then when people tell you, that's no good, just keep at it. And I have a very good principle for helping people keep at it, which is either your intuitions are good or they're not. If your intuitions are good, you should follow them and you'll eventually be successful. If your intuitions are not good, it doesn't matter what you do. >> I see [LAUGH]. Inspiring advice, might as well go for it. >> You might as well trust your intuitions. There's no point not trusting them. >> I see, yeah. I usually advise people to not just read, but replicate published papers. And maybe that puts a natural limiter on how many you could do, because replicating results is pretty time consuming.
32:01
Yes, it's true that when you're trying to replicate a published paper you discover all the little tricks necessary to make it work. The other advice I have is, never stop programming. Because if you give a student something to do, if they're botching it, they'll come back and say, it didn't work. And the reason it didn't work would be some little decision they made, that they didn't realize is crucial. And if you give it to a good student, like for example. You can give him anything and he'll come back and say, it worked.
32:32
I remember doing this once, and I said, but wait a minute. Since we last talked, I realized it couldn't possibly work for the following reason. And said, yeah, I realized that right away, so I assumed you didn't mean that. >> [LAUGH] I see, yeah, that's great, yeah. Let's see, any other advice for people that want to break into AI and deep learning? >> I think that's basically, read enough so you start developing intuitions. And then, trust your intuitions and go for it, don't be too worried if everybody else says it's nonsense. >> And I guess there's no way to know if others are right or wrong when they say it's nonsense, but you just have to go for it, and then find out. >> Right, but there is one thing, which is, if you think it's a really good idea, and other people tell you it's complete nonsense, then you know you're really on to something. So one example of that is when and I first came up with variational methods.
33:35
I sent mail explaining it to a former student of mine called Peter Brown, who knew a lot about.
33:43
And he showed it to people who worked with him, called the brothers, they were twins, I think. And he then told me later what they said, and they said, either this guy's drunk, or he's just stupid, so they really, really thought it was nonsense. Now, it could have been partly the way I explained it, because I explained it in intuitive terms.
34:09
But when you have what you think is a good idea and other people think is complete rubbish, that's the sign of a really good idea.
34:18
I see, and research topics, new grad students should work on capsules and maybe unsupervised learning, any other? >> One good piece of advice for new grad students is, see if you can find an advisor who has beliefs similar to yours. Because if you work on stuff that your advisor feels deeply about, you'll get a lot of good advice and time from your advisor. If you work on stuff your advisor's not interested in, all you'll get is, you get some advice, but it won't be nearly so useful. >> I see, and last one on advice for learners, how do you feel about people entering a PhD program? Versus joining a top company, or a top research group? >> Yeah, it's complicated, I think right now, what's happening is, there aren't enough academics trained in deep learning to educate all the people that we need educated in universities. There just isn't the faculty bandwidth there, but I think that's going to be temporary. I think what's happened is, most departments have been very slow to understand the kind of revolution that's going on. I kind of agree with you, that it's not quite a second industrial revolution, but it's something on nearly that scale. And there's a huge sea change going on, basically because our relationship to computers has changed. Instead of programming them, we now show them, and they figure it out. That's a completely different way of using computers, and computer science departments are built around the idea of programming computers. And they don't understand that sort of,
36:05
this showing computers is going to be as big as programming computers. Except they don't understand that half the people in the department should be people who get computers to do things by showing them. So my department refuses to acknowledge that it should have lots and lots of people doing this. They think they got a couple, maybe a few more, but not too many.
36:31
And in that situation, you have to rely on the big companies to do quite a lot of the training. So Google is now training people, we call them Brain residents, and I suspect the universities will eventually catch up. >> I see, right, in fact, maybe a lot of students have figured this out. In a lot of top 50 programs, over half of the applicants actually want to work on showing, rather than programming. Yeah, cool, yeah, in fact, to give credit where it's due, whereas deeplearning.ai is creating a deep learning specialization, as far as I know, the first deep learning MOOC was actually yours, taught on Coursera, back in 2012, as well.
37:12
And somewhat strangely, that's when you first published the RMS algorithm, which also is a rough.
37:20
Right, yes, well, as you know, that was because you invited me to do the MOOC. And then when I was very dubious about doing it, you kept pushing me to do it, so it was very good that I did, although it was a lot of work. >> Yes, and thank you for doing that, I remember you complaining to me, how much work it was. And you staying out late at night, but I think many, many learners have benefited from your first MOOC, so I'm very grateful to you for it, so. >> That's good, yeah >> Yeah, over the years, I've seen you embroiled in debates about paradigms for AI, and whether there's been a paradigm shift for AI. What are your, can you share your thoughts on that? >> Yes, happily, so I think that in the early days, back in the 50s, people like von Neumann and didn't believe in symbolic AI, they were far more inspired by the brain. Unfortunately, they both died much too young, and their voice wasn't heard. And in the early days of AI, people were completely convinced that the representations you need for intelligence were symbolic expressions of some kind. Sort of cleaned up logic, where you could do non-monotonic things, and not quite logic, but something like logic, and that the essence of intelligence was reasoning. What's happened now is, there's a completely different view, which is that what a thought is, is just a great big vector of neural activity, so contrast that with a thought being a symbolic expression. And I think the people who thought that thoughts were symbolic expressions just made a huge mistake.
39:01
What comes in is a string of words, and what comes out is a string of words.
39:08
And because of that, strings of words are the obvious way to represent things. So they thought what must be in between was a string of words, or something like a string of words. And I think what's in between is nothing like a string of words. I think the idea that thoughts must be in some kind of language is as silly as the idea that understanding the layout of a spatial scene must be in pixels, pixels come in. And if we could, if we had a dot matrix printer attached to us, then pixels would come out, but what's in between isn't pixels.
39:43
And so I think thoughts are just these great big vectors, and that big vectors have causal powers. They cause other big vectors, and that's utterly unlike the standard AI view that thoughts are symbolic expressions. >> I see, good,
39:57
I guess AI is certainly coming round to this new point of view these days. >> Some of it, I think a lot of people in AI still think thoughts have to be symbolic expressions. >> Thank you very much for doing this interview. It was fascinating to hear how deep learning has evolved over the years, as well as how you're still helping drive it into the future, so thank you, Jeff. >> Well, thank you for giving me this opportunity. >> Thank you.
Neural Networks Basics
Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your models.
Binary Classification - 8m
0:00
Hello, and welcome back. This week we're going to go over the basics of neural network programming. It turns out that when you implement a neural network there are some techniques that are going to be really important. For example, if you have a training set of m training examples, you might be used to processing the training set by having a for loop step through your m training examples. But it turns out that when you're implementing a neural network, you usually want to process your entire training set without using an explicit for loop over it. So, you'll see how to do that in this week's materials. Another idea: when you organize the computation of a neural network, usually you have what's called a forward pass or forward propagation step, followed by a backward pass or what's called a backward propagation step. And so in this week's materials, you also get an introduction to why the computations in learning a neural network can be organized into this forward propagation and a separate backward propagation.
1:09
For this week's materials I want to convey these ideas using logistic regression in order to make the ideas easier to understand. But even if you've seen logistic regression before, I think that there'll be some new and interesting ideas for you to pick up in this week's materials. So with that, let's get started. Logistic regression is an algorithm for binary classification. So let's start by setting up the problem. Here's an example of a binary classification problem. You might have an input of an image, like that, and want to output a label to recognize this image as either being a cat, in which case you output 1, or not-cat in which case you output 0, and we're going to use y to denote the output label. Let's look at how an image is represented in a computer. To store an image your computer stores three separate matrices corresponding to the red, green, and blue color channels of this image.
2:10
So if your input image is 64 pixels by 64 pixels, then you would have three 64 by 64 matrices corresponding to the red, green and blue pixel intensity values for your image. Although to make this little slide I drew these as much smaller matrices, so these are actually 5 by 4 matrices rather than 64 by 64. So to turn these pixel intensity values into a feature vector, what we're going to do is unroll all of these pixel values into an input feature vector x, defined as follows. We're just going to take all the pixel values 255, 231, and so on until we've listed all the red pixels. And then eventually 255, 134, and so on until we get a long feature vector listing out all the red, green and blue pixel intensity values of this image. If this image is a 64 by 64 image, the total dimension of this vector x will be 64 by 64 by 3, because that's the total number of values we have in all of these matrices. Which in this case turns out to be 12,288, that's what you get if you multiply all those numbers. And so we're going to use nx = 12,288 to represent the dimension of the input features x. And sometimes for brevity, I will also just use lowercase n to represent the dimension of this input feature vector. So in binary classification, our goal is to learn a classifier that can input an image represented by this feature vector x, and predict whether the corresponding label y is 1 or 0, that is, whether this is a cat image or a non-cat image. Let's now lay out some of the notation that we'll use throughout the rest of this course. A single training example is represented by a pair (x, y), where x is an nx-dimensional feature vector and y, the label, is either 0 or 1. Your training set will comprise lowercase m training examples.
And so your training set will be written (x(1), y(1)), which is the input and output for your first training example, (x(2), y(2)) for the second training example, up to (x(m), y(m)), which is your last training example. And then that altogether is your entire training set. So I'm going to use lowercase m to denote the number of training examples. And sometimes to emphasize that this is the number of training examples, I might write this as m = m_train. And when we talk about a test set, we might sometimes use m subscript test to denote the number of test examples. Finally, to put all of the training examples into a more compact notation, we're going to define a matrix, capital X, by taking your training set inputs x(1), x(2) and so on and stacking them in columns. So we take x(1) and put that as the first column of this matrix, x(2) as the second column, and so on down to x(m); then this is the matrix capital X. So this matrix X will have m columns, where m is the number of training examples, and the number of rows, or the height of this matrix, is nx. Notice that in other courses, you might see the matrix capital X defined by stacking up the training examples in rows like so, x(1) transpose down to x(m) transpose. It turns out that when you're implementing neural networks, using the convention I have on the left will make the implementation much easier. So just to recap, X is an nx by m dimensional matrix, and when you implement this in Python, you see that X.shape, that's the Python command for finding the shape of the matrix, is (nx, m). That just means it is an nx by m dimensional matrix. So that's how you group the training example inputs x into a matrix. How about the output labels y? It turns out that to make your implementation of a neural network easier, it would be convenient to also stack the y's in columns. So we're going to define capital Y to be equal to y(1), y(2), up to y(m), like so.
So Y here will be a 1 by m dimensional matrix. And again, to use the Python notation, Y.shape will be (1, m), which just means this is a 1 by m matrix. And as you implement your neural network later in this course, you'll find that a useful convention is to take the data associated with different training examples, and by data I mean either x or y, or other quantities you'll see later, and to stack them in different columns, like we've done here for both x and y.
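The column-stacking convention described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the array names `images` and `labels`, the random pixel values, the made-up labels and the choice m = 5 are all hypothetical and are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                                  # number of training examples (illustrative)
height, width, channels = 64, 64, 3

# Hypothetical stand-in for real data: m RGB images with pixel values in [0, 255].
images = rng.integers(0, 256, size=(m, height, width, channels))
labels = np.array([1, 0, 0, 1, 1])     # 1 = cat, 0 = non-cat (made-up labels)

# Move the colour channel in front of the pixel grid so that each unrolled vector
# lists all red values, then all green values, then all blue values.
# Then put each example in its own column.
X = images.transpose(0, 3, 1, 2).reshape(m, -1).T   # shape (nx, m)
Y = labels.reshape(1, m)                            # shape (1, m)

nx = X.shape[0]
print(X.shape)   # (12288, 5)
print(Y.shape)   # (1, 5)
```

Each column of `X` is one unrolled image, so `X[:, 0]` is the feature vector for the first training example and `Y[0, 0]` is its label.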
7:58
So, that's the notation we'll use for logistic regression and for neural networks later in this course. If you ever forget what a piece of notation means, like what is m or what is n or what is something else, we've also posted on the course website a notation guide that you can use to quickly look up what any particular piece of notation means. So with that, let's go on to the next video, where we'll start to flesh out logistic regression using this notation.
Logistic Regression - 5m
0:00
In this video, we'll go over logistic regression. This is a learning algorithm that you use when the output labels Y in a supervised learning problem are all either zero or one, so for binary classification problems. Given an input feature vector X, maybe corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, which we'll call Y hat, which is your estimate of Y. More formally, you want Y hat to be the probability that Y is equal to one given the input features X. So in other words, if X is a picture, as we saw in the last video, you want Y hat to tell you, what is the chance that this is a cat picture? So X, as we said in the previous video, is an NX-dimensional vector, and the parameters of logistic regression will be W, which is also an NX-dimensional vector, together with B, which is just a real number. So given an input X and the parameters W and B, how do we generate the output Y hat? Well, one thing you could try, that doesn't work, would be to have Y hat be W transpose X plus B, a linear function of the input X. And in fact, this is what you would use if you were doing linear regression. But this isn't a very good algorithm for binary classification, because you want Y hat to be the chance that Y is equal to one. So Y hat should really be between zero and one, and it's difficult to enforce that because W transpose X plus B can be much bigger than one or it can even be negative, which doesn't make sense for a probability that you want to be between zero and one. So in logistic regression, our output is instead going to be Y hat equals the sigmoid function applied to this quantity. This is what the sigmoid function looks like. If on the horizontal axis I plot Z, then the function sigmoid of Z looks like this. So it goes smoothly from zero up to one. Let me label my axes here: this is zero, and it crosses the vertical axis at 0.5.
So this is what sigmoid of Z looks like. And we're going to use Z to denote this quantity, W transpose X plus B. Here's the formula for the sigmoid function. Sigmoid of Z, where Z is a real number, is one over one plus E to the negative Z. So notice a couple of things. If Z is very large, then E to the negative Z will be close to zero. So then sigmoid of Z will be approximately one over one plus something very close to zero, because E to the negative of a very large number will be close to zero. So this is close to one. And indeed, if you look at the plot on the left, if Z is very large then sigmoid of Z is very close to one. Conversely, if Z is very small, that is, a very large negative number, then sigmoid of Z becomes one over one plus E to the negative Z, and E to the negative Z becomes a huge number. So think of it as one over one plus a number that is very, very big, and so that's close to zero. And indeed, you see that as Z becomes a very large negative number, sigmoid of Z goes very close to zero. So when you implement logistic regression, your job is to try to learn parameters W and B so that Y hat becomes a good estimate of the chance of Y being equal to one. Before moving on, just another note on the notation. When we program neural networks, we'll usually keep the parameter W and parameter B separate, where here B corresponds to an intercept term. In some other courses, you might have seen a notation that handles this differently. In some conventions you define an extra feature called X0 that equals one, so that now X is in R of NX plus one. And then you define Y hat to be equal to sigma of theta transpose X. In this alternative notational convention, you have vector parameters theta: theta zero, theta one, theta two, down to theta NX. And so theta zero plays the role of B, that's just a real number, and theta one down to theta NX play the role of W.
It turns out, when you implement your neural network, it will be easier to just keep B and W as separate parameters. And so, in this class, we will not use the notational convention that I just wrote in red. If you've not seen this notation before in other courses, don't worry about it. It's just that for those of you that have seen this notation, I wanted to mention explicitly that we're not using it in this course. But if you've not seen this before, it's not important and you don't need to worry about it. So you have now seen what the logistic regression model looks like. Next, to train the parameters W and B, you need to define a cost function. Let's do that in the next video.
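As a rough sketch of the model just described, the sigmoid function and the prediction Y hat = sigmoid(W transpose X + B) might be written as follows in NumPy. The function name `predict_proba` and the small numeric example are assumptions made for illustration; W and B are kept as separate parameters, matching the convention chosen in this course.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """Return y-hat = sigmoid(w^T x + b) for every column of X.

    w has shape (nx, 1), b is a real number, X has shape (nx, m).
    The result has shape (1, m): one prediction per training example.
    """
    z = np.dot(w.T, X) + b
    return sigmoid(z)

# Behaviour described in the lecture: exactly 0.5 at z = 0,
# close to 1 for large z, close to 0 for very negative z.
print(sigmoid(0.0))
print(sigmoid(10.0))
print(sigmoid(-10.0))

# Tiny made-up example with nx = 2 features and m = 3 examples.
w = np.array([[1.0], [-2.0]])
b = 0.5
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
y_hat = predict_proba(w, b, X)
print(y_hat.shape)
```

Because `X` holds one example per column, a single matrix product produces the predictions for all m examples at once, with no explicit for loop over the training set.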
Logistic Regression Cost Function - 8m
0:00
In a previous video, you saw the logistic regression model. To train the parameters W and B of the logistic regression model, you need to define a cost function. Let's take a look at the cost function you can use to train logistic regression. To recap, this is what we had defined on the previous slide. So your output y-hat is sigmoid of W transpose X plus B, where sigmoid of Z is as defined here. So to learn parameters for your model, you're given a training set of m training examples, and it seems natural that you want to find parameters W and B so that, at least on the training set, the predictions you have, which we write as y-hat (i), will be close to the ground truth labels y (i) that you got in the training set. So to throw in a little bit more detail for the equation on top, we had said that y-hat is as defined at the top for a training example x, and of course for each training example, we're using these superscripts with round brackets, with parentheses, to index and to differentiate examples. Your prediction on training example (i), which is y-hat (i), is going to be obtained by taking the sigmoid function and applying it to W transpose x (i), the input of that training example, plus B, and you can also define Z (i) as follows. Z (i) is equal to W transpose x (i) plus B. So throughout this course, we're going to use this notational convention, that the superscript parentheses i refers to data, X or Y or Z or something else, associated with the i-th training example. That's what the superscript i in parentheses means. Now, let's see what loss function or error function we can use to measure how well our algorithm is doing. One thing you could do is define the loss when your algorithm outputs y-hat and the true label is Y to be maybe the squared error or one half the squared error.
It turns out that you could do this, but in logistic regression people don't usually do this, because when you come to learn the parameters, you find that the optimization problem, which we'll talk about later, becomes non-convex. So you end up with an optimization problem with multiple local optima, so gradient descent may not find the global optimum. If you didn't understand the last couple of comments, don't worry about it, we'll get to it in a later video. But the intuition to take away is that this function L, called the loss function, is a function you'll need to define to measure how good our output y-hat is when the true label is y. So squared error seems like it might be a reasonable choice, except that it makes gradient descent not work well. So in logistic regression, we will actually define a different loss function that plays a similar role to squared error, but that will give us an optimization problem that is convex, and so, as we'll see in a later video, becomes much easier to optimize. So, what we use in logistic regression is actually the following loss function, which I'll just write out here: L(y-hat, y) is negative (y log y-hat plus (one minus y) log (one minus y-hat)). Here's some intuition for why this loss function makes sense. Keep in mind that if we're using squared error then you want the squared error to be as small as possible. And with this logistic regression loss function, we'll also want this to be as small as possible. To understand why this makes sense, let's look at the two cases. In the first case, let's say Y is equal to one; then the loss function L(y-hat, y) is just this first term, with the negative sign. So this is negative log y-hat if y is equal to one, because if y equals one then the second term, one minus Y, is equal to zero. So this says if y equals one you want negative log y-hat to be as small as possible. So that means you want log y-hat to be large, to be as big as possible, and that means you want y-hat to be large.
But because y-hat is, you know, the output of the sigmoid function, it can never be bigger than one. So this is saying that if y is equal to one, you want y-hat to be as big as possible. But it can't ever be bigger than one, so this is saying you want y-hat to be close to one. The other case is if y equals zero. If y equals zero then the first term in the loss function is equal to zero, because y is zero, and then the second term defines the loss function. So the loss becomes negative log one minus y-hat. And so if in your learning procedure you try to make the loss function small, what this means is that you want log one minus y-hat to be large, because there's a negative sign there. And then through a similar piece of reasoning you can conclude that this loss function is trying to make y-hat as small as possible. And again, because y-hat has to be between zero and one, this is saying that if y is equal to zero then your loss function will push the parameters to make y-hat as close to zero as possible. Now, there are a lot of functions with roughly this effect, that if y is equal to one we try to make y-hat large and if Y is equal to zero we try to make y-hat small. We just gave here in green a somewhat informal justification for this loss function; we'll provide an optional video later to give a more formal justification for why in logistic regression we like to use the loss function with this particular form. Finally, the loss function was defined with respect to a single training example. It measures how well you're doing on a single training example. I'm now going to define something called the cost function, which measures how well you're doing on the entire training set. So the cost function J, which is applied to your parameters W and B, is going to be the average, that is one over m times the sum, of the loss function applied to each of the training examples in turn.
Here y-hat is of course the prediction output by your logistic regression algorithm using, you know, a particular set of parameters W and B. And so just to expand this out, this is equal to negative one over m times the sum from i equals one through m of the definition of the loss function. So this is y (i) log y-hat (i) plus (one minus y (i)) log (one minus y-hat (i)). I guess I could put square brackets here, so the minus sign is outside everything else. So the terminology I'm going to use is that the loss function is applied to just a single training example, like so. And the cost function is the cost of your parameters. So in training your logistic regression model, we're going to try to find parameters W and B that minimize the overall cost function J written at the bottom. So, you've just seen the setup for the logistic regression algorithm, the loss function for a training example and the overall cost function for the parameters of your algorithm. It turns out that logistic regression can be viewed as a very, very small neural network. In the next video we'll go over that so you can start gaining intuition about what neural networks do. So with that, let's go on to the next video about how to view logistic regression as a very small neural network.
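To make the distinction between the loss (one example) and the cost (the whole training set) concrete, here is a minimal NumPy sketch. The function names `loss` and `cost` and the numeric predictions are made up for illustration, and the sketch does not guard against y-hat being exactly 0 or 1, where the logarithm is undefined.

```python
import numpy as np

def loss(y_hat, y):
    """Logistic regression loss: -(y log y_hat + (1 - y) log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost J: the average of the loss over all m training examples.

    Y_hat and Y both have shape (1, m).
    """
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

# When y = 1, the loss is small if y_hat is close to 1 and large if it is close to 0.
print(loss(0.99, 1))
print(loss(0.01, 1))
# When y = 0, the situation is reversed.
print(loss(0.01, 0))
print(loss(0.99, 0))

Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.6]])    # made-up predictions
print(cost(Y_hat, Y))
```

The printed values follow the two cases discussed above: a confident correct prediction gives a loss near zero, while a confident wrong prediction gives a large loss.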
Gradient Descent - 11m
0:00
You've seen the logistic regression model. You've seen the loss function that measures how well you're doing on the single training example. You've also seen the cost function that measures how well your parameters w and b are doing on your entire training set. Now let's talk about how you can use the gradient descent algorithm to train, or to learn, the parameters w and b on your training set. To recap, here is the familiar logistic regression algorithm.
0:31
And we have on the second line the cost function, J, which is a function of your parameters w and b. And that's defined as the average. So it's 1 over m times the sum of this loss function. And so the loss function measures how well your algorithm's output y-hat(i) on each of the training examples stacks up, or compares, to the ground truth label y(i) on each of the training examples. And the full formula is expanded out on the right. So the cost function measures how well your parameters w and b are doing on the training set. So in order to learn the set of parameters w and b, it seems natural that we want to find w and b that make the cost function J(w,b) as small as possible. So here's an illustration of gradient descent. In this diagram the horizontal axes represent your space of parameters, w and b. In practice, w can be much higher dimensional, but for the purposes of plotting, let's illustrate w as a single real number and b as a single real number. The cost function J(w,b) is, then, some surface above these horizontal axes w and b. So the height of the surface represents the value of J(w,b) at a certain point. And what we want to do is really to find the value of w and b that corresponds to the minimum of the cost function J.
2:00
It turns out that this cost function J is a convex function. So it's just a single big bowl, so this is a convex function, and this is opposed to functions that look like this, which are non-convex and have lots of different local optima. So the fact that our cost function J(w,b) as defined here is convex is one of the huge reasons why we use this particular cost function, J, for logistic regression. So to find a good value for the parameters, what we'll do is initialize w and b to some initial value, maybe denoted by that little red dot. And for logistic regression almost any initialization method works; usually you initialize the value to zero. Random initialization also works, but people don't usually do that for logistic regression. But because this function is convex, no matter where you initialize, you should get to the same point or roughly the same point. And what gradient descent does is it starts at that initial point and then takes a step in the steepest downhill direction. So after one step of gradient descent you might end up there, because it's trying to take a step downhill in the direction of steepest descent, or as quickly downhill as possible. So that's one iteration of gradient descent. And after two iterations of gradient descent you might step there, three iterations and so on. I guess this is now hidden by the back of the plot, until eventually, hopefully, you converge to this global optimum or get to something close to the global optimum. So this picture illustrates the gradient descent algorithm. Let's write a bit more of the details. For the purpose of illustration, let's say that there's some function, J(w), that you want to minimize, and maybe that function looks like this. To make this easier to draw, I'm going to ignore b for now, just to make this a one-dimensional plot instead of a high-dimensional plot. So gradient descent does this: we're going to repeatedly carry out the following update.
We're going to take the value of w and update it, and I'm going to use colon equals to represent updating w. So set w to w minus alpha times, and this is a derivative, dJ(w)/dw. I will repeatedly do that until the algorithm converges. So a couple of points on the notation: alpha here is the learning rate, and controls how big a step we take on each iteration of gradient descent. We'll talk later about some ways of choosing the learning rate alpha. And second, this quantity here, this is a derivative. This is basically the update or the change you want to make to the parameter w. When we start to write code to implement gradient descent, we're going to use the convention that the variable name in our code
4:58
dw will be used to represent this derivative term. So when you write code you write something like w colon equals w minus alpha times dw. And so we use dw to be the variable name to represent this derivative term. Now let's just make sure that this gradient descent update makes sense. Let's say that w was over here. So you're at this point on the cost function J(w). Remember that the definition of a derivative is the slope of a function at the point. So the slope of the function is really the height divided by the width, right, of a little triangle here at this tangent to J(w) at that point. And so, here the derivative is positive. W gets updated as w minus a learning rate times the derivative. The derivative is positive and so you end up subtracting from w, so you end up taking a step to the left. And so gradient descent will make your algorithm slowly decrease the parameter if you have started off with this large value of w. As another example, if w was over here, then at this point the slope dJ/dw will be negative, and so the gradient descent update would subtract alpha times a negative number. And so you end up slowly increasing w, so you end up making w bigger and bigger with successive iterations of gradient descent. So that hopefully, whether you initialize on the left or on the right, gradient descent will move you towards this global minimum here. If you're not familiar with derivatives or with calculus and what this term dJ(w)/dw means, don't worry too much about it. We'll talk some more about derivatives in the next video. If you have a deep knowledge of calculus, you might be able to have deeper intuitions about how neural networks work. But even if you're not that familiar with calculus, in the next few videos we'll give you enough intuitions about derivatives and about calculus that you'll be able to effectively use neural networks.
But the overall intuition for now is that this term represents the slope of the function, and we want to know the slope of the function at the current setting of the parameters so that we can take these steps of steepest descent, so that we know what direction to step in in order to go downhill on the cost function J.
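To make the update rule w := w - alpha * dw concrete, here is a minimal Python sketch. The cost function J(w) = (w - 2)^2 and the names `J`, `dJ_dw`, and `gradient_descent` are assumptions chosen only for illustration; they are not from the lecture. The sketch shows that starting on the right (positive slope) decreases w, starting on the left (negative slope) increases w, and both runs end up near the minimum.

```python
# Hypothetical convex cost J(w) = (w - 2)^2, used only as an illustration.
def J(w):
    return (w - 2.0) ** 2


def dJ_dw(w):
    # Slope of J at the current value of w.
    return 2.0 * (w - 2.0)


def gradient_descent(w, alpha=0.1, num_iters=100):
    # Repeat the update w := w - alpha * dw a fixed number of times.
    for _ in range(num_iters):
        dw = dJ_dw(w)       # dw is the code name for dJ(w)/dw
        w = w - alpha * dw  # step downhill on J
    return w


w_from_right = gradient_descent(5.0)   # slope starts positive, so w decreases
w_from_left = gradient_descent(-3.0)   # slope starts negative, so w increases
print(w_from_right, w_from_left)
```

Both runs should finish very close to w = 2, the minimum of this particular J. The fixed iteration count stands in for "repeat until convergence".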
7:36
So we wrote out gradient descent for J(w), where only w was your parameter. In logistic regression, your cost function is a function of both w and b. So in that case, the inner loop of gradient descent, that is this thing here, this thing you have to repeat, becomes as follows. You end up updating w as w minus the learning rate times the derivative of J(w,b) with respect to w. And you update b as b minus the learning rate times the derivative of the cost function with respect to b. So these two equations at the bottom are the actual update you implement. As an aside I just want to mention one notational convention in calculus that is a bit confusing to some people. I don't think it's super important that you understand calculus, but in case you see this I want to make sure that you don't think too much of this. Which is that in calculus, this term here, we actually write as follows, with that funny squiggle symbol. So this symbol, this is actually just a lowercase d in a fancy font, in a stylized font. When you see this expression, all this means is the derivative of J(w,b), or really the slope of the function J(w,b), how much that function slopes in the w direction. And the rule of the notation in calculus, which I think isn't totally logical, and which I think just makes things much more complicated than they need to be, is that if J is a function of two or more variables, then instead of using lowercase d you use this funny symbol. This is called a partial derivative symbol. But don't worry about this, and if J is a function of only one variable, then you use lowercase d. So the only difference between whether you use this funny partial derivative symbol or lowercase d as we did on top, is whether J is a function of two or more variables, in which case you use this symbol, the partial derivative symbol, or whether J is a function of only one variable, in which case you use lowercase d.
This is one of those funny rules of notation in calculus that I think just make things more complicated than they need to be. But if you see this partial derivative symbol, all it means is you're measuring the slope of the function with respect to one of the variables. And similarly, to adhere to the formally correct mathematical notation in calculus, because here J has two inputs, not just one, this thing at the bottom should be written with this partial derivative symbol. But it really means the same thing as, almost the same thing as, lowercase d. Finally, when you implement this in code, we're going to use the convention that this quantity, really the amount by which you update w, we'll denote as the variable dw in your code. And this quantity, right, the amount by which you want to update b, we'll denote by the variable db in your code. All right, so, that's how you can implement gradient descent. Now if you haven't seen calculus for a few years, I know that that might seem like a lot more derivatives and calculus than you might be comfortable with so far. But if you're feeling that way, don't worry about it. In the next video, we'll give you better intuition about derivatives. And even without a deep mathematical understanding of calculus, with just an intuitive understanding of calculus you will be able to make neural networks work effectively. So with that, let's go on to the next video where we'll talk a little bit more about derivatives.
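The two-parameter update can be sketched in Python as follows. The cost J(w, b) below is a made-up bowl-shaped function, not the logistic regression cost, and estimating each partial derivative by nudging one variable while holding the other fixed is an illustrative choice rather than how the course computes dw and db. The point is only to show that "partial derivative with respect to w" means the slope in the w direction, and that both parameters are updated on every iteration.

```python
# Hypothetical cost with two parameters; its minimum is at w = 1, b = -3.
def J(w, b):
    return (w - 1.0) ** 2 + (b + 3.0) ** 2


def partial_w(w, b, eps=1e-6):
    # Slope of J in the w direction: nudge w only, keep b fixed.
    return (J(w + eps, b) - J(w - eps, b)) / (2 * eps)


def partial_b(w, b, eps=1e-6):
    # Slope of J in the b direction: nudge b only, keep w fixed.
    return (J(w, b + eps) - J(w, b - eps)) / (2 * eps)


w, b = 4.0, 2.0
alpha = 0.1
for _ in range(200):
    dw = partial_w(w, b)  # code name for the partial derivative of J w.r.t. w
    db = partial_b(w, b)  # code name for the partial derivative of J w.r.t. b
    w = w - alpha * dw
    b = b - alpha * db

print(w, b)
```

Note that dw and db are both computed from the current (w, b) before either parameter is changed.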
Derivatives - 7m
0:00
In this video, I want to help you gain an intuitive understanding of calculus and derivatives. Now, maybe you're thinking that you haven't seen calculus since your college days, and depending on when you graduated, maybe that was quite some time back. Now, if that's what you're thinking, don't worry, you don't need a deep understanding of calculus in order to apply neural networks and deep learning very effectively. So, if you're watching this video or some of the later videos and you're wondering, well, is this stuff really for me, this calculus looks really complicated, my advice to you is the following: watch the videos, and then if you can do the homeworks and complete the programming homeworks successfully, then you can apply deep learning. In fact, what you'll see later is that in week four, we'll define a couple of types of functions that will enable you to encapsulate everything that needs to be done with respect to calculus. These functions are called forward functions and backward functions, and you'll learn about them. That lets you put everything you need to know about calculus into these functions, so that you don't need to worry about it anymore beyond that. But I thought that in this foray into deep learning, this week we should open up the box and peer a little bit further into the details of calculus. But really, all you need is an intuitive understanding of this in order to build and successfully apply these algorithms. Finally, if you are among that maybe smaller group of people that are expert in calculus, if you are very familiar with calculus and derivatives, it's probably okay for you to skip this video. But for everyone else, let's dive in, and try to gain an intuitive understanding of derivatives. I plotted here the function f(a) equals 3a. So, it's just a straight line. To get intuition about derivatives, let's look at a few points on this function. Let's say that a is equal to two.
In that case, f of a, which is equal to three times a, is equal to six. So, if a is equal to two, then f of a will be equal to six. Let's say we give the value of a just a little bit of a nudge. I'm going to just bump up a, a little bit, so that it is now 2.001. So, I'm going to give a like a tiny little nudge, to the right. So now, let's say a is 2.001. This plot is not drawn to scale; the 0.001 difference is too small to show on this plot, so just give it a little nudge to the right. Now, f(a) is equal to three times that. So, it's 6.003, so we plot this over here. This is not to scale, this is 6.003. So, if you look at this little triangle here that I'm highlighting in green, what we see is that if I nudge a 0.001 to the right, then f of a goes up by 0.003. The amount that f of a went up is three times as big as the amount that I nudged a to the right. So, we're going to say that the slope or the derivative of the function f of a, at a equals two, is three. The term derivative basically means slope, it's just that derivative sounds like a scarier and more intimidating word, whereas slope is a friendlier way to describe the concept of derivative. So, whenever you hear derivative, just think slope of the function. More formally, the slope is defined as the height divided by the width of this little triangle that we have in green. So, this is 0.003 over 0.001, and the fact that the slope is equal to three or the derivative is equal to three, just represents the fact that when you nudge a to the right by 0.001, by a tiny amount, the amount that f of a goes up is three times as big as the amount that you nudged a in the horizontal direction. So, that's all that the slope of a line is. Now, let's look at this function at a different point. Let's say that a is now equal to five. In that case, f of a, three times a, is equal to 15. So, let's do that again, give a, a nudge to the right.
A tiny little nudge, it's now bumped up to 5.001, f of a is three times that. So, f of a is equal to 15.003. So, once again, when I bump a to the right, nudge a to the right by 0.001, f of a goes up three times as much. So the slope, again, at a = 5, is also three. So, the way we write this, that the slope of the function f is equal to three: we say df(a)/da = 3, and this just means the slope of the function f(a), when you nudge the variable a a tiny little amount, is equal to three. An alternative way to write this derivative formula is as follows. You can also write this as d/da of f(a). So, whether you put f(a) on top or whether you write it down here, it doesn't matter. But all this equation means is that, if I nudge a to the right a little bit, I expect f(a) to go up by three times as much as I nudged the value of little a. Now, for this video I explained derivatives by talking about what happens if we nudge the variable a by 0.001. If you want the formal mathematical definition of the derivative: derivatives are defined with an even smaller value of how much you nudge a to the right. So, it's not 0.001. It's not 0.000001. It's not 0.00000000 and so on 1. It's even smaller than that, and the formal definition of the derivative says, whenever you nudge a to the right by an infinitesimal amount, basically an infinitely tiny, tiny amount, if you do that, f(a) goes up three times as much as whatever was the tiny, tiny, tiny amount that you nudged a to the right. So, that's actually the formal definition of a derivative. But for the purposes of our intuitive understanding, we'll talk about nudging a to the right by this small amount 0.001, even if 0.001 isn't exactly tiny, tiny infinitesimal. Now, one property of the derivative is that, no matter where you take the slope of this function, it is equal to three, whether a is equal to two or a is equal to five.
The slope of this function is equal to three, meaning that whatever is the value of a, if you increase it by 0.001, the value of f of a goes up by three times as much. So, this function has the same slope everywhere. One way to see that is that, wherever you draw this little triangle, the height divided by the width always has a ratio of three to one. So, I hope this gives you a sense of what the slope or the derivative of a function means for a straight line, where in this example the slope of the function was three everywhere. In the next video, let's take a look at a slightly more complex example, where the slope of the function can be different at different points on the function.
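The nudge experiment described above can be reproduced in a few lines of Python. This is a small sketch; the helper name `slope` and its `nudge` parameter are made up for illustration. It computes height over width of the little triangle for f(a) = 3a at a = 2 and a = 5.

```python
def f(a):
    return 3 * a


def slope(func, a, nudge=0.001):
    # Height of the little triangle divided by its width.
    return (func(a + nudge) - func(a)) / nudge


slope_at_2 = slope(f, 2.0)
slope_at_5 = slope(f, 5.0)
print(slope_at_2, slope_at_5)
```

Both values should come out at 3, up to floating-point rounding, matching the observation that a straight line has the same slope everywhere.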
More Derivative Examples - 10m
0:00
In this video, I'll show you a slightly more complex example where the slope of the function can be different at different points on the function. Let's start with an example. Here I have plotted the function f(a) = a². Let's take a look at the point a=2. So a² or f(a) = 4. Let's nudge a slightly to the right, so now a=2.001. f(a), which is a², is going to be approximately 4.004. It turns out that the exact value, if you pull up a calculator and figure this out, is actually 4.004001. I'm just going to say 4.004 is close enough. So what this means is that when a=2, let's draw this on the plot. So what we're saying is that if a=2, then f(a) = 4, and here the x and y axes are not drawn to scale. Technically, this vertical height should be much higher than this horizontal width, so the x and y axes are not on the same scale. But if I now nudge a to 2.001 then f(a) becomes roughly 4.004. So if we draw this little triangle again, what this means is that if I nudge a to the right by 0.001, f(a) goes up four times as much, by 0.004. So in the language of calculus, we say that the slope, that is the derivative, of f(a) at a=2 is 4, or to write this out in our calculus notation, we say that d/da of f(a) = 4 when a=2. Now one thing about this function f(a) = a² is that the slope is different for different values of a. This is different than the example we saw on the previous slide. So let's look at a different point. If a=5, so instead of a=2, now a=5, then a²=25, so that's f(a). If I nudge a to the right again, it's a tiny little nudge to a, so now a=5.001, then f(a) will be approximately 25.010. So what we see is that by nudging a up by .001, f(a) goes up ten times as much. So we have that d/da f(a) = 10 when a=5 because f(a) goes up ten times as much as a does when I make a tiny little nudge to a.
So one way to see why the derivative is different at different points is that if you draw that little triangle at different locations on this curve, you'll see that the ratio of the height of the triangle over the width of the triangle is very different at different points on the curve. So here, the slope=4 when a=2, and the slope=10 when a=5. Now if you pull up a calculus textbook, a calculus textbook will tell you that d/da of f(a), so f(a) = a², so that's d/da of a². One of the formulas you'll find in the calculus textbook is that this thing, the slope of the function a², is equal to 2a. I'm not going to prove this, but the way you find this out is that you open up a calculus textbook to the table of formulas and it'll tell you that the derivative of a² is 2a. And indeed, this is consistent with what we've worked out. Namely, when a=2, the slope of the function, 2a, is 2 x 2 = 4. And when a=5, the slope of the function, 2a, is 2 x 5 = 10. So, if you ever pull up a calculus textbook and you see this formula, that the derivative of a² is 2a, all that means is that for any given value of a, if you nudge it upward by 0.001, or really by a tiny little value, you would expect f(a) to go up by 2a, that is, the slope or the derivative, times however much you had nudged the value of a to the right. Now one tiny little detail: I used these approximate symbols here, and this wasn't exactly 4.004, there's an extra 0.000001 hanging out there. It turns out that this extra 0.000001, this little thing here, is because we were nudging a to the right by 0.001. If we were instead nudging it to the right by an infinitesimally small value, then this extra term would go away and you'd find that the amount that f(a) goes up is exactly equal to the derivative times the amount that you nudge a to the right. And the reason why it is not 4.004 exactly is because derivatives are defined using these infinitesimally small nudges to a rather than 0.001. And while 0.001 is small, it's not infinitesimally small.
So that's why the amount that f(a) went up isn't exactly given by the formula but is only approximately given by the derivative. To wrap up this video, let's just go through a few more quick examples. The example you've already seen is that if f(a) = a² then the calculus textbook's formula table will tell you that the derivative is equal to 2a. And so the example we went through was that if a = 2, f(a) = 4, and if we nudge a so that it's a little bit bigger, then f(a) is about 4.004, and so f(a) went up four times as much, and indeed when a=2, the derivative is equal to 4. Let's look at some other examples. Let's say, instead, that f(a) = a³. If you go to a calculus textbook and look up the table of formulas, you see that the slope of this function, again, the derivative of this function, is equal to 3a². So you can get this formula out of the calculus textbook. So what does this mean? The way to interpret this is as follows. Let's take a=2 as an example again. So f(a) or a³=8, that's two to the power of three. If we give a a tiny little nudge, you find that f(a) is about 8.012, and feel free to check this. Take 2.001 to the power of three, you find this is very close to 8.012. And indeed, when a=2, that's 3 x 2², which is equal to 3 x 4, so that's 12. So the derivative formula predicts that if you nudge a to the right by a tiny little bit, f(a) should go up 12 times as much. And indeed, this is true: when a went up by .001, f(a) went up 12 times as much, by .012. Just one last example and then we'll wrap up. Let's say that f(a) is equal to the log function. So I'm going to write log of a, and I'm going to use this as the base e logarithm. Some people write that as ln(a). So if you go to a calculus textbook, you find that when you take the derivative of log(a), and this is a function that just looks like that, the slope of this function is given by 1/a.
So the way to interpret this is that if a has any value, and let's just keep using a=2 as an example, and you nudge a to the right by .001, you would expect f(a) to go up by 1/a, that is by the derivative, times the amount that you increase a. So in fact, if you pull up a calculator, you find that if a=2, f(a) is about 0.69315, and if you increase a to 2.001 then f(a) is about 0.69365, so this has gone up by 0.0005. And indeed, if you look at the formula for the derivative when a=2, d/da f(a) = 1/2. So this derivative formula predicts that if you bump up a by .001, you would expect f(a) to go up by only 1/2 as much, and 1/2 of .001 is 0.0005, which is exactly what we got. So when a goes up by .001, going from a=2 to a=2.001, f(a) goes up by half as much. So, f(a) goes up by approximately .0005. So if we draw that little triangle, what it shows is that if the horizontal axis goes up by .001, then on the vertical axis, log(a) goes up by half of that, so .0005. And so that 1/a, or 1/2 in this case when a=2, is just the slope of this line at that point. So that's it for derivatives. There are just two take-home messages from this video. First is that the derivative of a function just means the slope of the function, and the slope of a function can be different at different points on the function. In our first example, where f(a) = 3a, that's a straight line. The derivative was the same everywhere, it was three everywhere. For other functions like f(a) = a² or f(a) = log(a), the slope of the line varies. So, the slope or the derivative can be different at different points on the curve. So that's the first takeaway. Derivative just means slope of a line. The second takeaway is that if you want to look up the derivative of a function, you can flip open your calculus textbook or look up Wikipedia and often get a formula for the slope of these functions at different points. So with that, I hope you have an intuitive understanding of derivatives or slopes of lines.
Let's go into the next video. We'll start to talk about the computation graph and how to use that to compute derivatives of more complex functions.
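The three examples from this video can be checked numerically. The sketch below compares the nudge-based slope against the textbook formulas 2a, 3a², and 1/a; the helper `slope` and the `examples` table are names invented for this illustration. It also shows that a smaller nudge brings the measured slope closer to the formula, which is the "infinitesimal" point made above.

```python
import math


def slope(func, a, nudge=0.001):
    # Approximate slope: how much func goes up per unit of nudge in a.
    return (func(a + nudge) - func(a)) / nudge


# (name, function, textbook derivative formula)
examples = [
    ("a^2", lambda a: a ** 2, lambda a: 2 * a),
    ("a^3", lambda a: a ** 3, lambda a: 3 * a ** 2),
    ("log(a)", lambda a: math.log(a), lambda a: 1 / a),
]

for name, func, formula in examples:
    for a in (2.0, 5.0):
        print(name, "at a =", a, "nudged slope:", slope(func, a),
              "formula:", formula(a))

# A smaller nudge shrinks the gap between the measured slope and 2a.
coarse_gap = abs(slope(lambda a: a ** 2, 2.0, nudge=0.001) - 4.0)
fine_gap = abs(slope(lambda a: a ** 2, 2.0, nudge=0.000001) - 4.0)
print(coarse_gap, fine_gap)
```

With a nudge of 0.001 the measured slopes are close to, but not exactly equal to, the formula values.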
Computation graph - 3m
0:00
You've heard me say that the computations of a neural network are organized in terms of a forward pass or a forward propagation step, in which we compute the output of the neural network, followed by a backward pass or back propagation step, which we use to compute gradients or compute derivatives. The computation graph explains why it is organized this way. In this video, we'll go through an example. In order to illustrate the computation graph, let's use a simpler example than logistic regression or a full-blown neural network. Let's say that we're trying to compute a function, J, which is a function of three variables a, b, and c, and let's say that function is 3(a+bc). Computing this function actually has three distinct steps. The first is you need to compute what is bc, and let's say we store that in the variable called u. So u=bc, and then you might compute v=a+u. So let's say this is v. And then finally, your output J is 3v. So this is your final function J that you're trying to compute. We can take these three steps and draw them in a computation graph as follows. Let's say I draw your three variables a, b, and c here. So the first thing we did was compute u=bc. So I'm going to put a rectangular box around that. And so the inputs to that are b and c. And then, you might have v=a+u. So the inputs to that are u, which we just computed, together with a. And then finally, we have J=3v. So as a concrete example, if a=5, b=3 and c=2 then u=bc would be six, v=a+u would be 5+6, which is 11, and J is three times that, so J=33. And indeed, hopefully you can verify that this is three times five plus three times two. And if you expand that out, you actually get 33 as the value of J. So, the computation graph comes in handy when there is some distinguished or some special output variable, such as J in this case, that you want to optimize. And in the case of logistic regression, J is of course the cost function that we're trying to minimize.
And what we're seeing in this little example is that, through a left-to-right pass, you can compute the value of J. And what we'll see in the next couple of slides is that in order to compute derivatives, there'll be a right-to-left pass like this, kind of going in the opposite direction to the blue arrows. That would be most natural for computing the derivatives. So to recap, the computation graph organizes a computation with this blue arrow, left-to-right computation. Let's defer to the next video how you can do the backward, red arrow, right-to-left computation of the derivatives. Let's go on to the next video.
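The left-to-right pass for J = 3(a + bc) can be written as three lines of Python, one per box in the graph. The function name `forward` is an assumption made for this sketch.

```python
def forward(a, b, c):
    # Each assignment corresponds to one node of the computation graph.
    u = b * c    # first node:  u = bc
    v = a + u    # second node: v = a + u
    J = 3 * v    # output node: J = 3v
    return u, v, J


u, v, J = forward(5, 3, 2)
print(u, v, J)
```

For a=5, b=3, c=2 this reproduces the values from the example: u=6, v=11, J=33.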
Derivatives with a Computation Graph - 14m
0:00
In the last video, we worked through an example of using a computation graph to compute a function J. Now, let's take a cleaned-up version of that computation graph, and show how you can use it to figure out derivative calculations for that function J. So here's a computation graph. Let's say you want to compute the derivative of J with respect to v.
0:23
So what is that? Well, this says, if we were to take this value of v and change it a little bit, how would the value of J change? Well, J is defined as 3 times v. And right now, v = 11. So if we were to bump up v by a little bit to 11.001, then J, which is 3v, so currently 33, will get bumped up to 33.003. So here, we've increased v by 0.001. And the net result of that is that J goes up 3 times as much. So the derivative of J with respect to v is equal to 3, because the increase in J is 3 times the increase in v. And in fact, this is very analogous to the example we had in the previous video, where we had f(a) = 3a. And so we then derived that df(a)/da, which in slightly simplified, slightly sloppy notation you can write as df/da, is equal to 3. So instead, here we have J = 3v, and so dJ/dv = 3, with J here playing the role of f, and v playing the role of a in this previous example that we had from an earlier video. So indeed, in the terminology of backpropagation, what we're seeing is that if you want to compute the derivative of this final output variable, which usually is the variable you care most about, with respect to v, then we've done one step of backpropagation. So we call it one step backwards in this graph. Now let's look at another example. What is dJ/da? In other words, if we bump up the value of a, how does that affect the value of J?
2:35
Well, let's go through the example, where now a = 5. So let's bump it up to 5.001. The net impact of that is that v, which was a + u, so that was previously 11. This would get increased to 11.001. And then we've already seen as above that J now gets bumped up to 33.003. So what we're seeing is that if you increase a by 0.001, J increases by 0.003. And by increase a, I mean, you have to take this value of 5 and just plug in a new value. Then the change to a will propagate to the right of the computation graph so that J ends up being 33.003. And so the increase to J is 3 times the increase to a. So that means this derivative is equal to 3. And one way to break this down is to say that if you change a, then that will change v.
3:40
And through changing v, that would change J. And so the net change to the value of J when you bump up the value, when you nudge the value of a up a little bit, is that,
3:57
First, by changing a, you end up increasing v. Well, how much does v increase? It is increased by an amount that's determined by dv/da. And then the change in v will cause the value of J to also increase. So in calculus, this is actually called the chain rule: if a affects v, which affects J, then the amount that J changes when you nudge a is the product of how much v changes when you nudge a times how much J changes when you nudge v. So in calculus, again, this is called the chain rule. And what we saw from this calculation is that if you increase a by 0.001, v changes by the same amount. So dv/da = 1. So in fact, if you plug in what we have worked out previously, dJ/dv = 3 and dv/da = 1. So the product of these, 3 times 1, actually gives you the correct value that dJ/da = 3. So this little illustration shows how having computed dJ/dv, that is, the derivative with respect to this variable, can then help you to compute dJ/da. And so that's another step of this backward calculation.
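The chain rule step above can be checked numerically with the same 0.001 nudge used in the lecture. In this sketch the helper names `v_of` and `J_of` are invented for illustration. It measures dv/da and dJ/dv separately, multiplies them, and compares the product against the slope measured by nudging a and recomputing J directly.

```python
def v_of(a, u):
    return a + u


def J_of(v):
    return 3 * v


a, u = 5.0, 6.0
nudge = 0.001
v = v_of(a, u)  # 11.0

# How much v changes per unit change in a (expected: about 1).
dv_da = (v_of(a + nudge, u) - v) / nudge
# How much J changes per unit change in v (expected: about 3).
dJ_dv = (J_of(v + nudge) - J_of(v)) / nudge

# Chain rule: dJ/da = dJ/dv * dv/da.
dJ_da_chain = dJ_dv * dv_da
# Direct measurement: nudge a and propagate the change all the way to J.
dJ_da_direct = (J_of(v_of(a + nudge, u)) - J_of(v)) / nudge

print(dv_da, dJ_dv, dJ_da_chain, dJ_da_direct)
```

Both routes should give a value of about 3, up to floating-point rounding.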
5:39
I just want to introduce one more new notational convention. Which is that when you're writing code to implement backpropagation, there will usually be some final output variable that you really care about. So a final output variable that you really care about or that you want to optimize. And in this case, this final output variable is J. It's really the last node in your computation graph. And so a lot of computations will be trying to compute the derivative of that final output variable. So d of this final output variable with respect to some other variable. Then we just call that dvar. So a lot of the computations you have will be to compute the derivative of the final output variable, J in this case, with respect to various intermediate variables, such as a, b, c, u or v. And when you implement this in software, what do you call this variable name? One thing you could do is, in Python, you could give it a very long variable name like dFinalOutputVar_dvar. But that's a very long variable name. You could call this, I guess, dJdvar. But because you're always taking derivatives of J, of this final output variable, I'm going to introduce a new notation. Where, in code, when you're computing this thing in the code you write, we're just going to use the variable name dvar in order to represent that quantity. So dvar in the code you write will represent the derivative of the final output variable you care about, such as J, or sometimes the loss L, with respect to the various intermediate quantities you're computing in your code. So this thing here, in your code, you use dv to denote this value. So dv would be equal to 3. And in your code, you represent this as da, which we also figured out to be equal to 3. So we've done backpropagation partially through this computation graph. Let's go through the rest of this example on the next slide. So let's go to a cleaned-up copy of the computation graph.
And just to recap, what we've done so far is go backward here and figure out that dv = 3. And again, dv is just a variable name in the code; the quantity it holds is really dJ/dv. We've figured out that da = 3. And again, da is the variable name in your code and that's really the value dJ/da.
8:32
And we hand wave how we've gone backwards on these two edges like so. Now let's keep computing derivatives. Now let's look at the value u. So what is dJ/du? Well, through a similar calculation as what we did before, we start off with u = 6. If you bump up u to 6.001, then v, which was previously 11, goes up to 11.001. And so J goes from 33 to 33.003. And so the increase in J is 3 times the increase in u, so dJ/du is equal to 3. And the analysis for u is very similar to the analysis we did for a. This is actually computed as dJ/dv times dv/du, where dJ/dv we had already figured out was 3. And dv/du turns out to be equal to 1. So we've gone up one more step of backpropagation. We end up computing that du is also equal to 3. And du is, of course, just this dJ/du. Now we just step through one last example in detail. So what is dJ/db? So here, imagine if you are allowed to change the value of b. And you want to tweak b a little bit in order to minimize or maximize the value of J. So what is the derivative or what's the slope of this function J when you change the value of b a little bit?
10:11
It turns out that using the chain rule for calculus, this can be written as the product of two things: dJ/du times du/db. And the reasoning is, if you change b a little bit, so from b = 3 to, say, 3.001, the way that it will affect J is it will first affect u. So how much does it affect u? Well, u is defined as b times c. So this will go from 6, when b = 3, to now 6.002, because c = 2 in our example here. And so this tells us that du/db = 2. Because when you bump up b by 0.001, u increases twice as much. So du/db, this is equal to 2. And now, we know that u has gone up twice as much as b has gone up. Well, what is dJ/du? We've already figured out that this is equal to 3. And so by multiplying these two out, we find that dJ/db = 6. And again, here's the reasoning for the second part of the argument. Which is we want to know when u goes up by 0.002, how does that affect J? The fact that dJ/du = 3, that tells us that when u goes up by 0.002, J goes up 3 times as much. So J should go up by 0.006. So this comes from the fact that dJ/du = 3. And if you check the math in detail, you will find that if b becomes 3.001, then u becomes 6.002, v becomes 11.002. So that's a + u, so that's 5 + u. And then J, which is equal to 3 times v, that ends up being equal to 33.006. And so that's how you get that dJ/db = 6. And to fill that in, this is if we go backwards, so this is db = 6. And db really is the Python code variable name for dJ/db. And I won't go through the last example in great detail. But it turns out that if you also compute out dJ/dc, this turns out to be dJ/du times du/dc. And this turns out to be 9, that is, 3 times 3. I won't go through that example in detail. So through this last step, it is possible to derive that dc is equal to 9.
13:20
So the key takeaway from this video, from this example, is that when computing all of these derivatives, the most efficient way to do so is through a right-to-left computation following the direction of the red arrows. And in particular, we'll first compute the derivative with respect to v. And then that becomes useful for computing the derivative with respect to a and the derivative with respect to u. And then the derivative with respect to u, for example, this term over here and this term over here, in turn becomes useful for computing the derivative with respect to b and the derivative with respect to c. So that was the computation graph, and how there's a forward or left-to-right calculation to compute the cost function, such as J, that you might want to optimize, and a backward or right-to-left calculation to compute derivatives. If you're not familiar with calculus or the chain rule, I know some of those details might have gone by really quickly. But if you didn't follow all the details, don't worry about it. In the next video, we'll go over this again in the context of logistic regression. And show you exactly what you need to do in order to implement the computations you need to compute the derivatives of the logistic regression model.
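Putting the whole example together, the right-to-left pass can be sketched in Python using the dvar naming convention from this video. The function name `forward_backward` and the dictionary it returns are assumptions made for this illustration. Each derivative reuses one that was computed just before it, which is why the right-to-left order is efficient.

```python
def forward_backward(a, b, c):
    # Forward (left-to-right) pass.
    u = b * c
    v = a + u
    J = 3 * v

    # Backward (right-to-left) pass; dvar means dJ/dvar.
    dv = 3.0        # J = 3v, so dJ/dv = 3
    da = dv * 1.0   # v = a + u, so dv/da = 1
    du = dv * 1.0   # v = a + u, so dv/du = 1
    db = du * c     # u = bc, so du/db = c
    dc = du * b     # u = bc, so du/dc = b
    return J, {"dv": dv, "da": da, "du": du, "db": db, "dc": dc}


J, grads = forward_backward(5.0, 3.0, 2.0)
print(J, grads)

# Sanity check of db with the 0.001 nudge used in the lecture.
J_nudged, _ = forward_backward(5.0, 3.001, 2.0)
db_numeric = (J_nudged - J) / 0.001
print(db_numeric)
```

For a=5, b=3, c=2 the derivatives match the values worked out above: dv=3, da=3, du=3, db=6, dc=9.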
Logistic Regression Gradient Descent - 6m
0:00
Welcome back. In this video, we'll talk about how to compute derivatives for you to implement gradient descent for logistic regression. The key takeaways will be what you need to implement. That is, the key equations you need in order to implement gradient descent for logistic regression. In this video, I want to do this computation using the computation graph. I have to admit, using the computation graph is a little bit of an overkill for deriving gradient descent for logistic regression, but I want to start explaining things this way to get you familiar with these ideas so that, hopefully, it will make a bit more sense when we talk about full-fledged neural networks. With that, let's dive into gradient descent for logistic regression. To recap, we had set up logistic regression as follows: your prediction, y-hat, is defined as follows, where z is that. If we focus on just one example for now, then the loss with respect to that one example is defined as follows, where a is the output of logistic regression, and y is the ground truth label. Let's write this out as a computation graph, and for this example, let's say we have only two features, x1 and x2. In order to compute z, we'll need to input w1, w2, and b, in addition to the feature values x1, x2. These things, in a computation graph, get used to compute z, which is w1x1 + w2x2 + b, with a rectangular box around that. Then, we compute y-hat, or a = sigma(z), that's the next step in the computation graph, and then, finally, we compute L(a, y), and I won't copy the formula again. In logistic regression, what we want to do is to modify the parameters, w and b, in order to reduce this loss. We've described the forward propagation steps of how you actually compute the loss on a single training example, now let's talk about how you can go backwards to compute the derivatives. Here's a cleaned-up version of the diagram.
Because what we want to do is compute derivatives with respect to this loss, the first thing we want to do when going backwards is to compute the derivative of this loss with respect to this variable a. So, in the code, you just use da to denote this variable. It turns out that, if you are familiar with calculus, you could show that this ends up being da = -y/a + (1 - y)/(1 - a). The way you do that is you take the formula for the loss and, if you're familiar with calculus, you compute the derivative with respect to the variable lowercase a, and you get this formula. But if you're not familiar with calculus, don't worry about it. We'll provide the derivative formulas, and whatever else you need, throughout this course. If you are an expert in calculus, I encourage you to look up the formula for the loss from the previous slide and try taking the derivative with respect to a using calculus, but if you don't know enough calculus to do that, don't worry about it. Now, having computed this quantity da, the derivative of your final output variable with respect to a, you can then go backwards. It turns out that you can show that dz, which is the Python code variable name, is going to be the derivative of the loss with respect to z. For L, you could write the loss including a and y explicitly as parameters or not, right? Either type of notation is equally acceptable. We can show that this is equal to a - y. Just a couple of comments only for those of you who are experts in calculus; if you're not an expert in calculus, don't worry about it. It turns out that this, dL/dz, can be expressed as dL/da times da/dz, and it turns out that da/dz is a(1 - a), and dL/da we have previously worked out over here. If you take these two quantities, dL/da, which is this term, together with da/dz, which is this term, and multiply them, you can show that the equation simplifies to a - y. 
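For those who want to check the calculus, the chain-rule step can be written out as a short worked derivation. It assumes the loss L(a, y) = -(y log a + (1 - y) log(1 - a)) and a = sigma(z) as set up for logistic regression; nothing here is needed to follow the rest of the course.

```latex
\begin{aligned}
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a},
\qquad
\frac{da}{dz} = a(1-a), \\[4pt]
\frac{\partial L}{\partial z}
 &= \frac{\partial L}{\partial a}\cdot\frac{da}{dz}
  = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) \\
 &= -y(1-a) + (1-y)\,a
  = a - y .
\end{aligned}
```

So the quantity called dz in the code is just a - y.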
That's how you derive it, and this is really the chain rule that I briefly alluded to before. Feel free to go through that calculation yourself if you are knowledgeable in calculus, but if you aren't, all you need to know is that you can compute dz as a - y, and we've already done that calculus for you. Then, the final step in that computation is to go back to compute how much you need to change w and b. In particular, you can show that the derivative with respect to w1, which in the code we'll call dw1, is equal to x1 times dz. Then, similarly, dw2, which is how much you want to change w2, is x2 times dz, and db is equal to dz. If you want to do gradient descent with respect to just this one example, what you would do is the following: you would use this formula to compute dz, then use these formulas to compute dw1, dw2, and db, and then you perform these updates. w1 gets updated as w1 minus the learning rate alpha times dw1. w2 gets updated similarly, and b gets set to b minus the learning rate times db. And so, this will be one step of gradient descent with respect to a single example. You've seen how to compute derivatives and implement gradient descent for logistic regression with respect to a single training example. But to train a logistic regression model, you have not just one training example but a training set of m training examples. In the next video, let's see how you can take these ideas and apply them to learning, not just from one example, but from an entire training set.
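As a rough illustration of the single-example step described above, here is a minimal Python sketch. The function name `single_example_step`, the helper `sigmoid`, and the sample numbers are assumptions made for this example and do not come from the lecture slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def single_example_step(w1, w2, b, x1, x2, y, alpha):
    """One gradient descent step on a single example with two features."""
    # Forward propagation: z -> a -> loss
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

    # Backward propagation, using dz = a - y
    dz = a - y
    dw1 = x1 * dz
    dw2 = x2 * dz
    db = dz

    # Parameter updates with learning rate alpha
    w1 = w1 - alpha * dw1
    w2 = w2 - alpha * dw2
    b = b - alpha * db
    return w1, w2, b, loss

# Hypothetical example: all parameters start at zero, so a = 0.5 and dz = -0.5.
w1, w2, b, loss = single_example_step(0.0, 0.0, 0.0, x1=1.0, x2=2.0, y=1, alpha=0.1)
print(w1, w2, b, loss)
```

With zero initial parameters the prediction is 0.5, so the loss printed is log 2 and each parameter moves in the direction that raises the prediction toward the label y = 1.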
Gradient Descent on m Examples - 8m
0:00
In a previous video, you saw how to compute derivatives and implement gradient descent with respect to just one training example for logistic regression. Now, we want to do it for m training examples. To get started, let's remind ourselves of the definition of the cost function J. The cost function J(w, b), which you care about, is this average: one over m times the sum from i equals one through m of the loss when your algorithm outputs a^(i) on the example with label y^(i), where a^(i) is the prediction on the i-th training example, which is sigma of z^(i), which is equal to sigma of w transpose x^(i) plus b. So, what we showed on the previous slide is, for any single training example, how to compute the derivatives when you have just one training example. So dw_1, dw_2 and db, now with the superscript i to denote the corresponding values you get if you were doing what we did on the previous slide, but just using the one training example (x^(i), y^(i)); excuse me, I was missing an i there as well. So, now you notice the overall cost function is a sum, really an average because of the one over m term, of the individual losses. So, it turns out that the derivative with respect to w_1 of the overall cost function is also going to be the average of the derivatives with respect to w_1 of the individual loss terms. But previously, we have already shown how to compute this term as dw_1^(i), which we, on the previous slide, showed how to compute on a single training example. So, what you need to do is really compute these derivatives as we showed for the previous training example and average them, and this will give you the overall gradient that you can use to implement gradient descent. So, I know that was a lot of details, but let's take all of this and wrap it up into a concrete algorithm that you can implement to get logistic regression with gradient descent working. So, here's what you can do: let's initialize J equals zero, dw_1 equals zero, dw_2 equals zero, db equals zero. 
What we're going to do is use a for loop over the training set, compute the derivative with respect to each training example, and then add them up. So, here's how we do it: for i equals one through m, where m is the number of training examples, we compute z^(i) equals w transpose x^(i) plus b. The prediction a^(i) is equal to sigma of z^(i), and then let's add up J: J plus equals y^(i) log a^(i) plus one minus y^(i) times log of one minus a^(i), and then put the negative sign in front of the whole thing. Then, as we saw earlier, we have dz^(i), that's equal to a^(i) minus y^(i), and dw_1 plus equals x_1^(i) times dz^(i), dw_2 plus equals x_2^(i) times dz^(i). I'm doing this calculation assuming that you have just two features, so that n equals two; otherwise, you do this for dw_1, dw_2, dw_3 and so on. And then db plus equals dz^(i), and I guess that's the end of the for loop. Then finally, having done this for all m training examples, you will still need to divide by m because we're computing averages. So, dw_1 divide equals m, dw_2 divide equals m, db divide equals m, in order to compute averages. So, with all of these calculations, you've just computed the derivatives of the cost function J with respect to each of your parameters w_1, w_2 and b. Just a couple of details about what we're doing: we're using dw_1, dw_2 and db as accumulators, so that after this computation, dw_1 is equal to the derivative of your overall cost function with respect to w_1, and similarly for dw_2 and db. So, notice that dw_1 and dw_2 do not have a superscript i, because we're using them in this code as accumulators to sum over the entire training set. Whereas in contrast, dz^(i) here was dz with respect to just one single training example. So, that's why that had a superscript i, to refer to the one training example, i, that it is being computed on. 
So, having finished all these calculations, to implement one step of gradient descent, you will implement: w_1 gets updated as w_1 minus the learning rate times dw_1, w_2 gets updated as w_2 minus the learning rate times dw_2, and b gets updated as b minus the learning rate times db, where dw_1, dw_2 and db were as computed. Finally, J here will also be a correct value for your cost function. So, everything on the slide implements just one single step of gradient descent, and so you have to repeat everything on this slide multiple times in order to take multiple steps of gradient descent. In case these details seem too complicated, again, don't worry too much about it for now; hopefully all this will be clearer when you go and implement this in the programming assignments. But it turns out there are two weaknesses with the calculation as we've implemented it here, which is that, to implement logistic regression this way, you need to write two for loops. The first for loop is this for loop over the m training examples, and the second for loop is a for loop over all the features over here. So, in this example, we just had two features, so n is equal to two, or n_x equals two. But if you have more features, you end up writing here dw_1, dw_2, and you have similar computations for dw_3, and so on down to dw_n. So, it seems like you need to have a for loop over the features, over n features. When you're implementing deep learning algorithms, you find that having explicit for loops in your code makes your algorithm run less efficiently. So, in the deep learning era, we would move to bigger and bigger datasets, and so being able to implement your algorithms without using explicit for loops is really important and will help you to scale to much bigger datasets. So, it turns out that there is a set of techniques called vectorization techniques that allow you to get rid of these explicit for loops in your code. 
I think in the pre-deep learning era, that's before the rise of deep learning, vectorization was a nice to have, so you could sometimes do it to speed up your code and sometimes not. But in the deep learning era, vectorization, that is getting rid of for loops, like this and like this, has become really important, because we're more and more training on very large datasets, and so you really need your code to be very efficient. So, in the next few videos, we'll talk about vectorization and how to implement all this without using even a single for loop. So, with this, I hope you have a sense of how to implement logistic regression or gradient descent for logistic regression. Things will be clearer when you implement the programming exercise. But before actually doing the programming exercise, let's first talk about vectorization so that you can implement this whole thing, implement a single iteration of gradient descent without using any for loops.
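The loop-based algorithm described in this video can be sketched in Python roughly as follows. This is a minimal sketch assuming exactly two features; the function name `gradients_with_loops`, the list-of-pairs data layout, and the toy data are assumptions for illustration. Dividing J by m at the end is also an assumption, made so that J matches the averaged cost function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_loops(w1, w2, b, X, Y):
    """Cost and gradients over m examples with an explicit loop.

    X is a list of (x1, x2) pairs and Y is a list of 0/1 labels.
    """
    m = len(X)
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0   # accumulators
    for i in range(m):
        x1, x2 = X[i]
        z = w1 * x1 + w2 * x2 + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]                      # per-example quantity dz^(i)
        dw1 += x1 * dz                     # with more features, this part
        dw2 += x2 * dz                     # becomes a second loop over features
        db += dz
    # Divide by m because the cost is an average over the training set.
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m
    return J, dw1, dw2, db

# Hypothetical toy data: two examples, two features each.
X = [(1.0, 2.0), (-1.0, 0.5)]
Y = [1, 0]
J, dw1, dw2, db = gradients_with_loops(0.0, 0.0, 0.0, X, Y)

# One step of gradient descent using the averaged gradients.
alpha = 0.1
w1_new, w2_new, b_new = 0.0 - alpha * dw1, 0.0 - alpha * dw2, 0.0 - alpha * db
```

Repeating the call and the update many times gives multiple steps of gradient descent.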
Vectorization - 8m
0:00
Welcome back. Vectorization is basically the art of getting rid of explicit for loops in your code. In the deep learning era, especially in deep learning in practice, you often find yourself training on relatively large data sets, because that's when deep learning algorithms tend to shine. And so, it's important that your code runs very quickly, because otherwise, if it's running on a big data set, your code might take a long time to run and you just find yourself waiting a very long time to get the result. So in the deep learning era, I think the ability to perform vectorization has become a key skill. Let's start with an example. So, what is vectorization? In logistic regression you need to compute z equals w transpose x plus b, where w is this column vector and x is also this vector. Maybe these are very large vectors if you have a lot of features. So, w and x are both n_x-dimensional vectors, in R^(n_x). So, to compute w transpose x, if you had a non-vectorized implementation, you would do something like z equals zero, and then for i in range(n_x), so for i equals 1 through n_x, z plus equals w[i] times x[i]. And then maybe you do z plus equals b at the end. So, that's a non-vectorized implementation. You'll find that that's going to be really slow. In contrast, a vectorized implementation would just compute w transpose x directly. In Python, or NumPy, the command you use for that is z = np.dot(w, x), so this computes w transpose x. And you can also just add b to that directly. And you find that this is much faster. Let's actually illustrate this with a little demo. So, here's my Jupyter notebook in which I'm going to write some Python code. So, first, let me import the numpy library: import numpy as np. And so, for example, I can create a as an array as follows. Let's say print(a). Now, having written this chunk of code, if I hit Shift+Enter, then it executes the code. So, it created the array a and it prints it out. Now, let's do the vectorization demo. 
I'm going to import the time library, since we use that in order to time how long different operations take. Let me create an array a = np.random.rand(1000000). This creates a million-dimensional array with random values. b = np.random.rand(1000000), another million-dimensional array. And now, tic = time.time(), so this measures the current time, c = np.dot(a, b), toc = time.time(). And let's print that this is the vectorized version. Let's also print out the elapsed time, so that's (toc - tic) times 1000, so that we can express this in milliseconds. So, ms is milliseconds. I'm going to hit Shift+Enter. So, that code took about three milliseconds, or this time 1.5, maybe about 1.5 or 3.5 milliseconds at a time. It varies a little bit as I run it, but it looks like maybe on average it's taking like 1.5 milliseconds, maybe two milliseconds as I run this. All right. Let's keep adding to this block of code. Let's now implement the non-vectorized version. Let's see, c = 0, then tic = time.time(). Now, let's implement a for loop: for i in range(1000000), and let me make sure I get the number of zeros right, c += a[i] * b[i], and then toc = time.time(). Finally, print "For loop" for the explicit for loop. The time it takes is 1000 * (toc - tic), plus "ms" to note that we're doing this in milliseconds. Let's do one more thing. Let's just print out the value of c we computed, to make sure that it's the same value in both cases. I'm going to hit Shift+Enter to run this and check that out. In both cases, the vectorized version and the non-vectorized version computed the same value, as you can see, 250286.99 and so on. The vectorized version took 1.5 milliseconds. The explicit for loop, the non-vectorized version, took about 400, almost 500 milliseconds. The non-vectorized version took something like 300 times longer than the vectorized version. With this example you see that if only you remember to vectorize your code, your code actually runs over 300 times faster. Let's just run it again. 
Just run it again. Yeah. Vectorized version, 1.5 milliseconds, and the for loop, 481 milliseconds; again, about 300 times slower to do the explicit for loop. With a slowdown like that, it's the difference between your code taking maybe one minute to run versus taking, say, five hours to run. And when you are implementing deep learning algorithms, you can really get a result back faster. It will be much faster if you vectorize your code. Some of you might have heard that a lot of scalable deep learning implementations are done on a GPU, or graphics processing unit. But all the demos I did just now in the Jupyter notebook were actually on the CPU. And it turns out that both GPUs and CPUs have parallelization instructions. They're sometimes called SIMD instructions. This stands for single instruction, multiple data. What this basically means is that if you use built-in functions, such as this np.dot function or other functions that don't require you to explicitly implement a for loop, it enables Python NumPy to take much better advantage of parallelism to do your computations much faster. And this is true for both computations on CPUs and computations on GPUs. It's just that GPUs are remarkably good at these SIMD calculations, but CPUs are actually also not too bad at that. Maybe just not as good as GPUs. You've seen how vectorization can significantly speed up your code. The rule of thumb to remember is: whenever possible, avoid using explicit for loops. Let's go on to the next video to see some more examples of vectorization and also start to vectorize logistic regression.
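The timing demo narrated above can be reproduced with a short script along these lines. This is a sketch, not the exact notebook cell from the video: the variable names `c_vec` and `c_loop` and the use of a seeded random generator are assumptions, and the timings you see will depend on your machine.

```python
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)   # seeded only so that reruns give the same arrays
a = rng.random(n)
b = rng.random(n)

# Vectorized version: a single call to np.dot
tic = time.time()
c_vec = np.dot(a, b)
toc = time.time()
print("Vectorized version: " + str(1000 * (toc - tic)) + " ms")

# Non-vectorized version: explicit for loop over one million elements
c_loop = 0.0
tic = time.time()
for i in range(n):
    c_loop += a[i] * b[i]
toc = time.time()
print("For loop: " + str(1000 * (toc - tic)) + " ms")

# Both approaches should agree up to floating-point rounding.
print(c_vec, c_loop)
```

The two printed values of c should match closely, while the loop typically takes far longer than the single `np.dot` call.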
More Vectorization Examples - 6m
0:00
In the previous video you saw a few examples of how vectorization, by using built in functions and by avoiding explicit for loops, allows you to speed up your code significantly. Let's look at a few more examples.
0:13
The rule of thumb to keep in mind is, when you're programming your neural networks, or when you're programming just logistic regression, whenever possible avoid explicit for-loops. It's not always possible to never use a for-loop, but when you can use a built-in function or find some other way to compute whatever you need, you'll often go faster than if you have an explicit for-loop. Let's look at another example. If ever you want to compute a vector u as the product of the matrix A and another vector v, then the definition of matrix multiplication is that u_i is equal to the sum over j of A_ij times v_j. That's how you define u_i. And so the non-vectorized implementation of this would be to set u = np.zeros((n, 1)). For i, and so on. For j, and so on.
1:16
And then u[i] plus equals A[i][j] times v[j]. So now, this is two for-loops, looping over both i and j. So, that's the non-vectorized version. The vectorized implementation is to say u = np.dot(A, v). And the implementation on the right, the vectorized version, now eliminates two different for-loops, and it's going to be way faster. Let's go through one more example. Let's say you already have a vector, v, in memory and you want to apply the exponential operation on every element of this vector v. So you want u to be the vector that's e to the v1, e to the v2, and so on, down to e to the vn. So this would be a non-vectorized implementation, which is that first you initialize u to the vector of zeros, and then you have a for-loop that computes the elements one at a time. But it turns out that Python and NumPy have many built-in functions that allow you to compute these vectors with just a single call to a single function. So what I would do to implement this is import
2:36
numpy as np, and then you just call u = np.exp(v). And so, notice that, whereas previously you had that explicit for-loop, with just one line of code here, with v as an input vector and u as an output vector, you've gotten rid of the explicit for-loop, and the implementation on the right will be much faster than the one needing an explicit for-loop. In fact, the NumPy library has many of these vector-valued functions. So np.log(v) will compute the element-wise log, np.abs(v) computes the absolute value, and np.maximum(v, 0) computes the element-wise maximum, taking the max of every element of v with 0. v**2 just takes the element-wise square of each element of v. 1/v takes the element-wise inverse, and so on. So, whenever you are tempted to write a for-loop, take a look and see if there's a way to call a NumPy built-in function to do it without that for-loop.
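To make these examples concrete, here is a small sketch comparing the double-loop matrix-vector product with `np.dot`, followed by the element-wise functions mentioned above. The matrix `A`, the vector `v`, and the result names are made-up values for illustration only.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, -2.0])

# Non-vectorized matrix-vector product: u_i = sum over j of A_ij * v_j
u_loop = np.zeros(2)
for i in range(2):
    for j in range(2):
        u_loop[i] += A[i][j] * v[j]

# Vectorized version: one call replaces both loops
u_vec = np.dot(A, v)

# Element-wise built-ins, each replacing an explicit loop over v
exp_v = np.exp(v)             # e^(v_i) for every element
abs_v = np.abs(v)             # absolute value of every element
relu_v = np.maximum(v, 0)     # max(v_i, 0) for every element
sq_v = v ** 2                 # element-wise square
inv_v = 1 / v                 # element-wise inverse
log_abs_v = np.log(abs_v)     # element-wise log (inputs must be positive)

print(u_loop, u_vec)
```

Note that `np.log` is applied to `abs_v` here only because the sample vector contains a negative entry.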
3:53
So, let's take all of these learnings and apply them to our logistic regression gradient descent implementation, and see if we can at least get rid of one of the two for-loops we had. So here's our code for computing the derivatives for logistic regression, and we had two for-loops. One was this one up here, and the second one was this one. So in our example we had n_x equals 2, but if you had more features than just 2 features then you'd need to have a for-loop over dw1, dw2, dw3, and so on. So it's as if there's actually a for j equals 1 through n_x, where dw_j gets updated. So we'd like to eliminate this second for-loop. That's what we'll do on this slide. So the way we'll do so is that instead of explicitly initializing dw1, dw2, and so on to zeros, we're going to get rid of this and instead make dw a vector. So we're going to set dw = np.zeros((n_x, 1)), so let's make this an n_x by 1 dimensional vector.
5:11
Then, here, instead of this for loop over the individual components, we'll just use this vector value operation, dw plus equals xi times dz(i). And then finally, instead of this, we will just have dw divides equals m. So now we've gone from having two for-loops to just one for-loop. We still have this one for-loop that loops over the individual training examples.
5:49
So I hope this video gave you a sense of vectorization. And by getting rid of one for-loop your code will already run faster. But it turns out we could do even better. So the next video will talk about how to vectorize logistic regression even further. And you'll see a pretty surprising result, that without using any for-loops, without needing a for-loop over the training examples, you could write code to process the entire training set, pretty much all at the same time. So, let's see that in the next video.
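A sketch of the intermediate version described in this video, where dw is a vector and only the loop over training examples remains, might look like this. The function name `gradients_one_loop`, the column-per-example layout of `X`, and the toy data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_one_loop(w, b, X, Y):
    """X has shape (n_x, m), one column per example; Y has shape (1, m)."""
    n_x, m = X.shape
    J = 0.0
    dw = np.zeros((n_x, 1))        # replaces the separate dw1, dw2, ...
    db = 0.0
    for i in range(m):             # the one remaining loop, over examples
        x_i = X[:, i].reshape(n_x, 1)
        y_i = Y[0, i]
        z_i = np.dot(w.T, x_i).item() + b
        a_i = sigmoid(z_i)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        dw += x_i * dz_i           # vector update, no loop over features
        db += dz_i
    J /= m
    dw /= m
    db /= m
    return J, dw, db

# Hypothetical toy data: n_x = 2 features, m = 2 examples.
X = np.array([[1.0, -1.0],
              [2.0,  0.5]])
Y = np.array([[1, 0]])
w = np.zeros((2, 1))
J, dw, db = gradients_one_loop(w, 0.0, X, Y)
print(J, dw.ravel(), db)
```

The line `dw += x_i * dz_i` does the work that previously needed one update per feature.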
Vectorizing Logistic Regression - 7m
0:00
We have talked about how vectorization lets you speed up your code significantly. In this video, we'll talk about how you can vectorize the implementation of logistic regression, so that it can process an entire training set, that is, implement a single iteration of gradient descent with respect to an entire training set, without using even a single explicit for loop. I'm super excited about this technique, and we'll use it again when we talk about neural networks later. Let's get started. Let's first examine the forward propagation steps of logistic regression. So, if you have m training examples, then to make a prediction on the first example, you need to compute z^(1) using this familiar formula, then compute the activation a^(1) for the first example. Then to make a prediction on the second training example, you need to compute that. Then, to make a prediction on the third example, you need to compute that, and so on. And you might need to do this m times, if you have m training examples. So, it turns out that in order to carry out the forward propagation step, that is, to compute these predictions on our m training examples, there is a way to do so without needing an explicit for loop. Let's see how you can do it. First, remember that we defined a matrix capital X to be your training inputs, stacked together in different columns like this. So, this is a matrix, that is an n_x by m matrix. So, I'm writing this as a Python NumPy shape; this just means that X is an n_x by m dimensional matrix. Now, the first thing I want to do is show how you can compute z^(1), z^(2), z^(3) and so on, all in one step, in fact, with one line of code. So, I'm going to construct a 1 by m matrix, that's really a row vector, in which I'm going to compute z^(1), z^(2), and so on, down to z^(m), all at the same time. It turns out that this can be expressed as w transpose times the capital matrix X, plus this vector b, b, and so on. 
Here, this b, b, ..., b thing is a 1 by m vector, or 1 by m matrix, that is, an m-dimensional row vector. So hopefully you're familiar with matrix multiplication. You might see that with w transpose times x^(1), x^(2) and so on to x^(m), w transpose will be a row vector like that. And so this first term will evaluate to w transpose x^(1), w transpose x^(2) and so on, dot, dot, dot, w transpose x^(m), and then when we add this second term b, b, b, and so on, you end up adding b to each element. So you end up with another 1 by m vector. That's the first element, that's the second element and so on, and that's the m-th element. And if you refer to the definitions above, this first element is exactly the definition of z^(1). The second element is exactly the definition of z^(2), and so on. So just as capital X was obtained when you took your training examples and stacked them next to each other, stacked them horizontally, I'm going to define capital Z to be this, where you take the lowercase z's and stack them horizontally. So when you stack the lowercase x's corresponding to different training examples horizontally you get this variable capital X, and in the same way, when you take these lowercase z variables and stack them horizontally, you get this variable capital Z. And it turns out that in order to implement this, the NumPy command is Z = np.dot(w.T, X) + b, where w.T is w transpose. Now there is a subtlety in Python, which is that here b is a real number, or if you want to say, a 1 by 1 matrix; it's just a normal real number. But when you add this vector to this real number, Python automatically takes this real number b and expands it out to this 1 by m row vector. So in case this operation seems a little bit mysterious, this is called broadcasting in Python, and you don't have to worry about it for now; we'll talk about it some more in the next video. 
But the takeaway is that with just one line of code, with this line of code, you can calculate capital Z, and capital Z is going to be a 1 by m matrix that contains all of the lowercase z's, lowercase z^(1) through lowercase z^(m). So that was Z; how about these values a? What we'd like to do next is find a way to compute a^(1), a^(2) and so on to a^(m), all at the same time. And just as stacking lowercase x's resulted in capital X, and stacking lowercase z's horizontally resulted in capital Z, stacking lowercase a's is going to result in a new variable, which we are going to define as capital A. And in the programming assignment, you'll see how to implement a vector-valued sigmoid function, so that the sigmoid function takes this capital Z as input and very efficiently outputs capital A. So you'll see the details of that in the programming assignment. So just to recap, what we've seen on this slide is that instead of needing to loop over m training examples to compute lowercase z and lowercase a one at a time, you can implement this one line of code to compute all these z's at the same time. And then this one line of code, with an appropriate implementation of lowercase sigma, computes all the lowercase a's at the same time. So this is how you implement a vectorized implementation of forward propagation for all m training examples at the same time. So to summarize, you've just seen how you can use vectorization to very efficiently compute all of the activations, all the lowercase a's, at the same time. Next, it turns out you can also use vectorization very efficiently to compute the backward propagation, to compute the gradients. Let's see how you can do that in the next video.
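The vectorized forward propagation described above can be sketched as follows. The sample values of `w`, `b`, and `X`, and the simple `sigmoid` helper, are assumptions for illustration; the programming assignment may define its sigmoid differently.

```python
import numpy as np

def sigmoid(Z):
    """Element-wise sigmoid; works on scalars and on whole arrays."""
    return 1.0 / (1.0 + np.exp(-Z))

# Hypothetical values: n_x = 2 features, m = 3 examples stacked as columns.
w = np.array([[0.5],
              [-0.25]])            # shape (n_x, 1)
b = 0.1                            # a plain real number
X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.5, 0.0]])   # shape (n_x, m)

# One line computes every z^(i); the scalar b is broadcast across the row.
Z = np.dot(w.T, X) + b             # shape (1, m)
A = sigmoid(Z)                     # shape (1, m), all activations at once

# Same result with an explicit loop, for comparison only.
Z_loop = np.zeros((1, X.shape[1]))
for i in range(X.shape[1]):
    Z_loop[0, i] = np.dot(w.T, X[:, i]).item() + b

print(Z, A)
```

Both `Z` and `A` come out as 1 by m row vectors, one entry per training example.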
Vectorizing Logistic Regression's Gradient Output - 9m
0:00
In the previous video, you saw how you can use vectorization to compute the predictions, the lowercase a's, for an entire training set all at the same time. In this video, you'll see how you can use vectorization to also perform the gradient computations for all m training examples, again, all sort of at the same time. And then at the end of this video, we'll put it all together and show how you can derive a very efficient implementation of logistic regression. So, you may remember that for the gradient computation, what we did was compute dz^(1) for the first example, which would be a^(1) minus y^(1), and then dz^(2) equals a^(2) minus y^(2), and so on for all m training examples. So, what we're going to do is define a new variable, dZ, which is going to be dz^(1), dz^(2), ..., dz^(m). Again, all the lowercase dz variables stacked horizontally. So, this would be a 1 by m matrix, or alternatively an m-dimensional row vector. Now recall that from the previous slide, we'd already figured out how to compute capital A, which was this: a^(1) through a^(m), and we had defined capital Y as y^(1) through y^(m), also stacked horizontally. So, based on these definitions, maybe you can see for yourself that dZ can be computed as just A minus Y, because it's going to be equal to a^(1) - y^(1) as the first element, a^(2) - y^(2) as the second element, and so on. And so this first element, a^(1) - y^(1), is exactly the definition of dz^(1). The second element is exactly the definition of dz^(2), and so on. So, with just one line of code, you can compute all of this at the same time. Now, in the previous implementation, we'd gotten rid of one for loop already, but we still had this second for loop over the training examples. So we initialize dw to a vector of zeros. But then we still have to loop over the training examples, where we have dw plus equals x^(1) times dz^(1) for the first training example, dw plus equals x^(2) times dz^(2), and so on. So we do that m times, and then dw divide equals m, and similarly for b, right? 
db was initialized as 0, and db plus equals dz^(1), db plus equals dz^(2), down to dz^(m), and then db divide equals m. So that's what we had in the previous implementation. We'd already got rid of one for loop. So, at least now dw is a vector and we weren't separately updating dw1, dw2 and so on. So, we got rid of that already, but we still had the for loop over the m examples in the training set. So, let's take these operations and vectorize them. Here's what we can do. What the computation of db is doing is basically summing up all of these dz's and then dividing by m. So, db is basically one over m times the sum from i equals one through m of dz^(i), and all the dz's are in that row vector, so in Python, what you do is implement 1/m times np.sum(dZ). So, you just take this variable and call the np.sum function on it, and that would give you db. How about dw? Let me just write out the correct equation, and you can verify that it's the right thing to do. dw turns out to be one over m times the matrix X times dZ transpose. And let's kind of see why that's the case. This is equal to one over m times the matrix X, that's x^(1) through x^(m) stacked up in columns like that, and dZ transpose is going to be dz^(1) down to dz^(m), like so. And so, if you figure out what this matrix times this vector works out to be, it turns out to be one over m times (x^(1) dz^(1) plus ... plus x^(m) dz^(m)). And so, this is an n by 1 vector, and this is what you actually end up with for dw, because dw was taking these x^(i) dz^(i) and adding them up, and that's exactly what this matrix-vector multiplication is doing. And so again, with one line of code you can compute dw. So, the vectorized implementation of the derivative calculations is just this: you use this line to implement db and use this line to implement dw, and notice that without a for loop over the training set, you can now compute the updates you want to your parameters. 
So now, let's put it all together into how you would actually implement logistic regression. So, this is our original, highly inefficient non-vectorized implementation. So, the first thing we did in the previous video was get rid of this for loop, right? So, instead of looping over dw1, dw2 and so on, we replaced this with a vector-valued dw, which is dw += x^(i), which is now a vector, times dz^(i). But now, we will see that we can also get rid of not just the for loop below but also this for loop. So, here is how you do it. Using what we have from the previous slides, you would say capital Z is equal to w transpose X + b, and the code you write is Z = np.dot(w.T, X) + b, and then A equals sigmoid of capital Z. So, you have now computed all of this and all of this for all the values of i. Next, on the previous slide, we said you would compute dZ = A - Y. So, now you've computed all of this for all the values of i. Then, finally, dw = (1/m) X dZ transpose, and db = (1/m) np.sum(dZ). So, you've just done forward propagation and back propagation, really computing the predictions and computing the derivatives on all m training examples, without using a for loop. And so the gradient descent update then would be: w gets updated as w minus the learning rate times dw, which was just computed above, and b gets updated as b minus the learning rate times db. Sometimes I write colon-equals to denote that this is an assignment, but I guess I haven't been totally consistent with that notation. But with this, you have just implemented a single iteration of gradient descent for logistic regression. Now, I know I said that we should get rid of explicit for loops whenever you can, but if you want to implement multiple iterations of gradient descent, then you still need a for loop over the number of iterations. So, if you want to have a thousand iterations of gradient descent, you might still need a for loop over the iteration number. 
There is an outermost full loop like that then I don't think there is any way to get rid of that full loop. But I do think it's incredibly cool that you can implement at least one iteration of gradient descent without needing to use a full loop. So, that's it you now have a highly vectorize and highly efficient implementation of gradient descent for logistic regression. There is just one more detail that I want to talk about in the next video, which is in our description here I briefly alluded to this technique called broadcasting. Broadcasting turns out to be a technique that Python and numpy allows you to use to make certain parts of your code also much more efficient. So, let's see some more details of broadcasting in the next video.
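The steps in this video can be put together in a short sketch. The function name `train`, the learning rate, the iteration count, and the toy data set are all assumptions chosen for illustration; only the update equations come from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate=0.1, num_iterations=1000):
    """X: (n, m) inputs stacked in columns, Y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(num_iterations):       # the one loop that remains
        Z = np.dot(w.T, X) + b            # (1, m); the scalar b is broadcast
        A = sigmoid(Z)                    # predictions for all m examples
        dZ = A - Y
        dw = (1 / m) * np.dot(X, dZ.T)    # (n, 1)
        db = (1 / m) * np.sum(dZ)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b

# Toy data (hypothetical): the label is 1 when the single feature is positive
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)
```

Inside the loop body there is no iteration over examples or features; the only explicit loop is the one over gradient descent iterations.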
Broadcasting in Python - 11m
0:00
In the previous video, I mentioned that broadcasting is another technique that you can use to make your Python code run faster. In this video, let's delve into how broadcasting in Python actually works. Let's explore broadcasting with an example. In this matrix, I've shown the number of calories from carbohydrates, proteins, and fats in 100 grams of four different foods. So for example, 100 grams of apples turns out to have 56 calories from carbs, and much less from proteins and fats. Whereas, in contrast, 100 grams of beef has 104 calories from protein and 135 calories from fat. Now, let's say your goal is to calculate the percentage of calories from carbs, proteins and fats for each of the four foods. So, for example, if you look at this column and add up the numbers in that column, you get that 100 grams of apple has 56 plus 1.2 plus 1.8, so that's 59 calories. And so the percentage of calories from carbohydrates in an apple would be 56 over 59, that's about 94.9%. So most of the calories in an apple come from carbs, whereas in contrast, most of the calories of beef come from protein and fat, and so on. So the calculation you want is really to sum up each of the four columns of this matrix to get the total number of calories in 100 grams of apples, beef, eggs, and potatoes. And then to divide throughout the matrix,
1:47
so as to get the percentage of calories from carbs, proteins and fats for each of the four foods. So the question is, can you do this without an explicit for-loop? Let's take a look at how you could do that.
2:04
What I'm going to do is show you how you can set, say, this matrix equal to a three by four matrix A. And then with one line of Python code we're going to sum down the columns. So we're going to get four numbers corresponding to the total number of calories in 100 grams of these four different types of foods. And I'm going to use a second line of Python code to divide each of the four columns by their corresponding sum. If that verbal description wasn't very clear, hopefully it will be clearer in a second when we look at the Python code. So here we are in the Jupyter notebook. I've already written this first piece of code to prepopulate the matrix A with the numbers we had just now, so we'll hit shift+enter and just run that, so there's the matrix A. And now here are the two lines of Python code. First, we're going to compute cal = A.sum(axis=0). The axis=0 means to sum vertically. We'll say more about that in a little bit. And then print cal. So we've summed vertically. Now 59 is the total number of calories in the apple, 239 is the total number of calories in the beef, and then the eggs and potatoes and so on. And then we'll compute percentage = A / cal.reshape(1, 4). Actually we want percentages, so multiply by 100 here.
3:35
And then let's print percentage.
3:40
Let's run that. With that command we've taken the matrix A and divided it by this one by four matrix. And this gives us the matrix of percentages. So as we worked out by hand just now, in the apple, the first column, 94.9% of the calories are from carbs. Let's go back to the slides. So just to repeat the two lines of code we had, this is what we have written out in the Jupyter notebook. To add a bit of detail, this parameter, (axis = 0), means that you want Python to sum vertically. So axis 0 means to sum vertically, whereas the horizontal axis is axis 1. So you'd be able to write axis 1 to sum horizontally instead of summing vertically. And then this command here is an example of Python broadcasting, where you take a matrix A, so this is a three by four matrix, and you divide it by a one by four matrix. And technically, after this first line of code, the variable cal already holds the four column sums in a form that broadcasts against A just like a one by four matrix (in NumPy its shape is (4,)). So technically you don't need to call reshape here, so that's actually a little bit redundant. But when I'm writing Python code, if I'm not entirely sure what the dimensions of a matrix are, I often just call the reshape command to make sure that it's the right column vector or row vector or whatever I want it to be. The reshape command runs in constant time. It's an order one operation that's very cheap to call. So don't be shy about using the reshape command to make sure that your matrices are the size you need them to be.
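The two lines from the notebook can be tried out as follows. The apple and beef columns use the numbers quoted in the lecture; the egg and potato columns are illustrative values assumed here, since the transcript does not state them. Note that in NumPy `A.sum(axis=0)` returns an array of shape `(4,)`, so the `reshape(1, 4)` makes the one by four shape explicit.

```python
import numpy as np

# Rows: carbs, proteins, fats. Columns: apples, beef, eggs, potatoes.
# Apple and beef values come from the lecture; the egg and potato
# columns are illustrative assumptions.
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                        # sum down each column
percentage = 100 * A / cal.reshape(1, 4)   # (3, 4) divided by (1, 4)

print(cal)
print(percentage)
```

Each column of `percentage` should sum to 100, which is a quick sanity check that the division was applied column by column.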
5:21
Now, let's explain in greater detail how this type of operation works, right? We had a three by four matrix and we divided it by a one by four matrix. So, how can you divide a three by four matrix by a one by four matrix? Or by one by four vector?
5:40
Let's go through a few more examples of broadcasting. If you take a 4 by 1 vector and add it to a number, what Python will do is take this number and auto-expand it into a four by one vector as well, as follows. And so the vector [1, 2, 3, 4] plus the number 100 ends up with that vector on the right. You're adding 100 to every element. This type of broadcasting works with both column vectors and row vectors, and in fact we used a similar form of broadcasting in an earlier video, with the constant we're adding to a vector being the parameter b in logistic regression. Here's another example. Let's say you have a two by three matrix and you add it to this one by n matrix.
6:40
So the general case would be if you have some (m,n) matrix here and you add it to a (1,n) matrix. What Python will do is copy the (1,n) matrix m times to turn it into an m by n matrix. So instead of this one by three matrix, it'll copy it twice in this example to turn it into this, also a two by three matrix, and then add these, so you'll end up with the sum on the right, okay? So you've added 100 to the first column, added 200 to the second column, and added 300 to the third column. And this is basically what we did on the previous slide, except that we used a division operation instead of an addition operation.
7:34
So one last example, where you have an (m,n) matrix and you add this to an (m,1) vector, an (m,1) matrix.
7:47
Then just copy this n times horizontally. So you end up with an (m,n) matrix. So as you can imagine you copy it horizontally three times. And you add those. So when you add them you end up with this. So we've added 100 to the first row and added 200 to the second row.
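The three examples above can be reproduced in a few lines. The vector [1, 2, 3, 4] and the added constants (100, 200, 300) are from the lecture; the entries of the two by three matrix `M` are assumed here for illustration.

```python
import numpy as np

# (4, 1) column vector plus a number: 100 is added to every element
col = np.array([[1], [2], [3], [4]])
print(col + 100)

# (2, 3) matrix plus a (1, 3) row: the row is applied to every row of M
M = np.array([[1, 2, 3],
              [4, 5, 6]])          # illustrative entries
row = np.array([[100, 200, 300]])
print(M + row)

# (2, 3) matrix plus a (2, 1) column: the column is applied to every column of M
column = np.array([[100], [200]])
print(M + column)
```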
8:08
Here's the more general principle of broadcasting in Python. If you have an (m,n) matrix and you add or subtract or multiply or divide with a (1,n) matrix, then this will copy it m times into an (m,n) matrix. And then apply the addition, subtraction, multiplication, or division element-wise.
8:37
If, conversely, you were to take the (m,n) matrix and add, subtract, multiply, or divide by an (m,1) matrix, then this would also copy it, now n times, and turn that into an (m,n) matrix and then apply the operation element-wise. Just one more form of broadcasting, which is if you have an (m,1) matrix, so that's really a column vector like [1,2,3], and you add, subtract, multiply or divide by a real number. So maybe a (1,1) matrix. So such as that plus 100, then you end up copying this real number m times until you also get another (m,1) matrix. And then you perform the operation, such as addition in this example, element-wise. And something similar also works for row vectors.
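One way to check the "copy, then operate element-wise" description is to do the copying by hand with `np.tile` and compare. This is only a sketch of the mental model, with made-up values; NumPy does not need to materialize the copies internally. The last part shows that shapes which cannot be stretched to match raise a `ValueError`.

```python
import numpy as np

m, n = 2, 3
M = np.arange(m * n).reshape(m, n).astype(float)   # illustrative (m, n) matrix
row = np.array([[10.0, 20.0, 30.0]])               # shape (1, n)
col = np.array([[1.0], [2.0]])                     # shape (m, 1)

# Broadcasting matches explicitly copying the row m times / the column n times
same_row = np.array_equal(M * row, M * np.tile(row, (m, 1)))
same_col = np.array_equal(M - col, M - np.tile(col, (1, n)))
print(same_row, same_col)

# A (1, 4) row cannot be broadcast against a (2, 3) matrix
try:
    M + np.ones((1, 4))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```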
9:38
The fully general version of broadcasting can do even a little bit more than this. If you're interested you can read the documentation for NumPy, and look at broadcasting in that documentation. That gives an even slightly more general definition of broadcasting. But the ones on the slide are the main forms of broadcasting that you end up needing to use when you implement a neural network.
10:03
Before we wrap up, just one last comment, which is for those of you that are used to programming in either MATLAB or Octave. If you've ever used the MATLAB or Octave function bsxfun in neural network programming, bsxfun does something similar, though not quite the same. It is often used for a similar purpose as what we use broadcasting in Python for. But this is really only for very advanced MATLAB and Octave users. If you've not heard of this, don't worry about it. You don't need to know it when you're coding up neural networks in Python. So, that was broadcasting in Python. I hope that when you do the programming homework, broadcasting will allow you to not only make your code run faster, but also help you get what you want done with fewer lines of code.
10:50
Before you dive into the programming exercise, I want to share with you just one more set of ideas, which is that there are some tips and tricks that I've found reduce the number of bugs in my Python code and that I hope will help you too. So with that, let's talk about that in the next video.
A note on python/numpy vectors - 6m
0:00
The ability of Python to allow you to use broadcasting operations and, more generally, the great flexibility of the Python-NumPy programming language is, I think, both a strength as well as a weakness of the programming language. I think it's a strength because it creates expressivity in the language. The great flexibility of the language lets you get a lot done even with just a single line of code. But it's also a weakness because with broadcasting and this great amount of flexibility, sometimes it's possible to introduce very subtle bugs or very strange looking bugs if you're not familiar with all of the intricacies of how features like broadcasting work. For example, if you take a column vector and add it to a row vector, you would expect it to throw a dimension mismatch or type error or something. But you might actually get back a matrix as the sum of a row vector and a column vector. So there is an internal logic to these strange effects of Python. But if you're not familiar with Python, I've seen some students have very strange, very hard to find bugs. So what I want to do in this video is share with you a couple of tips and tricks that have been very useful for me to eliminate or simplify all the strange looking bugs in my own code. And I hope that with these tips and tricks, you'll also be able to much more easily write bug-free Python and NumPy code.
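The column-plus-row case mentioned above is easy to reproduce. In this small sketch (the values are made up), a (3, 1) column and a (1, 2) row are added without any error, and the result is a (3, 2) matrix.

```python
import numpy as np

col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([[10, 20]])        # shape (1, 2)

total = col + row                 # no error: both operands are broadcast to (3, 2)
print(total.shape)
print(total)
```

If the intent was an element-wise sum of two vectors of the same length, this silent expansion is exactly the kind of bug that can go unnoticed.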
1:25
To illustrate one of the less intuitive effects of Python-NumPy, especially how you construct vectors in Python-NumPy, let me do a quick demo. Let's set a = np.random.randn(5), so this creates five random Gaussian variables stored in array a. And so let's print(a), and now it turns out that the shape of a when you do this is this (5,) structure. This is called a rank 1 array in Python, and it's neither a row vector nor a column vector. And this leads it to have some slightly non-intuitive effects. So for example, if I print a transpose, it ends up looking the same as a. So a and a transpose end up looking the same. And if I print the product of a and a transpose using np.dot, you might think a times a transpose is the outer product and should give you a matrix. But if I do that, you instead get back a number. So what I would recommend is that when you're coding neural networks, you just not use data structures where the shape is (5,) or (n,), a rank 1 array. Instead, if you set a to be np.random.randn(5,1), then this commits a to be a (5,1) column vector. And whereas previously a and a transpose looked the same, now a transpose is a row vector. Notice one subtle difference. In this data structure, there are two square brackets when we print a transpose, whereas previously there was one square bracket. So that's the difference between this, which is really a 1 by 5 matrix, and one of these rank 1 arrays. And if you print, say, the product between a and a transpose, then this gives you the outer product of a vector, right? And so, the outer product of a vector gives you a matrix. So, let's look in greater detail at what we just saw here. The first command that we ran just now was this. And this created a data structure where a.shape was this funny thing (5,), so this is called a rank 1 array. And this is a very funny data structure.
It doesn't behave consistently as either a row vector or a column vector, which makes some of its effects nonintuitive. So what I'm going to recommend is that when you're doing your programming exercises, or in fact when you're implementing logistic regression or neural networks, you just do not use these rank 1 arrays.
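The demo can be re-run with the sketch below. It assumes the product in the demo is computed with `np.dot`, which is what makes a rank 1 array times its transpose come out as a single number; the variable names `inner`, `b`, and `outer` are introduced here for illustration.

```python
import numpy as np

a = np.random.randn(5)           # rank 1 array
print(a.shape)                   # shape has a single entry
print(a.T.shape)                 # transposing a rank 1 array changes nothing
inner = np.dot(a, a.T)           # a single number, not a matrix
print(np.ndim(inner))

b = np.random.randn(5, 1)        # committed to a (5, 1) column vector
print(b.T.shape)                 # the transpose is a (1, 5) row vector
outer = np.dot(b, b.T)           # outer product
print(outer.shape)
```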
4:21
Instead, every time you create an array, commit to making it either a column vector, so this creates a (5,1) vector, or a row vector; then the behavior of your vectors may be easier to understand. In this case, a.shape is going to be equal to (5,1). This looks a lot like a rank 1 array, but in fact it is a column vector, which is why you can think of it as a (5,1) matrix. And here a.shape is going to be (1,5), and this behaves consistently as a row vector.
5:02
So when you need a vector, I would say either use this or this, but not a rank 1 array. One more thing that I do a lot in my code is, if I'm not entirely sure what the dimension of one of my vectors is, I'll often throw in an assertion statement like this to make sure, in this case, that this is a (5,1) vector. So this is a column vector. These assertions are really inexpensive to execute, and they also help to serve as documentation for your code. So don't hesitate to throw in assertion statements like this whenever you feel like it. And then finally, if for some reason you do end up with a rank 1 array, you can reshape it, a = a.reshape, into say a (5,1) array or a (1,5) array so that it behaves more consistently as either a column vector or a row vector. I've sometimes seen students end up with very hard to track down bugs because of the nonintuitive effects of rank 1 arrays. By eliminating rank 1 arrays in my own code, I think my code became simpler. And I did not actually find it restrictive in terms of things I could express in code. I just never use a rank 1 array. And so the takeaways are: to simplify your code, don't use rank 1 arrays. Always use either n by one matrices, basically column vectors, or one by n matrices, basically row vectors. Feel free to toss in a lot of assertion statements to double-check the dimensions of your matrices and arrays. And also, don't be shy about calling the reshape operation to make sure that your matrices or your vectors are the dimension that you need them to be. So with that, I hope that this set of suggestions helps you to eliminate a cause of bugs from your Python code, and makes the programming exercises easier for you to complete.
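The two habits recommended here, asserting shapes and reshaping rank 1 arrays, might look like this in practice. The names `r`, `r_col`, and `r_row` are hypothetical.

```python
import numpy as np

a = np.random.randn(5, 1)
assert a.shape == (5, 1)          # cheap check that also documents the intent

r = np.random.randn(5)            # a rank 1 array that slipped in somehow
r_col = r.reshape(5, 1)           # commit to a column vector
r_row = r.reshape(1, 5)           # or commit to a row vector
assert r_col.shape == (5, 1)
assert r_row.shape == (1, 5)
```

Reshaping keeps the same five values in the same order; only the shape metadata changes.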
Quick tour of Jupyter/iPython Notebooks - 3m
0:00
With everything you've learned, you're just about ready to tackle your first programming assignment. Before you do that, let me just give you a quick tour of iPython notebooks in Coursera. Here you see a Jupyter iPython notebook that you can get to on Coursera. Let me just quickly show you a few features of this. The instructions are written right here in the text in the iPython notebook. And these long light gray blocks are blocks of code. Occasionally, you'll see in these blocks something that says START CODE HERE and END CODE HERE. To do your exercise, please make sure to write your code between the START CODE HERE and END CODE HERE. So, for example, print Hello world. And then to execute a code block, you can hit shift+enter, and that executes this code block in which, I guess, we just wrote print Hello world. So that prints Hello World. To run one of these code blocks, or cells, you can also click Cell and then Run Cell. So that executes this. It's possible that on your computer, the keyboard shortcut for Cell, Run Cell might be different than shift+enter. But on both my Mac as well as my PC it is shift+enter, so it might be the same for you as well. Now when you're reading the instructions, if you accidentally double click on them, you might end up with this Markdown language. If you end up with this funny looking text, to convert it back to the nice looking text just run this cell. So you can go to Cell, Run Cell, or I'm going to hit shift+enter, and that basically executes the Markdown and turns it back into this nice looking text. Just a couple more tips. When you execute code like this, it actually runs on a kernel, on a piece of code that runs on the server.
If you're running an excessively large job, or if you leave your computer for a very long time, or something goes wrong with your internet connection or something, there is a small chance that the kernel on the back end might die, in which case, just click Kernel and then Restart Kernel. And hopefully, that will reboot the kernel and make it work again. That shouldn't happen if you're just running relatively small jobs and you're just starting up an iPython notebook. If you see an error message that the kernel has died or something, you can try Kernel, Restart. Finally, in an iPython notebook like this, there may be multiple blocks of code. So even if an earlier block of code doesn't ask you to write any code, make sure to execute this block of code because, in this example, it imports numpy as np and so on, and sets up some of the variables that you might need in order to execute the lower down blocks of code. So be sure to execute the ones on top even if you aren't asked to write any code in them. And finally, when you're done implementing your solutions, there's this blue Submit Assignment button here on the upper right, and you click that to submit your solutions for grading. I've found the interactive command shell nature of iPython notebooks to be very useful for learning quickly: implement a few lines of code, see an outcome, learn and add very quickly. And so I hope that from the exercises in Coursera, Jupyter iPython notebooks will help you quickly learn and experiment and see how to implement these algorithms. There's one more video after this. This is an optional video that talks about the cost function for logistic regression. You can watch that or not. Either way is perfectly fine. But either way, best of luck with the week 2 programming assignments. And I also look forward to seeing you at the start of week three.
Explanation of logistic regression cost function (optional) - 7m
0:00
In an earlier video, I've written down a form for the cost function for logistic regression. In this optional video, I want to give you a quick justification for why we like to use that cost function for logistic regression. To quickly recap, in logistic regression, we have that the prediction y hat is sigmoid of w transpose x + b, where sigmoid is this familiar function. And we said that we want to interpret y hat as the p( y = 1 | x). So we want our algorithm to output y hat as the chance that y = 1 for a given set of input features x. So another way to say this is that if y is equal to 1 then the chance of y given x is equal to y hat. And conversely if y is equal to 0 then
1:00
the chance that y is 0 is 1 - y hat, right? So if y hat is the chance that y = 1, then 1 - y hat is the chance that y = 0. So, let me take these last two equations and just copy them to the next slide. What I'm going to do is take these two equations, which basically define p(y|x) for the two cases of y = 0 or y = 1, and summarize them into a single equation. And just to point out, y has to be either 0 or 1 because in binary classification, y = 0 or 1 are the only two possible cases, all right. We're going to take these two equations and summarize them as follows. Let me just write out what it looks like, then we'll explain why it looks like that. So p(y|x) = y hat to the power of y, times (1 - y hat) to the power of (1 - y). It turns out this one line summarizes the two equations on top. Let me explain why. In the first case, suppose y = 1, right? If y = 1 then this first term ends up being y hat, because that's y hat to the power of 1. The second term ends up being 1 - y hat to the power of 1 - 1, so that's the power of 0. But anything to the power of 0 is equal to 1, so that goes away. And so this equation is just p(y|x) = y hat when y = 1. So that's exactly what we wanted. Now how about the second case, what if y = 0? If y = 0, then the equation above is p(y|x) = y hat to the 0, but anything to the power of 0 is equal to 1, so that's just equal to 1 times 1 - y hat to the power of 1 - y. And 1 - y is 1 - 0, so this exponent is just 1. And so this is equal to 1 times (1 - y hat) = 1 - y hat.
3:10
And so here we have that the y = 0, p (y|x) = 1- y hat, which is exactly what we wanted above. So what we've just shown is that this equation
3:25
is a correct definition for p(y|x). Now, finally, because the log function is a strictly monotonically increasing function, maximizing log p(y|x) gives you the same result as maximizing p(y|x). And if you compute log of p(y|x), that's equal to log of y hat to the power of y, times 1 - y hat to the power of 1 - y. And so that simplifies to y log y hat + (1 - y) times log (1 - y hat), right? And so this is actually the negative of the loss function that we had defined previously. And there's a negative sign there because usually if you're training a learning algorithm, you want to make probabilities large, whereas in logistic regression we're expressing this as wanting to minimize the loss function. So minimizing the loss corresponds to maximizing the log of the probability. So this is what the loss function on a single example looks like. How about the cost function, the overall cost function on the entire training set of m examples? Let's figure that out. So, consider the probability of all the labels in the training set, writing this a little bit informally. If you assume that the training examples are drawn independently, or drawn IID, independently and identically distributed, then the probability of all the examples is the product of the probabilities: the product from i = 1 through m of p(y(i) | x(i)). And so if you want to carry out maximum likelihood estimation, right, then you want to find the parameters that maximize the chance of your observations in the training set. But maximizing this is the same as maximizing the log, so we just put logs on both sides. So log of the probability of the labels in the training set is equal to, since the log of a product is the sum of the logs, the sum from i = 1 through m of log p(y(i) | x(i)). And we figured out on the previous slide that each of these terms is negative L of y hat(i), y(i).
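A quick numerical check of the argument so far, as a sketch: the compact formula reproduces the two cases, and the log-likelihood of a few examples equals the negative of the summed loss. The function names `p_y_given_x` and `loss` and all the numbers are assumptions chosen for illustration.

```python
import numpy as np

def p_y_given_x(y_hat, y):
    # Compact form: y_hat^y * (1 - y_hat)^(1 - y)
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

def loss(y_hat, y):
    # Logistic regression loss on a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.8                              # illustrative prediction
print(p_y_given_x(y_hat, 1))             # matches y_hat
print(p_y_given_x(y_hat, 0))             # matches 1 - y_hat

y_hats = np.array([0.9, 0.2, 0.7])       # illustrative predictions
ys = np.array([1, 0, 1])                 # illustrative labels
log_likelihood = np.sum(np.log(p_y_given_x(y_hats, ys)))
total_loss = np.sum(loss(y_hats, ys))
print(np.isclose(log_likelihood, -total_loss))
```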
5:48
And so in statistics, there's a principle called the principle of maximum likelihood estimation, which just means to choose the parameters that maximize this thing. Or in other words, that maximize this thing: the negative sum from i = 1 through m of L(y hat(i), y(i)), where we just move the negative sign outside the summation. So this justifies the cost we had for logistic regression, which is J(w,b) of this. And because we now want to minimize the cost instead of maximizing the likelihood, we get rid of the minus sign. And then finally for convenience, to make sure that our quantities are better scaled, we just add a 1 over m extra scaling factor there. So to summarize, by minimizing this cost function J(w,b) we're really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples were IID, or independently and identically distributed. So thank you for watching this video, even though it is optional. I hope this gives you a sense of why we use the cost function we do for logistic regression. And with that, I hope you go on to the programming exercises and the quiz questions of this week. And best of luck with both the quizzes and the programming exercise.
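For reference, the chain of steps in this video can be written out compactly. This is a restatement of the spoken derivation in standard notation; the symbol \mathcal{L} for the per-example loss is a notational choice made here.

```latex
\begin{align*}
p(y \mid x) &= \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y} \\
\log p(y \mid x) &= y\log\hat{y} + (1-y)\log(1-\hat{y}) = -\mathcal{L}(\hat{y}, y) \\
\log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}\big)
  &= \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}\big)
   = -\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) \\
J(w, b) &= \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
\end{align*}
```

Maximizing the third line over the parameters is the same as minimizing J(w, b), since the 1/m factor and the sign flip do not change where the optimum lies.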
(Optional) Heroes of Deep Learning - Pieter Abbeel interview - 16m
0:02
So, thanks a lot, Pieter, for joining me today. I think a lot of people know you as a well-known machine learning and deep learning and robotics researcher. I'd like to have people hear a bit about your story. How did you end up doing the work that you do? That's a good question and actually if you would have asked me as a 14-year-old, what I was aspiring to do, it probably would not have been this. In fact, at the time, I thought being a professional basketball player would be the right way to go. I don't think I was able to achieve it. I feel the machine learning lucked out, that the basketball thing didn't work out. Yes, that didn't work out. It was a lot of fun playing basketball but it didn't work out to try to make it into a career. So, what I really liked in school was physics and math. And so, from there, it seemed pretty natural to study engineering which is applying physics and math in the real world. And actually then, after my undergrad in electrical engineering, I actually wasn't so sure what to do because, literally, anything engineering seemed interesting to me. Understanding how anything works seems interesting. Trying to build anything is interesting. And in some sense, artificial intelligence won out because it seemed like it could somehow help all disciplines in some way. And also, it seemed somehow a little more at the core of everything. You think about how a machine can think, then maybe that's more the core of everything else than picking any specific discipline. I've been saying AI is the new electricity, sounds like the 14-year-old version of you; had an earlier version of that even. You know, in the past few years you've done a lot of work in deep reinforcement learning. What's happening? Why is deep reinforcement learning suddenly taking off? Before I worked in deep reinforcement learning, I worked a lot in reinforcement learning; actually with you and Durant at Stanford, of course. 
And so, we worked on autonomous helicopter flight, then later at Berkeley with some of my students who worked on getting a robot to learn to fold laundry. And kind of what characterized the work was a combination of learning that enabled things that would not be possible without learning, but also a lot of domain expertise in combination with the learning to get this to work. And it was very interesting because you needed domain expertise, which was fun to acquire but, at the same time, was very time-consuming: for every new application you wanted to succeed at, you needed domain expertise plus machine learning expertise. And for me it was in 2012, with the ImageNet breakthrough results from Geoff Hinton's group in Toronto, AlexNet showing that supervised learning, all of a sudden, could be done with far less engineering for the domain at hand. There was very little engineering for vision in AlexNet. It made me think we really should revisit reinforcement learning under the same kind of viewpoint and see if we can get the deep version of reinforcement learning to work and do equally interesting things as had just happened in supervised learning. It sounds like you saw earlier than most people the potential of deep reinforcement learning. So now looking into the future, what do you see next? What are your predictions for the next several years to come in deep reinforcement learning? So, I think what's interesting about deep reinforcement learning is that, in some sense, there are many more questions than in supervised learning. In supervised learning, it's about learning an input-output mapping. In reinforcement learning there is the notion of: Where does the data even come from? So that's the exploration problem. When you have data, how do you do credit assignment? How do you understand what actions you took early on got you the reward later? And then, there are issues of safety.
When you have a system autonomously collecting data, it's actually rather dangerous in most situations. Imagine a self-driving car company that says, we're just going to run deep reinforcement learning. It's pretty likely that car would get into a lot of accidents before it does anything useful. You needed negative examples of that, right? You do need some negative examples somehow, yes; and positive ones, hopefully. So, I think there is still a lot of challenges in deep reinforcement learning in terms of working out some of the specifics of how to get these things to work. So, the deep part is the representation, but then the reinforcement learning itself still has a lot of questions. And what I feel is that, with the advances in deep learning, somehow one part of the puzzle in reinforcement learning has been largely addressed, which is the representation part. So, if there is a pattern we can probably represent it with a deep network and capture that pattern. And how to tease apart the pattern is still a big challenge in reinforcement learning. So I think big challenges are, how to get systems to reason over long time horizons. So right now, a lot of the successes in deep reinforcement learning are a very short horizon. There are problems where, if you act well over a five second horizon, you act well over the entire problem. And so a five second scale is something very different from a day long scale, or the ability to live a life as a robot or some software agent. So, I think there's a lot of challenges there. I think safety has a lot of challenges in terms of, how do you learn safely and also how do you keep learning once you're already pretty good? So, to give an example again that a lot of people would be familiar with, self-driving cars, for a self-driving car to be better than a human driver, should human drivers maybe get into bad accidents every three million miles or something. 
And so, that takes a long time to see the negative data, once you're as good as a human driver. But you want your self-driving car to be better than a human driver. And so, at that point the data collection becomes really really difficult, to get that interesting data that makes your system improve. So, there are a lot of challenges related to exploration that tie into that. But one of the things I'm actually most excited about right now is seeing if we can actually take a step back and also learn the reinforcement learning algorithm. So, reinforcement learning is very complex, credit assignment is very complex, exploration is very complex. And so maybe, just like how deep learning for supervised learning was able to replace a lot of domain expertise, maybe we can have programs that are learned, that are reinforcement learning programs that do all this, instead of us designing the details. Learning the reward function or learning the whole program? So, this would be learning the entire reinforcement learning program. So, it would be, imagine, you have a reinforcement learning program, whatever it is, and you throw it at some problem and then you see how long it takes to learn. And then you say, well, that took a while. Now, let another program modify this reinforcement learning program. After the modification, see how fast it learns. If it learns more quickly, that was a good modification and maybe keep it and improve from there. Well, I see, right. Yes, and pace the direction. I think it has a lot to do with, maybe, the amount of compute that's becoming available. So, this would be running reinforcement learning in the inner loop. For us right now, we run reinforcement learning as the final thing. And so, the more compute we get, the more it becomes possible to maybe run something like reinforcement learning in the inner loop of a bigger algorithm. Starting from the 14-year-old, you've worked in AI for some 20 plus years now.
So, tell me a bit about how your understanding of AI has evolved over this time. When I started looking at AI, it's very interesting because it really coincided with coming to Stanford to do my master's degree there, and there were some icons there like John McCarthy who I got to talk with, but who had a very different approach, in the year 2000, from what most people were doing at the time. And also talking with Daphne Koller. And I think a lot of my initial thinking of AI was shaped by Daphne's thinking. Her AI class, her probabilistic graphical models class, and kind of really being intrigued by how simply a distribution over many random variables, and then being able to condition on some subset of variables and draw conclusions about others, could actually give you so much, if you can somehow make it computationally tractable, which was definitely the challenge, to make it computable. And then from there, when I started my Ph.D. And you arrived at Stanford, and I think you gave me a really good reality check, that that's not the right metric to evaluate your work by, and to really try to see the connection from what you're working on to what impact it can really have, what change it can make, rather than what's the math that happened to be in your work. Right. That's amazing. I did not realize, I've forgotten that. Yes, it's actually one of the things I cite most often when people ask: if you're going to cite only one thing that has stuck with you from Andrew's advice, it's making sure you can see the connection to where it's actually going to do something. You've had and you're continuing to have an amazing career in AI. So, for some of the people listening to you on video now, if they want to also enter or pursue a career in AI, what advice do you have for them? I think it's a really good time to get into artificial intelligence.
If you look at the demand for people, it's so high, there are so many job opportunities, so many things you can do, research-wise, building new companies and so forth. So, I'd say yes, it's definitely a smart decision in terms of actually getting going. A lot of it, you can self-study, whether you're in school or not. There are a lot of online courses, for instance, your machine learning course; there is also, for example, Andrej Karpathy's deep learning course, which has videos online, which is a great way to get started; and Berkeley has a deep reinforcement learning course which has all of the lectures online. So, those are all good places to get started. I think a big part of what's important is to make sure you try things yourself. So, not just read things or watch videos but try things out. With frameworks like TensorFlow, Chainer, Theano, PyTorch and so forth, I mean whatever is your favorite, it's very easy to get going and get something up and running very quickly. To get to practice yourself, right? With implementing and seeing what works and what doesn't work. So, this past week there was an article in Mashable about a 16-year-old in the United Kingdom, who is one of the leaders in Kaggle competitions. And it just said, he just went out and learned things, found things online, learned everything himself and never actually took any formal course per se. And there is a 16-year-old just being very competitive in Kaggle competitions, so it's definitely possible. We live in good times. If people want to learn. Absolutely. One question I bet you get asked sometimes is, if someone wants to enter AI, machine learning and deep learning, should they apply for a Ph.D. program or should they get a job with a big company? I think a lot of it has to do with maybe how much mentoring you can get. So, in a Ph.D. program, you're sort of guaranteed that the job of the professor, who is your adviser, is to look out for you.
Try to do everything they can to, kind of, shape you, help you become stronger at whatever you want to do, for example, AI. And so, there is a very clear dedicated person, sometimes you have two advisers. And that's literally their job and that's why they are professors; most of what they like about being professors often is helping shape students to become more capable at things. Now, it doesn't mean it's not possible at companies, and many companies have really good mentors and have people who love to help educate people who come in and strengthen them, and so forth. It's just, it might not be as much of a guarantee and a given, compared to actually enrolling in a Ph.D. program, where the crux of the program is that you're going to learn and somebody is there to help you learn. So it really depends on the company and depends on the Ph.D. program. Absolutely, yes. But I think it is key that you can learn a lot on your own. But I think you can learn a lot faster if you have somebody who's more experienced, who is actually taking it up as their responsibility to spend time with you and help accelerate your progress. So, you've been one of the most visible leaders in deep reinforcement learning. So, what are the things that deep reinforcement learning is already working really well at? I think, if you look at some deep reinforcement learning successes, it's very, very intriguing. For example, learning to play Atari games from pixels, processing these pixels, which are just numbers that are being processed somehow and turned into joystick actions. Then, for example, some of the work we did at Berkeley was, we have a simulated robot inventing walking, and the reward that it's given is as simple as: the further you go north the better, and the less hard you impact with the ground the better. And somehow it decides that walking slash running is the thing to invent, whereas nobody showed it what walking is or running is.
Or a robot playing with children's toys and learning to kind of put them together, put a block into a matching opening, and so forth. And so, I think it's really interesting that in all of these it's possible to learn from raw sensory inputs all the way to raw controls, for example, torques at the motors. But at the same time. So it is very interesting that you can have a single algorithm. For example, you know, trust region policy optimization, and you can have a robot learn to run, can have a robot learn to stand up, can have, instead of a two-legged robot, now you're swapping in a four-legged robot. You run the same reinforcement learning algorithm and it still learns to run. And so, there is no change in the reinforcement learning algorithm. It's very, very general. Same for the Atari games. DQN was the same DQN for every one of the games. But then, when it actually starts hitting the frontiers of what's not yet possible as well, it's nice that it learns from scratch for each one of these tasks, but it would be even nicer if it could reuse things it's learned in the past to learn even more quickly for the next task. And that's something that's still on the frontier and not yet possible. It always starts from scratch, essentially. How quickly do you think you'll see deep reinforcement learning get deployed in the robots around us, the robots that are getting deployed in the world today? I think in practice the realistic scenario is one where it starts with supervised learning, behavioral cloning; humans do the work. And I think a lot of businesses will be built that way, where it's a human behind the scenes doing a lot of the work. Imagine a Facebook Messenger assistant. An assistant like that could be built with a human behind the curtains doing a lot of the work; machine learning matches up with what the human does and starts making suggestions to the human, so the human has a small number of options that they can just click and select.
And then over time, as it gets pretty good, you start infusing some reinforcement learning, where you give it actual objectives, not just matching the human behind the curtains but giving objectives of achievement like, maybe, how fast were these two people able to plan their meeting? Or how fast were they able to book their flight? Or things like that. How long did it take? How happy were they with it? But it would probably have to be bootstrapped off a lot of behavioral cloning of humans showing how this could be done. So it sounds like behavioral cloning is just supervised learning to mimic whatever the person is doing, and then gradually later on, the reinforcement learning to have it think about longer time horizons? Is that a fair summary? I'd say so, yes. Just because straight-up reinforcement learning from scratch is really fun to watch. It's super intriguing, and there are very few things more fun to watch than a reinforcement learning robot starting from nothing and inventing things. But it's just time consuming and it's not always safe. Thank you very much. That was fascinating. I'm really glad we had the chance to chat. Well, Andrew, thank you for having me. Very much appreciate it.
Shallow neural networks
Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.
Neural Networks Overview - 4m
0:00
Welcome back. This week, you'll learn to implement a neural network. Before diving into the technical details, I want in this video to give you a quick overview of what you'll be seeing in this week's videos. So, if you don't follow all the details in this video, don't worry about it, we'll delve into the technical details in the next few videos. But for now, let's give a quick overview of how you implement a neural network. Last week, we talked about logistic regression, and we saw how this model corresponds to the following computation graph, where you input the features x and parameters w and b, which allows you to compute z, which is then used to compute a, and we were using a interchangeably with the output y hat, and then you can compute the loss function, L. A neural network looks like this. As I'd already previously alluded, you can form a neural network by stacking together a lot of little sigmoid units. Whereas previously, this node corresponded to two steps of calculation, the first to compute the z value and the second to compute this a value, in this neural network, this stack of nodes will correspond to a z-like calculation like this, as well as an a-like calculation like that. Then, that node will correspond to another z-like and another a-like calculation. So the notation, which we will introduce later, will look like this. First, we'll input the features x, together with some parameters W and b, and this will allow you to compute z^[1]. So, the new notation that we'll introduce is that we'll use superscript square bracket one to refer to quantities associated with this stack of nodes; it's called a layer. Then later, we'll use superscript square bracket two to refer to quantities associated with that node. That's called another layer of the neural network. The superscript square brackets, like we have here, are not to be confused with the superscript round brackets which we use to refer to individual training examples.
So, whereas x superscript round bracket i refers to the ith training example, superscript square bracket one and two refer to these different layers, layer one and layer two in this neural network. But going on, after computing z^[1], similar to logistic regression, there'll be a computation to compute a^[1], and that's just sigmoid of z^[1], and then you compute z^[2] using another linear equation and then compute a^[2]. a^[2] is the final output of the neural network and will also be used interchangeably with y hat. So, I know that was a lot of details, but the key intuition to take away is that whereas for logistic regression we had this z calculation followed by an a calculation, in this neural network we just do it multiple times: a z followed by an a calculation, and a z followed by an a calculation, and then you finally compute the loss at the end. You remember that for logistic regression, we had this backward calculation in order to compute derivatives, where you're computing your da, dz and so on. So, in the same way, a neural network will end up doing a backward calculation that looks like this, in which you end up computing da^[2], dz^[2], which allows you to compute dW^[2], db^[2], and so on. This is the right-to-left backward calculation that I'm denoting with the red arrows. So, that gives you a quick overview of what a neural network looks like. It's basically taking logistic regression and repeating it twice. I know there was a lot of new notation and a lot of new details; don't worry if you didn't follow everything, we'll go into the details more slowly in the next few videos. So, let's go on to the next video. We'll start to talk about the neural network representation.
Neural Network Representation - 5m
0:00
You've seen me draw a few pictures of neural networks. In this video, we'll talk about exactly what those pictures mean. In other words, exactly what those neural networks that we've been drawing represent. And we'll start by focusing on the case of neural networks with what's called a single hidden layer. Here's a picture of a neural network. Let's give different parts of this picture some names. We have the input features, x1, x2, x3 stacked up vertically. And this is called the input layer of the neural network. So maybe not surprisingly, this contains the inputs to the neural network. Then there's another layer of circles. And this is called a hidden layer of the neural network. I'll come back in a second to say what the word hidden means. But the final layer here is formed by, in this case, just one node. And this single-node layer is called the output layer, and is responsible for generating the predicted value y hat. In a neural network that you train with supervised learning, the training set contains values of the inputs x as well as the target outputs y. So the term hidden layer refers to the fact that in the training set, the true values for these nodes in the middle are not observed. That is, you don't see what they should be in the training set. You see what the inputs are. You see what the output should be. But the things in the hidden layer are not seen in the training set. So that kind of explains the name hidden layer; just because you don't see it in the training set. Let's introduce a bit more notation. Whereas previously we were using the vector x to denote the input features, an alternative notation for the values of the input features will be a superscript square bracket 0, a^[0]. And the term a also stands for activations, and it refers to the values that different layers of the neural network are passing on to the subsequent layers.
So the input layer passes on the value x to the hidden layer, so we're going to call the activations of the input layer a superscript square bracket 0, a^[0]. The next layer, the hidden layer, will in turn generate some set of activations, which I'm going to write as a superscript square bracket 1, a^[1]. So in particular, this first unit or this first node will generate a value a^[1]_1, that is, a superscript square bracket 1 subscript 1. The second node will generate a value a^[1]_2, now with a subscript 2, and so on. And so, a^[1] is a four-dimensional vector, or if you want, in Python it's a 4 by 1 matrix, or a four-dimensional column vector, which looks like this. And it's four-dimensional because in this case we have four nodes, or four units, or four hidden units in this hidden layer. And then finally, the output layer generates some value a^[2], which is just a real number. And so y hat is going to take on the value of a^[2]. So this is analogous to how in logistic regression we have y hat equals a, and in logistic regression we only had that one output layer, so we didn't use the superscript square brackets. But with our neural network, we're now going to use the superscript square bracket to explicitly indicate which layer it came from. One funny thing about notational conventions in neural networks is that this network that you've seen here is called a two-layer neural network. And the reason is that when we count layers in neural networks, we don't count the input layer. So the hidden layer is layer one and the output layer is layer two. In our notational convention, we're calling the input layer layer zero, so technically maybe there are three layers in this neural network. Because there's the input layer, the hidden layer, and the output layer. But in conventional usage, if you read research papers and elsewhere in the course, you'll see people refer to this particular neural network as a two-layer neural network, because we don't count the input layer as an official layer.
Finally, something that we'll get to later is that the hidden layer and the output layer will have parameters associated with them. So the hidden layer will have associated with it parameters W and b. And I'm going to write superscript square bracket 1, W^[1] and b^[1], to indicate that these are parameters associated with layer one, the hidden layer. We'll see later that W^[1] will be a 4 by 3 matrix and b^[1] will be a 4 by 1 vector in this example. The first coordinate, four, comes from the fact that we have four nodes, or four hidden units, in the layer, and three comes from the fact that we have three input features. We'll talk later about the dimensions of these matrices. And it might make more sense at that time. Similarly, the output layer also has associated with it parameters W superscript square bracket 2 and b superscript square bracket 2, W^[2] and b^[2]. And it turns out the dimensions of these are 1 by 4 and 1 by 1. And this 1 by 4 is because the hidden layer has four hidden units and the output layer has just one unit. But we will go over the dimensions of these matrices and vectors in a later video. So you've just seen what a two-layer neural network looks like. That is, a neural network with one hidden layer. In the next video, let's go deeper into exactly what this neural network is computing. That is, how this neural network inputs x and goes all the way to computing its output y hat.
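To make the dimensions above concrete, here is a minimal NumPy sketch that allocates the four parameter arrays for this example network (three input features, four hidden units, one output unit). The variable names and the small random initial values are assumptions for illustration only; the video has not yet said how the parameters are initialized.

```python
import numpy as np

# Layer sizes from the example: 3 input features, 4 hidden units, 1 output unit.
n_x, n_h, n_y = 3, 4, 1

# Assumption: small random weights and zero biases, purely for illustration.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((n_h, n_x)) * 0.01  # W^[1]: 4 by 3
b1 = np.zeros((n_h, 1))                      # b^[1]: 4 by 1
W2 = rng.standard_normal((n_y, n_h)) * 0.01  # W^[2]: 1 by 4
b2 = np.zeros((n_y, 1))                      # b^[2]: 1 by 1

# Collect the shapes so they can be compared with the dimensions in the text.
shapes = {name: p.shape for name, p in
          [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]}
print(shapes)
```

Each weight matrix has one row per unit in its own layer and one column per unit in the layer feeding into it, which is where the 4 by 3 and 1 by 4 come from.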
Computing a Neural Network's Output - 9m
0:00
In the last video, you saw what a single hidden layer neural network looks like. In this video, let's go through the details of exactly how this neural network computes its outputs. What you'll see is that it's like logistic regression, but repeated a lot of times. Let's take a look. So, this is what a two-layer neural network looks like. Let's go more deeply into exactly what this neural network computes. Now, we've said before that the circle in logistic regression really represents two steps of computation. First, you compute z as follows, and second, you compute the activation as a sigmoid function of z. So, a neural network just does this a lot more times. Let's start by focusing on just one of the nodes in the hidden layer. Let's look at the first node in the hidden layer. So, I've grayed out the other nodes for now. So, similar to logistic regression on the left, this node in the hidden layer does two steps of computation. In the first step, which you can think of as the left half of this node, it computes z^[1]_1 = w^[1]_1 transpose x + b^[1]_1, and the notation we'll use is, these are all quantities associated with the first hidden layer. So, that's why we have a bunch of square brackets there. This is the first node in the hidden layer. So, that's why we have the subscript one over there. So first, it does that, and then in the second step, it computes a^[1]_1 equals sigmoid of z^[1]_1, like so. So, for both z and a, the notational convention is a^[l]_i, where the l here in superscript square brackets refers to the layer number, and the i subscript here refers to the node in that layer. So, the node we're looking at is in layer one, that is the hidden layer, and is node one. So, that's why the superscript and subscript were both one. So, that little circle, that first node in the neural network, represents carrying out these two steps of computation. Now, let's look at the second node in the neural network, or the second node in the hidden layer of the neural network.
Similar to the logistic regression unit on the left, this little circle represents two steps of computation. The first step is it computes z^[1]_2, this is still layer one but now the second node, which equals w^[1]_2 transpose x plus b^[1]_2, and then a^[1]_2 equals sigmoid of z^[1]_2. Again, feel free to pause the video if you want, but you can double-check that the superscript and subscript notation is consistent with what we have written here above in purple. So, we've talked through the first two hidden units in the neural network; hidden units three and four also represent some computations. So now, let me take this pair of equations, and this pair of equations, and let's copy them to the next slide. So, here's our neural network, and here's the first, and here's the second pair of equations that we've worked out previously for the first and the second hidden units. If you then go through and write out the corresponding equations for the third and fourth hidden units, you get the following. So, just to make sure this notation is clear, this is the vector w^[1]_1, and this is that vector transposed, times x. So, that's what the superscript T there represents. It's a vector transpose. Now, as you might have guessed, if you're actually implementing a neural network, doing this with a for loop seems really inefficient. So, what we're going to do is take these four equations and vectorize. So, we're going to start by showing how to compute z as a vector; it turns out you could do it as follows. Let me take these w's and stack them into a matrix: then you have w^[1]_1 transpose, so that's a row vector, since this column vector transposed gives you a row vector, then w^[1]_2 transpose, w^[1]_3 transpose, w^[1]_4 transpose. So, by stacking those four w vectors together, you end up with a matrix. So, another way to think of this is that we have four logistic regression units there, and each of the logistic regression units has a corresponding parameter vector, w.
By stacking those four vectors together, you end up with this four by three matrix. So, if you then take this matrix and multiply it by your input features x1, x2, x3, you end up with, by how matrix multiplication works, w^[1]_1 transpose x, w^[1]_2 transpose x, w^[1]_3 transpose x, w^[1]_4 transpose x. Then, let's not forget the b's. So, we now add to this a vector b^[1]_1, b^[1]_2, b^[1]_3, b^[1]_4. So, that's basically this, then this is b^[1]_1, b^[1]_2, b^[1]_3, b^[1]_4. So, you see that each of the four rows of this outcome corresponds exactly to each of these four rows, each of these four quantities that we had above. So, in other words, we've just shown that this thing is therefore equal to z^[1]_1, z^[1]_2, z^[1]_3, z^[1]_4, as defined here. Maybe not surprisingly, we're going to call this whole thing the vector z^[1], which is obtained by stacking up these individual z's into a column vector. When we're vectorizing, one of the rules of thumb that might help you navigate this is that when we have different nodes in the layer, we'll stack them vertically. So, that's why we have z^[1]_1 through z^[1]_4; those correspond to four different nodes in the hidden layer, and so we stacked these four numbers vertically to form the vector z^[1]. To use one more piece of notation, this four by three matrix here, which we obtained by stacking the lowercase w^[1]_1, w^[1]_2, and so on, we're going to call this matrix capital W^[1]. Similarly, this vector, we're going to call b superscript square bracket 1, b^[1]. So, this is a four by one vector. So now that we've computed z using this vector-matrix notation, the last thing we need to do is also compute these values of a. So, it probably won't surprise you to see that we're going to define a^[1] as just stacking together those activation values, a^[1]_1 through a^[1]_4. So, just take these four values and stack them together in a vector called a^[1].
This is going to be sigmoid of z^[1], where this is now an implementation of the sigmoid function that takes in the four elements of z and applies the sigmoid function element-wise to it. So, just to recap, we figured out that z^[1] is equal to W^[1] times the vector x plus the vector b^[1], and a^[1] is sigmoid of z^[1]. Let's just copy this to the next slide. What we see is that for the first layer of the neural network, given an input x, we have that z^[1] is equal to W^[1] times x plus b^[1], and a^[1] is sigmoid of z^[1]. The dimensions of this are: four by one equals a four by three matrix times a three by one vector, plus a four by one vector b^[1], and a^[1] is four by one, the same dimension as z^[1]. Remember that we said x is equal to a^[0], just as y hat is also equal to a^[2]. If you want, you can actually take this x and replace it with a^[0], since a^[0] is, if you want, an alias for the vector of input features x. Now, through a similar derivation, you can figure out that the representation for the next layer can also be written similarly, where the output layer has associated with it the parameters W^[2] and b^[2]. So, W^[2] in this case is going to be a one by four matrix, and b^[2] is just a real number, a one by one. So, z^[2] is going to be a real number, which we'll write as a one by one matrix. It's going to be a one by four thing times a^[1], which was four by one, plus b^[2], which is one by one, so this gives you just a real number. If you think of this last output unit as just being analogous to logistic regression, which had parameters w and b, w really plays an analogous role to W^[2] transpose, or W^[2] is really w transpose, and b is equal to b^[2]. If you were to cover up the left of this network and ignore all that for now, then this last output unit is a lot like logistic regression, except that instead of writing the parameters as w and b, we're writing them as W^[2] and b^[2], with dimensions one by four and one by one.
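As a check on the stacking argument, the following NumPy sketch computes the hidden layer's z values once node by node and once with the stacked matrix W^[1], and confirms the two agree. The random values and the names w_vectors and b_values are assumptions made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))                               # one input, 3 by 1
w_vectors = [rng.standard_normal((3, 1)) for _ in range(4)]   # w^[1]_1 .. w^[1]_4
b_values = [0.1, -0.2, 0.3, 0.0]                              # b^[1]_1 .. b^[1]_4

# Node by node: z^[1]_i = w^[1]_i transpose x + b^[1]_i
z_loop = np.zeros((4, 1))
for i in range(4):
    z_loop[i, 0] = (w_vectors[i].T @ x).item() + b_values[i]

# Vectorized: stack the transposed w vectors as the rows of W^[1].
W1 = np.vstack([w.T for w in w_vectors])   # 4 by 3
b1 = np.array(b_values).reshape(4, 1)      # 4 by 1
z1 = W1 @ x + b1                           # 4 by 1
a1 = sigmoid(z1)                           # 4 by 1, sigmoid applied element-wise

print(np.allclose(z_loop, z1))
```

Row i of W1 is the transposed parameter vector of hidden unit i, so one matrix product replaces the four separate dot products.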
So, just to recap: for logistic regression, to implement the output or to implement prediction, you compute z equals w transpose x plus b, and y hat equals a equals sigmoid of z. When you have a neural network with one hidden layer, what you need to implement to compute this output is just these four equations. You can think of this as a vectorized implementation of computing the output of, first, these four logistic regression units in the hidden layer, that's what this does, and then this logistic regression unit in the output layer, which is what this does. I hope this description made sense, but the takeaway is that to compute the output of this neural network, all you need is those four lines of code. So now, you've seen how, given a single input feature vector x, you can with four lines of code compute the output of this neural network. Similar to what we did for logistic regression, we'll also want to vectorize across multiple training examples. We'll see that by stacking up training examples in different columns of a matrix, with just a slight modification to this, you'll also, similar to what you saw in logistic regression, be able to compute the output of this neural network not just on one example at a time, but on, say, your entire training set at a time. So, let's see the details of that in the next video.
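The four equations recapped above translate almost line for line into NumPy. This is a minimal sketch for a single example, assuming randomly chosen parameters; the function name forward_single is hypothetical and not something the course defines.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_single(x, W1, b1, W2, b2):
    """Forward pass for one example x of shape (n_x, 1)."""
    z1 = W1 @ x + b1      # hidden layer linear step, 4 by 1
    a1 = sigmoid(z1)      # hidden layer activations, 4 by 1
    z2 = W2 @ a1 + b2     # output layer linear step, 1 by 1
    a2 = sigmoid(z2)      # y hat, 1 by 1
    return a2

# Assumed example values: 3 input features, 4 hidden units, 1 output unit.
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
x = rng.standard_normal((3, 1))

y_hat = forward_single(x, W1, b1, W2, b2)
print(y_hat.shape)
```

The output is kept as a 1 by 1 array rather than a Python float, matching the way the video treats z^[2] and a^[2] as one by one matrices.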
Vectorizing across multiple examples - 9m
0:00
In the last video, you saw how to compute the prediction of a neural network, given a single training example. In this video, you'll see how to vectorize across multiple training examples. And the outcome will be quite similar to what you saw for logistic regression. By stacking up different training examples in different columns of a matrix, you'll be able to take the equations you had from the previous video and, with very little modification, change them to make the neural network compute the outputs on all the examples pretty much at the same time. So let's see the details of how to do that. These were the four equations we had from the previous video for how you compute z^[1], a^[1], z^[2] and a^[2]. And they tell you how, given an input feature vector x, you can use them to generate a^[2] = y hat for a single training example.
0:54
Now if you have m training examples, you need to repeat this process for, say, the first training example: x superscript round bracket 1, x^(1), to compute y hat^(1), the prediction on your first training example. Then x^(2), use that to generate the prediction y hat^(2). And so on down to x^(m) to generate a prediction y hat^(m). And so, in the activation notation as well, I'm going to write this as a^[2](1), and this as a^[2](2), and a^[2](m). So in this notation a^[2](i), the round bracket i refers to training example i, and the square bracket 2 refers to layer 2, okay.
1:58
So that's how the square bracket and the round bracket indices work.
2:04
And so this suggests that if you have an unvectorized implementation and want to compute the predictions on all your training examples, you need to do for i = 1 to m, then basically implement these four equations, right? You need to compute z^[1](i) = W^[1] x^(i) + b^[1], a^[1](i) = sigmoid of z^[1](i), z^[2](i) = W^[2] a^[1](i) + b^[2], and a^[2](i) = sigmoid of z^[2](i). So it's basically these four equations on top, adding the superscript round bracket i to all the variables that depend on the training example. So you add this superscript round bracket i to x, z and a if you want to compute all the outputs on your m training examples. What we'd like to do is vectorize this whole computation, so as to get rid of this for loop. And by the way, in case it seems like I'm getting into a lot of nitty-gritty linear algebra, it turns out that being able to implement this correctly is important in the deep learning era. And we actually chose the notation very carefully for this course to make these vectorization steps as easy as possible. So I hope that going through this nitty-gritty will actually help you to more quickly get correct implementations of these algorithms working.
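For reference, the unvectorized version described here might look like the sketch below, with an explicit for loop over the m examples. Storing the examples as the columns of an array X, the function name forward_loop, and the random test values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(X, W1, b1, W2, b2):
    """Unvectorized forward pass: one training example per loop iteration."""
    m = X.shape[1]
    A2 = np.zeros((W2.shape[0], m))
    for i in range(m):
        x_i = X[:, i:i + 1]          # x^(i), kept as an (n_x, 1) column
        z1_i = W1 @ x_i + b1         # z^[1](i)
        a1_i = sigmoid(z1_i)         # a^[1](i)
        z2_i = W2 @ a1_i + b2        # z^[2](i)
        a2_i = sigmoid(z2_i)         # a^[2](i) = y hat^(i)
        A2[:, i:i + 1] = a2_i
    return A2

# Assumed example values: 3 features, 4 hidden units, m = 5 examples.
rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((3, 5))

predictions = forward_loop(X, W1, b1, W2, b2)
print(predictions.shape)
```

Only x, z and a carry the round-bracket index i inside the loop; the parameters W1, b1, W2 and b2 are shared across all examples.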
3:51
All right so let me just copy this whole block of code to the next slide and then we'll see how to vectorize this.
3:59
So here's what we have from the previous slide, with the for loop going over our m training examples. So recall that we defined the matrix X to be equal to our training examples stacked up in columns, like so. So take the training examples and stack them in columns. So this becomes an n, or maybe n_x, by m dimensional matrix.
4:29
I'm just going to give away the punch line and tell you what you need to implement in order to have a vectorized implementation of this for loop. It turns out what you need to do is compute Z^[1] = W^[1] X + b^[1], A^[1] = sigmoid of Z^[1]. Then Z^[2] = W^[2] A^[1] + b^[2], and then A^[2] = sigmoid of Z^[2]. So if you want, the analogy is that we went from the lowercase vectors x to the capital X matrix by stacking up the lowercase x's in different columns. If you do the same thing for the z's, so for example, if you take z^[1](1), z^[1](2), and so on, and these are all column vectors, up to z^[1](m), right, so that's this first quantity, all m of them, and stack them in columns, then that just gives you the matrix Z^[1]. And similarly, you look at, say, this quantity and take a^[1](1), a^[1](2), and so on up to a^[1](m), and stack them up in columns. Then, just as we went from lowercase x to capital X, and lowercase z to capital Z, this goes from the lowercase a's, which are vectors, to this capital A^[1] that's over there, and similarly for Z^[2] and A^[2]. Right, they're also obtained by taking these vectors and stacking them horizontally, and taking these vectors and stacking them horizontally, in order to get Z^[2] and A^[2]. One property of this notation that might help you to think about it is that for these matrices, say Z and A, horizontally we're going to index across training examples. So that's why the horizontal index corresponds to different training examples; when you sweep from left to right, you're scanning through the training set. And vertically, this vertical index corresponds to different nodes in the neural network. So for example, this value at the top-left corner of the matrix corresponds to the activation of the first hidden unit on the first training example. One value down corresponds to the activation of the second hidden unit on the first training example, then the third hidden unit on the first training example, and so on.
So as you scan down, this is you indexing into the hidden unit number.
7:39
Whereas if you move horizontally, then you're going from the first hidden unit on the first training example to the first hidden unit on the second training example, the third training example, and so on, until this node here corresponds to the activation of the first hidden unit on the final training example, the mth training example.
8:00
Okay, so horizontally the matrix A goes over different training examples.
8:10
And vertically the different indices in the matrix A correspond to different hidden units.
8:22
And a similar intuition holds true for the matrix Z, as well as for X, where horizontally corresponds to different training examples and vertically corresponds to different input features, which are really different nodes of the input layer of the neural network.
8:42
So with these equations, you now know how to implement your neural network with vectorization, that is, vectorization across multiple examples. In the next video, I want to show you a bit more justification for why this is a correct implementation of this type of vectorization. It turns out the justification is similar to what you had seen [INAUDIBLE]. Let's go on to the next video.
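To make the four vectorized equations concrete, here is a minimal NumPy sketch. The function name forward_vectorized, the layer sizes, and the random toy data are assumptions chosen for illustration; they are not from the lecture. Each column of X is one training example, so each column of A1 and A2 holds the activations for that example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid, 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_vectorized(X, W1, b1, W2, b2):
    """Forward propagation for all m examples at once (hypothetical helper)."""
    Z1 = W1 @ X + b1    # shape (n1, m); b1 is broadcast across the m columns
    A1 = sigmoid(Z1)    # shape (n1, m)
    Z2 = W2 @ A1 + b2   # shape (n2, m)
    A2 = sigmoid(Z2)    # shape (n2, m)
    return Z1, A1, Z2, A2

# Toy sizes (assumed): 3 input features, 4 hidden units, 1 output, 5 examples.
rng = np.random.default_rng(0)
n_x, n1, n2, m = 3, 4, 1, 5
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n1, n_x)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

Z1, A1, Z2, A2 = forward_vectorized(X, W1, b1, W2, b2)
print(A1.shape, A2.shape)
```

Reading the shapes: rows index units in a layer, columns index training examples, which matches the horizontal and vertical indexing described above.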
Explanation for Vectorized Implementation - 7m
0:00
In the previous video, we saw how, with your training examples stacked up horizontally in the matrix X, you can derive a vectorized implementation for propagation through your neural network. Let's give a bit more justification for why the equations we wrote down are a correct implementation of vectorizing across multiple examples. So let's go through part of the forward propagation calculation for a few examples. Let's say that for the first training example, you end up computing W[1] x(1) + b[1], then for the second training example, you end up computing W[1] x(2) + b[1], and then for the third training example, you end up computing W[1] x(3) + b[1]. So, just to simplify the explanation on this slide, I'm going to ignore b. Let's just say, to simplify this justification a little bit, that b is equal to zero. But the argument we're going to lay out will work with just a little bit of a change even when b is non-zero. It does just simplify the description on the slide a bit. Now, W[1] is going to be some matrix, right? So I have some number of rows in this matrix. So if you look at this calculation for x(1), what you have is that W[1] times x(1) gives you some column vector, which I'll draw like this. And similarly, if you look at this vector x(2), you have that W[1] times x(2) gives some other column vector, right? And that gives you this z[1](2). And finally, if you look at x(3), you have W[1] times x(3), which gives you some third column vector, that's this z[1](3). So now, consider the training set capital X, which we form by stacking together all of our training examples. So the matrix capital X is formed by taking the vector x(1) and stacking it side by side with x(2) and then also x(3). This is if we have only three training examples. If you have more, you know, they'll keep stacking horizontally like that.
But if you now take this matrix X and multiply it by W[1], then, if you think about how matrix multiplication works, you end up with the first column being these same values that I had drawn up there in purple. The second column will be those same four values. And the third column will be those orange values, whatever they turn out to be. But of course this is just equal to z[1](1) expressed as a column vector, followed by z[1](2) expressed as a column vector, followed by z[1](3), also expressed as a column vector. And this is if you have three training examples. If you get more examples, then there'd be more columns. And so, this is just our matrix capital Z[1]. So I hope this gives a justification for why we had previously W[1] times x(i) equals z[1](i) when we're looking at a single training example at a time. When you take the different training examples and stack them up in different columns, then the corresponding result is that you end up with the z's also stacked up in columns. And I won't show it, but you can convince yourself if you want that, with Python broadcasting, if you add back in these values of b, the values are still correct. What actually ends up happening is that, with Python broadcasting, you end up adding b[1] individually to each of the columns of this matrix. So on this slide, I've only justified that Z[1] = W[1] X + b[1] is a correct vectorization of the first of the four steps we had on the previous slide, but it turns out that a similar analysis allows you to show that the other steps also work, using very similar logic: if you stack the inputs in columns, then after the equation, you get the corresponding outputs also stacked up in columns. Finally, let's just recap everything we talked about in this video. If this is your neural network, we said that this is what you need to do if you were to implement forward propagation one training example at a time, going from i equals 1 through m.
And then we said, let's stack up the training examples in columns like so, and for each of these values z[1], a[1], z[2], a[2], let's stack up the corresponding columns as follows. So this is an example for a[1], but this is true for z[1], a[1], z[2], and a[2]. Then what we showed on the previous slide was that this line allows you to vectorize this across all m examples at the same time. And it turns out that with similar reasoning, you can show that all of the other lines are correct vectorizations of all four of these lines of code. And just as a reminder, x is also equal to a[0], because remember that the input feature vector x was equal to a[0], so x(i) equals a[0](i). Then there's actually a certain symmetry to these equations, where this first equation can also be written Z[1] = W[1] A[0] + b[1]. And so, you see that this pair of equations and this pair of equations actually look very similar, just with all of the indices advanced by one. So this kind of shows that the different layers of a neural network are roughly doing the same thing, or just doing the same computation over and over. And here we have a two-layer neural network; when we go to a much deeper neural network in next week's videos, you'll see that even deeper neural networks are basically taking these two steps and just doing them even more times than you're seeing here. So that's how you can vectorize your neural network across multiple training examples. Next, we've so far been using the sigmoid function throughout our neural networks. It turns out that's actually not the best choice. In the next video, let's dive a little bit further into how you can use different, what are called, activation functions, of which the sigmoid function is just one possible choice.
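The column-stacking argument can be checked numerically. The sketch below, with assumed toy sizes and random values, computes W[1] x(i) + b[1] one example at a time, then computes W[1] X + b[1] in one shot, and compares the two. The bias b1 is kept non-zero so the check also covers the broadcasting step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n1, m = 3, 4, 3            # assumed toy sizes: three training examples
W1 = rng.standard_normal((n1, n_x))
b1 = rng.standard_normal((n1, 1))
X = rng.standard_normal((n_x, m))

# One training example at a time: column i of Z1_loop is W1 x(i) + b1.
Z1_loop = np.zeros((n1, m))
for i in range(m):
    x_i = X[:, i:i + 1]                     # keep x(i) as an (n_x, 1) column
    Z1_loop[:, i:i + 1] = W1 @ x_i + b1

# Vectorized: broadcasting adds b1 to every column of W1 @ X.
Z1_vec = W1 @ X + b1

columns_match = np.allclose(Z1_loop, Z1_vec)
print(columns_match)
```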
Activation functions - 10m
0:00
When you build your neural network, one of the choices you get to make is what activation function to use in the hidden layers, as well as at the output units of your neural network. So far, we've just been using the sigmoid activation function. But sometimes other choices can work much better. Let's take a look at some of the options. In the forward propagation steps for the neural network, we have these steps where we use the sigmoid function here. So that sigmoid is called an activation function. And here's the familiar sigmoid function, a equals one over one plus e to the negative z. So in the more general case, we can have a different function,
0:45
g of z, which I'm going to write here,
0:51
where g could be a nonlinear function that may not be the sigmoid function. So for example, the sigmoid function goes between zero and one. An activation function that almost always works better than the sigmoid function is the tanh function, or the hyperbolic tangent function. So this is z, this is a, this is a = tanh(z), and this goes between plus 1 and minus 1. The formula for the tanh function is e to the z minus e to the negative z, over their sum. And it's actually, mathematically, a shifted version of the sigmoid function. So it's the sigmoid function, just like that, but shifted so that it now crosses the (0, 0) point, and rescaled so that it goes between minus 1 and plus 1. And it turns out that for hidden units, if you let the function g of z be equal to tanh(z), this almost always works better than the sigmoid function, because with values between plus 1 and minus 1, the mean of the activations that come out of your hidden layer is closer to 0. And so, just as sometimes when you train a learning algorithm you might center the data and have your data have 0 mean, using a tanh instead of a sigmoid function kind of has the effect of centering your data, so that the mean of your data is closer to 0 rather than, maybe, 0.5. And this actually makes learning for the next layer a little bit easier. We'll say more about this in the second course, when we talk about optimization algorithms as well. But one takeaway is that I pretty much never use the sigmoid activation function anymore. The tanh function is almost always strictly superior. The one exception is for the output layer, because if y is either 0 or 1, then it makes sense for y hat to be a number that you want to output that's between 0 and 1 rather than between minus 1 and 1. So the one exception where I would use the sigmoid activation function is when you are using binary classification, in which case you might use the sigmoid activation function for the output layer.
So g of z[2] here is equal to sigma of z[2]. And so what you see in this example is where you might have a tanh activation function for the hidden layer, and sigmoid for the output layer. So the activation functions can be different for different layers. And sometimes, to note that activation functions are different for different layers, we might use these square bracket superscripts as well, to indicate that g[1] may be different than g[2]. And again, the square bracket one superscript refers to this layer, and the square bracket two superscript refers to the output layer.
4:13
Now, one of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient or the derivative or the slope of this function becomes very small. So if z is very large or z is very small, the slope of the function ends up being close to 0. And so this can slow down gradient descent.
4:36
So one other choice that is very popular in machine learning is what's called the rectified linear unit. So the ReLU function looks like this.
4:50
And the formula is a = max(0, z). So the derivative is 1, so long as z is positive. And the derivative, or the slope, is 0 when z is negative. If you're implementing this, technically the derivative when z is exactly 0 is not well defined. But when you implement this on a computer, the odds that you get exactly z equals 0.000000000000 are very small, so you don't need to worry about it in practice. You could pretend the derivative, when z is equal to 0, is either 1 or 0, and then it works just fine. So the fact that it's not differentiable at 0 doesn't matter in practice. So here are some rules of thumb for choosing activation functions. If your output is a 0 or 1 value, if you're using binary classification, then the sigmoid activation function is a very natural choice for the output layer. And then for all other units, the ReLU, or the rectified linear unit,
6:02
is increasingly the default choice of activation function. So if you're not sure what to use for your hidden layer, I would just use the ReLU activation function. It's what you see most people using these days, although sometimes people also use the tanh activation function. One disadvantage of the ReLU is that the derivative is equal to zero when z is negative. In practice, this works just fine. But there is another version of the ReLU called the leaky ReLU. I will give you the formula on the next slide. But instead of it being 0 when z is negative, it just takes a slight slope like so, so this is called the leaky ReLU.
6:45
This usually works better than the ReLU activation function, although it's just not used as much in practice. Either one should be fine, although, if you had to pick one, I usually just use the ReLU. And the advantage of both the ReLU and the leaky ReLU is that for a lot of the space of Z, the derivative of the activation function, the slope of the activation function is very different from 0. And so in practice, using the ReLU activation function, your neural network will often learn much faster than when using the tanh or the sigmoid activation function. And the main reason is that there is less of these effects of the slope of the function going to 0, which slows down learning. And I know that for half of the range of z, the slope of ReLU is 0, but in practice, enough of your hidden units will have z greater than 0. So learning can still be quite fast for most training examples. So let's just quickly recap the pros and cons of different activation functions. Here's a sigmoid activation function. I will say never use this, except for the output layer, if you are doing binary classification, or maybe almost never use this.
7:57
And the reason I almost never use this is because the tanh is pretty much strictly superior. So the tanh activation function is this.
8:11
And then the default, the most commonly used activation function is the ReLU, which is this.
8:18
So if you're not sure what else to use, use this one, and maybe feel free also to try the leaky ReLU, where it might be max(0.01 z, z). Right? So a is the max of 0.01 times z and z, so that gives you this bend in the function. And you might say, why is that constant 0.01? Well, you can also make that another parameter of the learning algorithm. And some people say that works even better. But I hardly see people do that.
8:58
But if you feel like trying that in your application, please feel free to do so. And you can just see how it works, and how well it works, and stick with it if it gives you a good result. So I hope that gives you a sense of some of the choices of activation functions you can use in your neural network. One of the themes we'll see in deep learning is that you often have a lot of different choices in how you code your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights, which we'll see later. A lot of choices like that. And it turns out that it's sometimes difficult to get good guidelines for exactly what would work best for your problem. So throughout these courses I keep on giving you a sense of what I see in the industry in terms of what's more or less popular. But for your application, with your application's idiosyncrasies, it's actually very difficult to know in advance exactly what will work best. So a common piece of advice would be, if you're not sure which one of these activation functions works best, try them all, and evaluate on a holdout validation set, or a development set, which we'll talk about later, and see which one works better, and then go with that. And I think that by testing these different choices for your application, you'd be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as evolutions of the algorithms, rather than if I were to tell you always use a ReLU activation and don't use anything else. That just may or may not apply for whatever problem you end up working on, either in the near future or in the distant future. All right, so that was the choice of activation functions, and you've seen the most popular activation functions. There's one other question that sometimes you could ask, which is, why do you even need to use an activation function at all? Why not just do away with that?
So let's talk about that in the next video, where you see why neural networks do need some sort of nonlinear activation function.
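As a recap of the four activation functions from this video, here is a small NumPy sketch. The function names and the default slope argument are assumptions for illustration; the formulas are the ones given above (sigmoid, tanh, a = max(0, z), and a = max(0.01 z, z)). The last line checks the remark that tanh is a shifted and rescaled sigmoid, using the identity tanh(z) = 2 sigmoid(2z) - 1.

```python
import numpy as np

def sigmoid(z):
    """a = 1 / (1 + e^(-z)); output lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """a = (e^z - e^(-z)) / (e^z + e^(-z)); output lies between -1 and 1."""
    return np.tanh(z)

def relu(z):
    """a = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """a = max(slope * z, z); assumes 0 < slope < 1."""
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
print(leaky_relu(z))

# tanh as a shifted, rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
tanh_matches = np.allclose(tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```

To compare choices on your own problem, as suggested above, you could pass one of these functions in as the hidden-layer activation and evaluate each variant on a development set.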
Why do you need non-linear activation functions? - 5m
0:00
Why does a neural network need a non-linear activation function? It turns out that for your neural network to compute interesting functions, you do need to pick a non-linear activation function. Let's see why. So, here are the forward prop equations for the neural network. Why don't we just get rid of this? Get rid of the function g, and set a[1] equal to z[1]. Or alternatively, you can say that g of z is equal to z, all right? Sometimes this is called the linear activation function. Maybe a better name for it would be the identity activation function, because it just outputs whatever was input. For the purpose of this, what if a[2] was just equal to z[2] as well? It turns out if you do this, then this model is just computing y, or y-hat, as a linear function of your input features x. To see this, take the first two equations. If you have that a[1] = z[1] = W[1] x + b[1], and then a[2] = z[2] = W[2] a[1] + b[2], then if you take this definition of a[1] and plug it in there, you find that a[2] = W[2] (W[1] x + b[1]) + b[2], move that up a bit. Right? So this part is a[1], plus b[2], and so this simplifies to (W[2] W[1]) x + (W[2] b[1] + b[2]).
1:58
So this is just, let's call this W' and this b'. So this is just equal to W' x + b'. If you were to use linear activation functions, or we can also call them identity activation functions, then the neural network is just outputting a linear function of the input.
2:23
And we'll talk about deep networks later, neural networks with many, many layers, many hidden layers. And it turns out that if you use a linear activation function or alternatively, if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear activation function. So you might as well not have any hidden layers.
2:47
Some of the cases that are briefly mentioned, it turns out that if you have a linear activation function here and a sigmoid function here, then this model is no more expressive than standard logistic regression without any hidden layer. So I won't bother to prove that, but you could try to do so if you want. But the take home is that a linear hidden layer is more or less useless because the composition of two linear functions is itself a linear function.
3:17
So unless you throw a non-linear [INAUDIBLE] in there, you're not computing more interesting functions even as you go deeper in the network. There is just one place where you might use a linear activation function, g(z) = z, and that's if you are doing machine learning on a regression problem, so if y is a real number. So for example, if you're trying to predict housing prices, y is not 0 or 1 but a real number, anywhere from, I don't know, $0 as the price of a house up to however expensive houses get, I guess. Maybe houses can be potentially millions of dollars, so however much houses cost in your data set. But if y takes on these real values,
4:10
then it might be okay to have a linear activation function here so that your output y hat is also a real number going anywhere from minus infinity to plus infinity.
4:24
But then the hidden units should not use linear activation functions. They could use ReLU or tanh or leaky ReLU or maybe something else. So the one place you might use a linear activation function is usually in the output layer. But other than that, using a linear activation function in the hidden layer, except for some very special circumstances relating to compression that we're going to talk about, is extremely rare. And, of course, if we're actually predicting housing prices, as you saw in the week one video, because housing prices are all non-negative, perhaps even then you can use a ReLU activation function so that your outputs y-hat are all greater than or equal to 0. So I hope that gives you a sense of why having a non-linear activation function is a critical part of neural networks. Next we're going to start to talk about gradient descent, and to set up for our discussion of gradient descent, in the next video I want to show you how to estimate, or rather how to compute, the slope or the derivatives of individual activation functions. So let's go on to the next video.
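The collapse of a linear hidden layer can be verified numerically. This sketch uses assumed toy sizes and random parameters: it runs a two-layer network with identity activations, then builds the single equivalent linear map W' = W[2] W[1] and b' = W[2] b[1] + b[2] from the derivation above and compares the outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n1, n2, m = 3, 4, 1, 6     # assumed toy sizes
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n1, n_x))
b1 = rng.standard_normal((n1, 1))
W2 = rng.standard_normal((n2, n1))
b2 = rng.standard_normal((n2, 1))

# Two layers with the identity ("linear") activation g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same function written as one linear layer.
W_prime = W2 @ W1               # shape (n2, n_x)
b_prime = W2 @ b1 + b2          # shape (n2, 1)
A2_single_layer = W_prime @ X + b_prime

same_output = np.allclose(A2, A2_single_layer)
print(same_output)
```

Swapping the first line of the forward pass for a non-linear activation such as np.tanh breaks this equality, which is the point of the argument.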
Derivatives of activation functions - 7m
0:00
When you implement back propagation for your neural network, you need to compute the slope, or the derivative, of the activation functions. So, let's take a look at our choices of activation functions and how you can compute the slope of these functions. Here's the familiar sigmoid activation function. So, for any given value of z, maybe this value of z, this function will have some slope or some derivative corresponding to, if you draw a little line there, the height over width of this little triangle here. So, if g of z is the sigmoid function, then the slope of the function is d/dz g(z), and so we know from calculus that it is the slope of g at z. If you are familiar with calculus and know how to take derivatives, if you take the derivative of the sigmoid function, it is possible to show that it is equal to this formula. Again, I'm not going to do the calculus steps, but if you are familiar with calculus, feel free to pause the video and try to prove this yourself. So, this is equal to just g of z times 1 minus g of z. So, let's just sanity check that this expression makes sense. First, if z is very large, so say z is equal to 10, then g of z will be close to 1, and so the formula we have on the left tells us that d/dz g(z) will be close to 1 times 1 minus 1, which is therefore very close to 0. This is indeed correct, because when z is very large, the slope is close to 0. Conversely, if z is equal to minus 10, so say, well, over there, then g of z is close to 0. So, the formula on the left tells us d/dz g(z) would be close to 0 times 1 minus 0. So it is also very close to 0, which is correct. Finally, if z is equal to 0, then g of z is equal to one-half, that's the sigmoid function right here, and so the derivative is equal to one-half times 1 minus one-half, which is equal to one-quarter, and that actually turns out to be the correct value of the derivative or the slope of this function when z is equal to 0.
Finally, just to introduce one more piece of notation, sometimes instead of writing this thing, the shorthand for the derivative is g prime of z. So, g prime of z, in calculus the little dash on top is called prime, so g prime of z is a shorthand in calculus for the derivative of the function g with respect to the input variable z. Then in a neural network, we have a equals g of z, equals this, so this formula also simplifies to a times 1 minus a. So, sometimes in implementations, you might see something like g prime of z equals a times 1 minus a, and that just refers to the observation that g prime, which just means the derivative, is equal to this over here. The advantage of this formula is that if you've already computed the value for a, then by using this expression, you can very quickly compute the value of the slope, g prime, as well. All right. So, that was the sigmoid activation function. Let's now look at the tanh activation function. Similar to what we had previously, the definition of d/dz g(z) is the slope of g of z at a particular point z, and if you look at the formula for the hyperbolic tangent function, and if you know calculus, you can take derivatives and show that this simplifies to this formula, 1 minus tanh of z squared, and using the shorthand we had previously, we call this g prime of z again. So, if you want, you can sanity check that this formula makes sense. So, for example, if z is equal to 10, tanh of z will be very close to 1. This goes from plus 1 to minus 1. Then g prime of z, according to this formula, would be about 1 minus 1 squared, so that's very close to 0. So, that was if z is very large, the slope is close to 0. Conversely, if z is very small, say z is equal to minus 10, then tanh of z will be close to minus 1, and so g prime of z will be close to 1 minus negative 1 squared. So, it's close to 1 minus 1, which is also close to 0.
Then finally, if z is equal to 0, then tanh of z is equal to 0, and then the slope is actually equal to 1, which is indeed the slope when z is equal to 0. So, just to summarize, if a is equal to g of z, so if a is equal to this tanh of z, then the derivative, g prime of z, is equal to 1 minus a squared. So, once again, if you've already computed the value of a, you can use this formula to very quickly compute the derivative as well. Finally, here's how you compute the derivatives for the ReLU and leaky ReLU activation functions. For the ReLU, g of z is equal to max of 0, z, so the derivative turns out to be 0 if z is less than 0 and 1 if z is greater than 0. It's actually technically undefined if z is equal to exactly 0. But if you're implementing this in software, it might not be 100 percent mathematically correct, but it'll work just fine if, when z is exactly 0, you set the derivative to be equal to 1. Or you could set it to be 0; it doesn't matter. If you're an expert in optimization, technically, g prime then becomes what's called a sub-gradient of the activation function g of z, which is why gradient descent still works. But you can think of it as that the chance of z being exactly 0.000000 is so small that it almost doesn't matter what you set the derivative to be equal to when z is equal to 0. So, in practice, this is what people implement for the derivative of g. Finally, if you are training a neural network with a leaky ReLU activation function, then g of z is going to be max of, say, 0.01 z, z, and so, g prime of z is equal to 0.01 if z is less than 0 and 1 if z is greater than 0. Once again, the gradient is technically not defined when z is exactly equal to 0, but if you implement a piece of code that sets the derivative, or that sets g prime, to either 0.01 or to 1, either way, it doesn't really matter. When z is exactly 0, your code will work just fine.
So, armed with these formulas, you should be able to compute either the slopes or the derivatives of your activation functions. Now that we have this building block, you're ready to see how to implement gradient descent for your neural network. Let's go on to the next video to see that.
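The derivative formulas from this video can be written as a few short NumPy helpers. This is a sketch with assumed function names; at z exactly 0 the ReLU and leaky ReLU helpers return 1, which is one of the two arbitrary conventions mentioned above. The last lines compare sigmoid_prime against a numerical slope estimate, an extra check that is not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """g'(z) = a * (1 - a), where a = sigmoid(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_prime(z):
    """g'(z) = 1 - a^2, where a = tanh(z)."""
    a = np.tanh(z)
    return 1.0 - a ** 2

def relu_prime(z):
    """0 for z < 0, 1 for z > 0; z == 0 is set to 1 by convention here."""
    return np.where(z >= 0, 1.0, 0.0)

def leaky_relu_prime(z, slope=0.01):
    """slope for z < 0, 1 for z > 0; z == 0 is set to 1 by convention here."""
    return np.where(z >= 0, 1.0, slope)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_prime(z))   # near 0 at both ends, one-quarter at z = 0
print(tanh_prime(z))      # near 0 at both ends, 1 at z = 0

# Numerical slope of the sigmoid at an arbitrary point, via a central difference.
z0, eps = 0.3, 1e-6
numeric_slope = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2.0 * eps)
```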
Gradient descent for Neural Networks - 9m
0:00
All right. I think this'll be an exciting video. In this video, you'll see how to implement gradient descent for your neural network with one hidden layer. In this video, I'm going to just give you the equations you need to implement in order to get back-propagation or to get gradient descent working, and then in the video after this one, I'll give some more intuition about why these particular equations are the accurate equations, are the correct equations for computing the gradients you need for your neural network. So, your neural network, with a single hidden layer for now, will have parameters W1, B1, W2, and B2. So, as a reminder, if you have NX or alternatively N0 input features, and N1 hidden units, and N2 output units in our examples. So far I've only had N2 equals one, then the matrix W1 will be N1 by N0. B1 will be an N1 dimensional vector, so we can write that as N1 by one-dimensional matrix, really a column vector. The dimensions of W2 will be N2 by N1, and the dimension of B2 will be N2 by one. Right, so far we've only seen examples where N2 is equal to one, where you have just one single hidden unit. So, you also have a cost function for a neural network. For now, I'm just going to assume that you're doing binary classification. So, in that case, the cost of your parameters as follows is going to be one over M of the average of that loss function. So, L here is the loss when your neural network predicts Y hat, right. This is really A2 when the gradient label is equal to Y. If you're doing binary classification, the loss function can be exactly what you use for logistic regression earlier. So, to train the parameters of your algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly rather than to all zeros. We'll see later why that's the case, but after initializing the parameter to something, each loop or gradient descents with computed predictions. 
So, you basically compute your Y hat I, for I equals one through M, say. Then, you need to compute the derivative. So, you need to compute DW1, and that's the derivative of the cost function with respect to the parameter W1, you can compute another variable, shall I call DB1, which is the derivative or the slope of your cost function with respect to the variable B1 and so on. Similarly for the other parameters W2 and B2. Then finally, the gradient descent update would be to update W1 as W1 minus Alpha. The learning rate times D, W1. B1 gets updated as B1 minus the learning rate, times DB1, and similarly for W2 and B2. Sometimes, I use colon equals and sometimes equals, as either notation works fine. So, this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging. So, in previous videos, we talked about how to compute the predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So, the key is to know how to compute these partial derivative terms, the DW1, DB1 as well as the derivatives DW2 and DB2. So, what I'd like to do is just give you the equations you need in order to compute these derivatives. I'll defer to the next video, which is an optional video, to go greater into Jeff about how we came up with those formulas. So, let me just summarize again the equations for propagation. So, you have Z1 equals W1X plus B1, and then A1 equals the activation function in that layer applied element wise as Z1, and then Z2 equals W2, A1 plus V2, and then finally, just as all vectorized across your training set, right? A2 is equal to G2 of Z2. Again, for now, if we assume we're doing binary classification, then this activation function really should be the sigmoid function, same just for that end neural. So, that's the forward propagation or the left to right for computation for your neural network. Next, let's compute the derivatives. 
So, this is the back propagation step. Then I compute DZ2 equals A2 minus the gradient of Y, and just as a reminder, all this is vectorized across examples. So, the matrix Y is this one by M matrix that lists all of your M examples stacked horizontally. Then it turns out DW2 is equal to this, and in fact, these first three equations are very similar to gradient descents for logistic regression. X is equals one, comma, keep dims equals true. Just a little detail this np.sum is a Python NumPy command for summing across one-dimension of a matrix. In this case, summing horizontally, and what keepdims does is, it prevents Python from outputting one of those funny rank one arrays, right? Where the dimensions was your N comma. So, by having keepdims equals true, this ensures that Python outputs for DB a vector that is N by one. In fact, technically this will be I guess N2 by one. In this case, it's just a one by one number, so maybe it doesn't matter. But later on, we'll see when it really matters. So, so far what we've done is very similar to logistic regression. But now as you continue to run back propagation, you will compute this, DZ2 times G1 prime of Z1. So, this quantity G1 prime is the derivative of whether it was the activation function you use for the hidden layer, and for the output layer, I assume that you are doing binary classification with the sigmoid function. So, that's already baked into that formula for DZ2, and his times is element-wise product. So, this here is going to be an N1 by M matrix, and this here, this element-wise derivative thing is also going to be an N1 by N matrix, and so this times there is an element-wise product of two matrices. Then finally, DW1 is equal to that, and DB1 is equal to this, and p.sum DZ1 axis equals one, keepdims equals true. So, whereas previously the keepdims maybe matter less if N2 is equal to one. Result is just a one by one thing, is just a real number. 
Here, DB1 will be an N1 by one vector, and so you want Python, you want np.sum, to output something of this dimension rather than a funny rank one array of that dimension, which could end up messing up some of your later calculations. The other way would be to not have the keepdims parameter, but to explicitly reshape the output of np.sum into this dimension, which you would like DB to have. So, that was forward propagation in I guess four equations, and back propagation in I guess six equations. I know I just wrote down these equations, but in the next optional video, let's go over some intuitions for how the six equations for the back propagation algorithm were derived. Please feel free to watch that or not. But either way, if you implement these algorithms, you will have a correct implementation of forward prop and back prop. You'll be able to compute the derivatives you need in order to apply gradient descent, to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus. A lot of successful deep learning practitioners do so. But, if you want, you can also watch the next video, just to get a bit more intuition for the derivation of these equations.
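The four forward equations, the six backward equations and the update rule from this video can be sketched in NumPy roughly as follows. This is a minimal sketch, not the course's reference implementation: the function names, the dictionary layout, the toy data and the choice of tanh for the hidden layer are all assumptions made for illustration (the lecture only requires some activation G1 with derivative G1 prime).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward propagation, vectorized across the m columns of X."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.tanh(Z1)                      # assumed hidden activation g1
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = sigmoid(Z2)                      # sigmoid output for binary classification
    return Z1, A1, Z2, A2

def backward(X, Y, params, cache):
    """The six back propagation equations, vectorized across examples."""
    m = X.shape[1]
    Z1, A1, Z2, A2 = cache
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # shape (n2, 1), not (n2,)
    dZ1 = (params["W2"].T @ dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1^2, element-wise
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # shape (n1, 1), not (n1,)
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def gradient_descent_step(X, Y, params, learning_rate=0.1):
    """One iteration: forward prop, back prop, then the parameter update."""
    grads = backward(X, Y, params, forward(X, params))
    return {k: params[k] - learning_rate * grads[k] for k in params}

# Toy problem: n0 = 3 input features, n1 = 4 hidden units, n2 = 1 output, m = 5 examples.
rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 5
X = rng.standard_normal((n0, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
params = {
    "W1": rng.standard_normal((n1, n0)) * 0.01,
    "b1": np.zeros((n1, 1)),
    "W2": rng.standard_normal((n2, n1)) * 0.01,
    "b2": np.zeros((n2, 1)),
}
grads = backward(X, Y, params, forward(X, params))
new_params = gradient_descent_step(X, Y, params)
for name in params:
    print(name, params[name].shape, grads[name].shape)
```

Each gradient has the same shape as the parameter it belongs to, which is what keepdims=True guarantees for db1 and db2. Without it, np.sum(dZ1, axis=1) would return a rank one array of shape (4,) here.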
Backpropagation intuition (optional) - 15m
0:00
In the last video, you saw the equations for back propagation. In this video, let's go over some intuition using the computation graph for how those equations were derived. This video is completely optional. So, feel free to watch or not. You should be able to do the homework either way. So, recall that when we talked about logistic regression, we had this forward pass where we compute Z, then A and then the loss. And then to take the derivatives, we had this backward pass where we could first compute DA, and then go on to compute DZ, and then go on to compute DW and DB. So, the definition for the loss was L of A, Y equals negative Y log A, minus one minus Y, times log of one minus A. So, if you are familiar with calculus and you take the derivative of this with respect to A, that would give you the formula for DA. So, DA is equal to that. And if we actually work out the calculus, you could show that this is negative Y over A, plus one minus Y over one minus A. You just kind of derive that from calculus by taking derivatives of this. It turns out when you take another step backwards to compute DZ, we did work out that DZ is equal to A minus Y. I didn't explain why previously, but it turns out that from the chain rule of calculus DZ is equal to DA times G prime of Z. Where here G of Z equals sigmoid of Z is our activation function for this output unit in logistic regression, right? So, just remember this is still logistic regression where we have X1, X2, X3 and then just one sigmoid unit, and that gives us A, which gives us Y hat. So, here the activation function was a sigmoid function. And as an aside, only for those of you familiar with the chain rule of calculus, the reason for this is because A is equal to sigmoid of Z. And so, partial of L with respect to Z is equal to partial of L with respect to A times DA/DZ. Since A is equal to sigmoid of Z, this DA/DZ is equal to D/DZ of G of Z, which is equal to G prime of Z. 
So, that's why this expression which is DZ in our code is equal to this expression which is DA in our code times G prime of Z. And so this is just that. So, that last derivation would make sense only if you're familiar with calculus and specifically the chain rule from calculus. But if not, don't worry about it. I'll try to explain the intuition wherever it's needed. And then finally, having computed DZ for logistic regression, we will compute DW, which turns out was DZ times X, and DB, which is just DZ when you have a single training example. So, that was logistic regression. So, what we're going to do when computing back propagation for a neural network is a calculation a lot like this, but only we'll do it twice, because now we have not X going to an output unit, but X going to a hidden layer and then going to an output unit. And so instead of this computation being sort of one step as we have here, we'll have two steps here in this kind of a neural network with two layers. So, in this two layer neural network, that is, we have the input layer, a hidden layer and then an output layer. Remember the steps of a computation. First you compute Z1 using this equation, and then compute A1 and then you compute Z2. And notice Z2 also depends on the parameters W2 and B2. And then based on Z2, compute A2 and then finally that gives you the loss. What backpropagation does is it will go backward to compute DA2 and then DZ2, and then you go back to compute DW2 and DB2, go backwards to compute DA1, DZ1 and so on. We don't need to take the derivative with respect to the input X, since the input X for supervised learning is fixed. We're not trying to optimize X, so we won't bother to take the derivatives with respect to X, at least for supervised learning. I'm going to skip explicitly computing DA2. If you want, you can actually compute DA2 and then use that to compute DZ2 but, in practice, you could collapse both of these steps into one step so you end up at DZ2 = A2 - Y, same as before. 
And, you have also, I'm going to write DW2 and DB2 down here below. You have that DW2 = DZ2 times A1 transpose, and DB2 = DZ2. This step is quite similar to logistic regression, where we had that DW = DZ times X, except that now A1 plays the role of X and there's an extra transpose there, because of the relationship between the capital matrix W and our individual parameters W. There's a transpose there, right? Because capital W is a row vector, W=[---], in the case of a single output as in logistic regression. DW2 is like that, whereas W here was a column vector, so that's why there's an extra transpose for A1, whereas we didn't have one for X here for logistic regression. This completes half of backpropagation. Then, again, you can compute DA1 if you wish. Although, in practice, the computations for DA1 and DZ1 are usually collapsed into one step, and so what you'll actually implement is that DZ1 = W2 transpose times DZ2, and then times an element-wise product with G1 prime of Z1. And, just to do a check on the dimensions, right? If you have a neural network that looks like this, with output Y hat. If you have N0, NX = N0 input features, N1 hidden units, and N2 output units, where N2, in our case, is just one output unit, then the matrix W2 is (N2, N1) dimensional, and Z2 and therefore DZ2 are going to be N2 by one dimensional. This really is going to be a one by one when we are doing binary classification, and Z1 and therefore also DZ1 are going to be N1 by one dimensional, right? Note that any variable foo and D foo always have the same dimension. That's why W and DW always have the same dimension, and similarly for B and DB, and Z and DZ, and so on. To make sure that the dimensions of this all match up, we have that DZ1 = W2 transpose times DZ2, and then this is an element-wise product with G1 prime of Z1. Matching the dimensions from above, this is going to be N1 by one = W2 transpose, we take the transpose of this so it's going to be N1 by N2 dimensional. DZ2 is going to be N2 by one dimensional, and then this, this is the same dimension as Z1. 
This is also N1 by one dimensional, so it's an element-wise product. The dimensions do make sense, right? An N1 by one dimensional vector can be obtained by an N1 by N2 dimensional matrix times an N2 by one vector, because the product of these two things gives you an N1 by one dimensional matrix, and so this becomes the element-wise product of two N1 by one dimensional vectors, and so the dimensions do match. One tip when implementing back prop. If you just make sure that the dimensions of your matrices match up, so you think through what are the dimensions of the various matrices including W1, W2, Z1, Z2, A1, A2 and so on, and just make sure that the dimensions of these matrix operations match up, sometimes that will already eliminate quite a lot of bugs in back prop. All right. This gives us the DZ1 and then finally, just to wrap up, DW1 and DB1. We should write them here I guess, but since I'm running out of space on the right of the slide, DW1 and DB1 are given by the following formulas. DW1 is going to be equal to DZ1 times X transpose, and DB1 is going to be equal to DZ1. You might notice a similarity between these equations and these equations, which is really no coincidence, because X plays the role of A0, so X transpose is A0 transpose. Those equations are actually very similar. That gives a sense for how backpropagation is derived. We have six key equations here for DZ2, DW2, DB2, DZ1, DW1 and DB1. Let me just take these six equations and copy them over to the next slide. Here they are. So far, we have derived backpropagation for when you are training on a single training example at a time, but it should come as no surprise that rather than working on a single example at a time, we would like to vectorize across different training examples. Remember that for forward propagation, when we were operating on one example at a time, we had equations like this, as well as, say, A1 = G1 of Z1. 
In order to vectorize, we took, say, the Zs and stacked them up in columns like this, up to Z1(M), and called this capital Z. Then we found that by stacking things up in columns and defining the capital uppercase version of this, we then just had Z1 = W1 X + B1 and A1 = G1 of Z1, right? We defined the notation very carefully in this course to make sure that stacking examples into different columns of a matrix makes all this work out. It turns out that if you go through the math carefully, the same trick also works for backpropagation, so the vectorized equations are as follows. First, if you take these DZs for different training examples and stack them as the different columns of a matrix, and the same for this and the same for this, then this is the vectorized implementation, and then here's the definition for, or here's how you can compute, DW2. There is this extra 1/M because the cost function J is this 1/M of the sum for I = one through M of the losses. When computing the derivatives, we have that extra 1/M term just as we did when we were computing the weight updates for logistic regression. That's the update you get for DB2. Again, a sum of the DZs, with a 1/M, and then DZ1 is computed as follows. Once again, this is an element-wise product, only whereas previously, we saw on the previous slide that this was an N1 by one dimensional vector, now this is an N1 by M dimensional matrix. Both of these are also N1 by M dimensional. That's why that asterisk is an element-wise product, and then finally, the remaining two updates. Perhaps they shouldn't look too surprising. I hope that gives you some intuition for how the backpropagation algorithm is derived. In all of machine learning, I think the derivation of the backpropagation algorithm is actually one of the most complicated pieces of math I've seen, and it requires knowing both linear algebra as well as the derivatives of matrices to re-derive it from scratch from first principles. 
If you are an expert in matrix calculus, using this process, you might derive the algorithm yourself, but I think there are actually plenty of deep learning practitioners that have seen the derivation at about the level you've seen in this video and are already able to have all the right intuitions and be able to implement this algorithm very effectively. If you are an expert in calculus, do see if you can derive the whole thing from scratch. It is one of the very hardest pieces of math, one of the very hardest derivations that I've seen in all of machine learning. Either way, if you implement this, this will work, and I think you have enough intuitions to tune it and get it to work. There's just one last detail I want to share with you before you implement your neural network, which is how to initialize the weights of your neural network. It turns out that initializing your parameters, not to zero but randomly, turns out to be very important for training your neural network. In the next video, you'll see why.
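For readers comfortable with calculus, the chain-rule step described in this video for the sigmoid output unit can be written out as below. This is a sketch of the standard derivation using the video's quantities (a = sigmoid of z, the loss L(a, y), and da, dz as the names for the partial derivatives); the one fact added here that the video does not state is the sigmoid derivative g'(z) = a(1 - a).

```latex
\begin{align*}
\mathcal{L}(a, y) &= -y \log a - (1 - y)\log(1 - a) \\
da = \frac{\partial \mathcal{L}}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} \\
g'(z) = \frac{da}{dz} &= a(1 - a) \qquad \text{for } a = \sigma(z) \\
dz = \frac{\partial \mathcal{L}}{\partial z} = da \cdot g'(z)
  &= \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a) \\
  &= -y(1 - a) + (1 - y)\,a \;=\; a - y
\end{align*}
```

The same collapse of two steps into one is what gives DZ2 = A2 - Y for the output layer of the two layer network.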
Random Initialization - 7m
0:00
When you train your neural network, it's important to initialize the weights randomly. For logistic regression, it was okay to initialize the weights to zero. But for a neural network, if you initialize the weights or parameters to all zero and then apply gradient descent, it won't work. Let's see why. So you have here two input features, so n0=2, and two hidden units, so n1=2. And so the matrix associated with the hidden layer, w1, is going to be two-by-two. Let's say that you initialize it to all 0s, so 0 0 0 0, a two-by-two matrix. And let's say b1 is also equal to 0 0. It turns out initializing the bias terms b to 0 is actually okay, but initializing w to all 0s is a problem. So the problem with this initialization is that for any example you give it, you'll have that a1,1 and a1,2 will be equal, right? So this activation and this activation will be the same, because both of these hidden units are computing exactly the same function. And then, when you compute backpropagation, it turns out that dz11 and dz12 will also be the same, kind of by symmetry, right? Both of these hidden units are initialized the same way. Technically, for what I'm saying, I'm assuming that the outgoing weights are also identical. So that's w2 is equal to 0 0. But if you initialize the neural network this way, then this hidden unit and this hidden unit are completely identical. Sometimes you say they're completely symmetric, which just means that they're computing exactly the same function. And by kind of a proof by induction, it turns out that after every single iteration of training your two hidden units are still computing exactly the same function. Since [INAUDIBLE] show that dw will be a matrix that looks like this, where every row takes on the same value. So when you perform a weight update, w1 gets updated as w1 - alpha times dw. You find that w1, after every iteration, will have the first row equal to the second row. 
So it's possible to construct a proof by induction that if you initialize all the weights, all the values of w, to 0, then because both hidden units start off computing the same function, and both hidden units have the same influence on the output unit, then after one iteration, that same statement is still true, the two hidden units are still symmetric. And therefore, by induction, after two iterations, three iterations and so on, no matter how long you train your neural network, both hidden units are still computing exactly the same function. And so in this case, there's really no point to having more than one hidden unit, because they are all computing the same thing. And of course, for larger neural networks, let's say of three features and maybe a very large number of hidden units, a similar argument works to show that with a neural network like this, [INAUDIBLE] drawing all the edges, if you initialize the weights to zero, then all of your hidden units are symmetric. And no matter how long you run gradient descent, they all continue to compute exactly the same function. So that's not helpful, because you want the different hidden units to compute different functions. The solution to this is to initialize your parameters randomly. So here's what you do. You can set w1 = np.random.randn(2, 2). This generates Gaussian random values in a two-by-two matrix. And then usually, you multiply this by a very small number, such as 0.01. So you initialize it to very small random values. And then b, it turns out that b does not have the symmetry problem, what's called the symmetry breaking problem. So it's okay to initialize b to just zeros. Because so long as w is initialized randomly, you start off with the different hidden units computing different things. And so you no longer have this symmetry breaking problem. And then similarly, for w2, you're going to initialize that randomly. And b2, you can initialize that to 0. 
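The symmetry argument can be checked numerically with a small experiment like the one below. It is only an illustrative sketch: the train helper, the toy data, the learning rate and the number of steps are made up, and the hidden units use a sigmoid (an assumption, chosen so that the gradients are not identically zero when starting from all-zero weights).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, Y, steps=50, learning_rate=0.5):
    """Plain gradient descent on a 2-2-1 network; both biases start at zero."""
    W1, W2 = W1.copy(), W2.copy()
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((W2.shape[0], 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)   # sigmoid'(Z1) = A1 * (1 - A1)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return W1, W2

np.random.seed(0)
X = np.random.randn(2, 6)                      # n0 = 2 features, m = 6 toy examples
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0, 1.0]])

# All-zero weights: the two hidden units start identical and stay identical.
W1_zero, _ = train(np.zeros((2, 2)), np.zeros((1, 2)), X, Y)

# Small random weights, as recommended in the video.
W1_rand, _ = train(np.random.randn(2, 2) * 0.01, np.random.randn(1, 2) * 0.01, X, Y)

print("zero init, rows of W1 equal:  ", np.allclose(W1_zero[0], W1_zero[1]))
print("random init, rows of W1 equal:", np.allclose(W1_rand[0], W1_rand[1]))
```

With the zero initialization the first and second rows of W1 remain equal after training, so the second hidden unit adds nothing; with the small random initialization the rows differ from the start.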
So you might be wondering, where did this constant come from and why is it 0.01? Why not put the number 100 or 1000? Turns out that we usually prefer to initialize the weights to very small random values. Because if you are using a tanh or sigmoid activation function, or a sigmoid even just at the output layer, then if the weights are too large, when you compute the activation values, remember that z[1] = w1 x + b1, and then a1 is the activation function applied to z1. So if w is very big, z will be very big, or at least some values of z will be either very large or very small. And so in that case, you're more likely to end up at these flat parts of the tanh function or the sigmoid function, where the slope or the gradient is very small. Meaning that gradient descent will be very slow. So learning will be very slow. So just to recap, if w is too large, you're more likely to end up, even at the very start of training, with very large values of z, which causes your tanh or your sigmoid activation function to be saturated, thus slowing down learning. If you don't have any sigmoid or tanh activation functions throughout your neural network, this is less of an issue. But if you're doing binary classification, and your output unit is a sigmoid function, then you just don't want the initial parameters to be too large. So that's why multiplying by 0.01 would be something reasonable to try, or any other small number. And same for w2, right? This can be np.random.randn, I guess this would be 1 by 2 in this example, times 0.01. So finally, it turns out that sometimes there can be better constants than 0.01. When you're training a neural network with just one hidden layer, it is a relatively shallow neural network, without too many hidden layers. Setting it to 0.01 will probably work okay. But when you're training a very very deep neural network, then you might want to pick a different constant than 0.01. 
And in next week's material, we'll talk a little bit about how and when you might want to choose a different constant than 0.01. But either way, it will usually end up being a relatively small number. So that's it for this week's videos. You now know how to set up a neural network with a hidden layer, initialize the parameters, and make predictions using forward prop, as well as compute derivatives and implement gradient descent using backprop. So with that, you should be able to do the quizzes, as well as this week's programming exercises. Best of luck with that. I hope you have fun with the programming exercise, and I look forward to seeing you in the week four materials.
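The saturation effect behind the 0.01 constant can be seen by comparing the average slope of tanh at initialization for a small and a large scale factor. This is a rough sketch under made-up assumptions: the inputs are random toy data, and mean_tanh_slope is a hypothetical helper, not something from the course.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(2, 1000)   # 1000 made-up examples with 2 features each

def mean_tanh_slope(scale):
    """Average slope of tanh at Z1 = W1 X + b1 when W1 = randn(2, 2) * scale."""
    W1 = np.random.randn(2, 2) * scale
    b1 = np.zeros((2, 1))
    Z1 = W1 @ X + b1
    return float(np.mean(1.0 - np.tanh(Z1) ** 2))   # derivative of tanh is 1 - tanh^2

small = mean_tanh_slope(0.01)
large = mean_tanh_slope(100.0)
print(f"scale 0.01: mean slope {small:.4f}")
print(f"scale 100:  mean slope {large:.4f}")
```

With the small scale, almost every z sits near zero where the slope of tanh is close to one. With the large scale, most values of z land on the flat parts of the curve, so the slope that multiplies the gradient in back propagation is close to zero and learning slows down.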
(Optional) Heroes of Deep Learning - Ian Goodfellow interview - 14m
0:02
Hi, Ian. Thanks a lot for joining us today. Thank you for inviting me, Andrew. I am glad to be here. Today, you are one of the world's most visible deep learning researchers. Let's hear a bit about your personal story. So, how did you end up doing this work that you now do? Yeah. That sounds great. I guess I first became interested in machine learning right before I met you, actually. I had been working on neuroscience and my undergraduate adviser, Jerry Cain, at Stanford encouraged me to take your Intro to AI class. Oh, I didn't know that. Okay. So I had always thought that AI was a good idea, but that in practice, the main, I think, idea that was happening was like game AI, where people have a lot of hard-coded rules for non-player characters in games to say different scripted lines at different points in time. And then, when I took your Intro to AI class and you covered topics like linear regression and the variance decomposition of the error of linear regression, I started to realize that this is a real science and I could actually have a scientific career in AI rather than neuroscience. I see. Great. And then what happened? Well, I came back and I TA'd your course later. Oh, I see. Right. Like a TA. So a really big turning point for me was while I was TA-ing that course, one of the students, my friend Ethan Dreifuss, got interested in Geoff Hinton's deep belief net paper. I see. And the two of us ended up building one of the first GPU CUDA-based machines at Stanford in order to run Boltzmann machines in our spare time over winter break. I see. And at that point, I started to have a very strong intuition that deep learning was the way to go in the future, that a lot of the other algorithms that I was working with, like support vector machines, didn't seem to have the right asymptotics, that you add more training data and they get slower, or for the same amount of training data, it's hard to make them perform a lot better by changing other settings. 
At that point, I started to focus on deep learning as much as possible. And I remember Rajat Raina's very old GPU paper acknowledges you for having done a lot of early work. Yeah. Yeah. That was written using some of the machines that we built. Yeah. The first machine I built was just something that Ethan and I built at Ethan's mom's house with our own money, and then later, we used lab money to build the first two or three for the Stanford lab. Wow, that's great. I never knew that story. That's great. And then, today, one of the things that's really taken the deep learning world by storm is your invention of GANs. So how did you come up with that? I've been studying generative models for a long time, so GANs are a way of doing generative modeling where you have a lot of training data and you'd like to learn to produce more examples that resemble the training data, but they're imaginary. They've never been seen exactly in that form before. There were several other ways of doing generative models that had been popular for several years before I had the idea for GANs. And after I'd been working on all those other methods throughout most of my Ph.D., I knew a lot about the advantages and disadvantages of all the other frameworks like Boltzmann machines and sparse coding and all the other approaches that have been really popular for years. I was looking for something that avoided all these disadvantages at the same time. And then finally, when I was arguing about generative models with my friends in a bar, something clicked into place, and I started telling them, "You need to do this, this, and this, and I swear it will work." And my friends didn't believe me that it would work. I was supposed to be writing the deep learning textbook at the time. I see. But I believed strongly enough that it would work that I went home and coded it up the same night and it worked. So it took you one evening to implement the first version of GANs? 
I implemented it around midnight after going home from the bar where my friend had his going-away party. I see. And the first version of it worked, which is very, very fortunate. I didn't have to search for hyperparameters or anything. There was a story, I read it somewhere, where you had a near-death experience and that reaffirmed your commitment to AI. Tell me that one. So, yeah. I wasn't actually near death but I briefly thought that I was. I had a very bad headache and some of the doctors thought that I might have a brain hemorrhage. And during the time that I was waiting for my MRI results to find out whether I had a brain hemorrhage or not, I realized that most of the thoughts I was having were about making sure that other people would eventually try out the research ideas that I had at the time. I see. I see. In retrospect, they're all pretty silly research ideas. I see. But at that point, I realized that this was actually one of my highest priorities in life, was carrying out my machine learning research work. I see. Yeah. That's great, that when you thought you might be dying soon, you're just thinking how to get the research done. Yeah. Yeah. That's commitment. Yeah. Yeah. Yeah. So today, you're still at the center of a lot of the activities with GANs, with Generative Adversarial Networks. So tell me how you see the future of GANs. Right now, GANs are used for a lot of different things, like semi-supervised learning, generating training data for other models and even simulating scientific experiments. In principle, all of these things could be done by other kinds of generative models. So I think that GANs are at an important crossroads right now. Right now, they work well some of the time, but it can be more of an art than a science to really bring that performance out of them. It's more or less how people felt about deep learning in general 10 years ago. 
And back then, we were using deep belief networks with Boltzmann machines as the building blocks, and they were very, very finicky. Over time, we switched to things like rectified linear units and batch normalization, and deep learning became a lot more reliable. If we can make GANs become as reliable as deep learning has become, then I think we'll keep seeing GANs used in all the places they're used today with much greater success. If we aren't able to figure out how to stabilize GANs, then I think their main contribution to the history of deep learning is that they will have shown people how to do all these tasks that involve generative modeling, and eventually, we'll replace them with other forms of generative models. So I spend maybe about 40 percent of my time right now working on stabilizing GANs. I see. Cool. Okay. Oh, and so just as a lot of people that joined deep learning about 10 years ago, such as yourself, wound up being pioneers, maybe the people that join GANs today, if it works out, could end up the early pioneers. Yeah. A lot of people already are early pioneers of GANs, and I think if you wanted to give any kind of history of GANs so far, you'd really need to mention other groups like Indico and Facebook and Berkeley for all the different things that they've done. So in addition to all your research, you also coauthored a book on deep learning. How is that going? That's right, with Yoshua Bengio and Aaron Courville, who are my Ph.D. co-advisers. We wrote the first textbook on the modern version of deep learning, and that has been very popular, both in the English edition and the Chinese edition. We've sold about, I think around 70,000 copies total between those two languages. And I've had a lot of feedback from students who said that they've learned a lot from it. One thing that we did a little bit differently than some other books is we start with a very focused introduction to the kind of math that you need to do in deep learning. 
I think one thing that I got from your courses at Stanford is that linear algebra and probability are very important, that people get excited about the machine learning algorithms, but if you want to be a really excellent practitioner, you've got to master the basic math that underlies the whole approach in the first place. So we make sure to give a very focused presentation of the math basics at the start of the book. That way, you don't need to go ahead and learn all that linear algebra, that you can get a very quick crash course in the pieces of linear algebra that are the most useful for deep learning. So even someone whose math is a little shaky or haven't seen the math for a few years will be able to start from the beginning of your book and get that background and get into deep learning? All of the facts that you would need to know are there. It would definitely take some focused effort to practice making use of them. Yeah. Yeah. Great. If someone's really afraid of math, it might be a bit of a painful experience. But if you're ready for the learning experience and you believe you can master it, I think all the tools that you need are there. As someone that worked in deep learning for a long time, I'd be curious, if you look back over the years. Tell me a bit about how you're thinking of AI and deep learning has evolved over the years. Ten years ago, I felt like, as a community, the biggest challenge in machine learning was just how to get it working for AI-related tasks at all. We had really good tools that we could use for simpler tasks, where we wanted to recognize patterns in how to extract features, where a human designer could do a lot of the work by creating those features and then hand it off to the computer. Now, that was really good for different things like predicting which ads a user would click on or different kinds of basic scientific analysis. 
But we really struggled to do anything involving millions of pixels in an image or a raw audio wave form where the system had to build all of its understanding from scratch. We finally got over the hurdle really thoroughly maybe five years ago. And now, we're at a point where there are so many different paths open that someone who wants to get involved in AI, maybe the hardest problem they face is choosing which path they want to go down. Do you want to make reinforcement learning work as well as supervised learning works? Do you want to make unsupervised learning work as well as supervised learning works? Do you want to make sure that machine learning algorithms are fair and don't reflect biases that we'd prefer to avoid? Do you want to make sure that the societal issues surrounding AI work out well, that we're able to make sure that AI benefits everyone rather than causing social upheaval and trouble with loss of jobs? I think right now, there's just really an amazing amount of different things that can be done, both to prevent downsides from AI but also to make sure that we leverage all of the upsides that it offers us. And so today, there are a lot of people wanting to get into AI. So, what advice would you have for someone like that? I think a lot of people that want to get into AI start thinking that they absolutely need to get a Ph.D. or some other kind of credential like that. I don't think that's actually a requirement anymore. One way that you could get a lot of attention is to write good code and put it on GitHub. If you have an interesting project that solves a problem that someone working at the top level wanted to solve, once they find your GitHub repository, they'll come find you and ask you to come work there. A lot of the people that I've hired or recruited at OpenAI last year or at Google this year, I first became interested in working with them because of something that I saw that they released in an open-source forum on the Internet. 
Writing papers and putting them on arXiv can also be good. A lot of the time, it's harder to reach the point where you have something polished enough to really be a new academic contribution to the scientific literature, but you can often get to the point of having a useful software product much earlier. So read your book, practice the materials and post on GitHub and maybe on arXiv. I think if you learned by reading the book, it's really important to also work on a project at the same time, to either choose some way of applying machine learning to an area that you are already interested in. Like if you're a field biologist and you want to get into deep learning, maybe you could use it to identify birds, or if you don't have an idea for how you'd like to use machine learning in your own life, you could pick something like making a Street View house numbers classifier, where all the data sets are set up to make it very straightforward for you. And that way, you get to exercise all of the basic skills while you read the book or while you watch Coursera videos that explain the concepts to you. So over the last couple of years, I've also seen you do some more work on adversarial examples. Tell us a bit about that. Yeah. I think adversarial examples are the beginning of a new field that I call machine learning security. In the past, we've seen computer security issues where attackers could fool a computer into running the wrong code. That's called application-level security. And there's been attacks where people can fool a computer into believing that messages on a network come from somebody that is not actually who they say they are. That's called network-level security. 
Now, we're starting to see that you can also fool machine-learning algorithms into doing things they shouldn't, even if the program running the machine-learning algorithm is running the correct code, even if the program running the machine-learning algorithm knows who all the messages on the network really came from. And I think, it's important to build security into a new technology near the start of its development. We found that it's very hard to build a working system first and then add security later. So I am really excited about the idea that if we dive in and start anticipating security problems with machine learning now, we can make sure that these algorithms are secure from the start instead of trying to patch it in retroactively years later. Thank you. That was great. There's a lot about your story that I thought was fascinating and that, despite having known you for years, I didn't actually know, so thank you for sharing all that. Oh, very welcome. Thank you for inviting me. It was a great shot. Okay. Thank you. Very welcome.
Deep Neural Networks
Understand the key computations underlying deep learning, use them to build and train deep neural networks, and apply them to computer vision.
Deep L-layer neural network - 5m
0:00
Welcome to the fourth week of this course. By now, you've seen forward propagation and back propagation in the context of a neural network with a single hidden layer, as well as logistic regression, and you've learned about vectorization and when it's important to initialize the weights randomly. If you've done the past couple weeks' homework, you've also implemented and seen some of these ideas work for yourself. So by now, you've actually seen most of the ideas you need to implement a deep neural network. What we're going to do this week is take those ideas and put them together so that you'll be able to implement your own deep neural network. Because this week's programming exercise is longer, it's just more work, I'm going to keep the videos for this week shorter so you can get through the videos a little bit more quickly, and then have more time to do a significant programming exercise at the end, which I hope will leave you having built a deep neural network that you feel proud of. So what is a deep neural network? You've seen this picture for logistic regression and you've also seen neural networks with a single hidden layer. So here's an example of a neural network with two hidden layers and a neural network with 5 hidden layers. We say that logistic regression is a very "shallow" model, whereas this model here is a much deeper model, and shallow versus deep is a matter of degree. So a neural network with a single hidden layer, this would be a 2 layer neural network. Remember when we count layers in a neural network, we don't count the input layer, we just count the hidden layers as well as the output layer. So this 2 layer neural network is still quite shallow, but not as shallow as logistic regression. 
Technically logistic regression is a one layer neural network, but over the last several years the AI, or the machine learning, community has realized that there are functions that very deep neural networks can learn that shallower models are often unable to. Although for any given problem, it might be hard to predict in advance exactly how deep a neural network you would want. So it would be reasonable to try logistic regression, try one and then two hidden layers, and view the number of hidden layers as another hyper parameter that you could try a variety of values of, and evaluate on hold-out cross validation data, or on your development set. We'll see more about that later as well. Let's now go through the notation we use to describe deep neural networks. Here is a one, two, three, four layer neural network,
2:40
With three hidden layers, and the number of units in these hidden layers are I guess 5, 5, 3, and then there's one output unit. So the notation we're going to use is capital L to denote the number of layers in the network. So in this case, L = 4, that's the number of layers, and we're going to use n superscript [l] to denote the number of nodes, or the number of units, in layer lowercase l. So if we index the input as layer "0", this is layer 1, this is layer 2, this is layer 3, and this is layer 4. Then we have that, for example, n[1], the number of units in the first hidden layer, will equal 5, because we have 5 hidden units there. For this one, we have that n[2], the number of units in the second hidden layer, is also equal to 5, n[3] = 3, and n[4] = n[L], the number of output units, is 1, because capital L is equal to four, and we're also going to have here that for the input layer n[0] = nx = 3. So that's the notation we use to describe the number of nodes we have in different layers. For each layer l, we're also going to use a[l] to denote the activations in layer l. So we'll see later that in forward propagation, you end up computing a[l] as the activation g(z[l]), and perhaps the activation function g is indexed by the layer l as well, and then we'll use W[l] to denote the weights for computing the value z[l] in layer l, and similarly, b[l] is used to compute z[l]. Finally, just to wrap up on the notation, the input features are called x, but x is also the activations of layer zero, so a[0] = x, and the activation of the final layer, a[L] = y-hat. So a[L] is equal to the predicted output, the prediction y-hat of the neural network. So you now know what a deep neural network looks like, as well as the notation we'll use to describe and to compute with deep networks. 
I know we've introduced a lot of notation in this video, but if you ever forget what some symbol means, we've also posted on the course website, a notation sheet or a notation guide, that you can use to look up what these different symbols mean. Next, I'd like to describe what forward propagation in this type of network looks like. Let's go into the next video.
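As a small illustration of the notation above, the layer sizes of the four-layer example can be written down as a plain list. This is only a sketch: the names `layer_dims` and `n` are assumptions made here for illustration, not names used in the course.

```python
# Layer sizes for the example network: n[0] = 3 inputs, hidden layers of
# 5, 5 and 3 units, and one output unit. `layer_dims` is a hypothetical name.
layer_dims = [3, 5, 5, 3, 1]

# L counts the hidden layers plus the output layer; layer 0 (the input) is not counted.
L = len(layer_dims) - 1

def n(l):
    """Number of units in layer l, written n[l] in the lecture."""
    return layer_dims[l]

print("L =", L)
for l in range(L + 1):
    print(f"n[{l}] = {n(l)}")
```

Indexing the list from zero lines up with treating the input as layer 0, so `n(L)` is the number of output units.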
Forward Propagation in a Deep Network - 7m
0:00
In the last video, we described what is a deep L-layer neural network and also talked about the notation we use to describe such networks. In this video, you see how you can perform forward propagation in a deep network. As usual, let's first go over what forward propagation will look like for a single training example x, and then later on we'll talk about the vectorized version, where you want to carry out forward propagation on the entire training set at the same time. But given a single training example x, here's how you compute the activations of the first layer. So for this first layer, you compute z1 equals w1 times x plus b1. So w1 and b1 are the parameters that affect the activations in layer one. This is layer one of the neural network, and then you compute the activations for that layer to be equal to g of z1. The activation function g depends on what layer you're at, so maybe we index it as the activation function of layer one. So if you do that, you've now computed the activations for layer one. How about layer two? Well, you would then compute z2 equals w2 a1 plus b2. So z for layer two is the weight matrix times the outputs of layer one, plus the bias vector for layer two. Then a2 equals the activation function applied to z2. Okay? So that's it for layer two, and so on and so forth, until you get to the output layer, that's layer four, where you would have that z4 is equal to the parameters for that layer times the activations from the previous layer, plus that bias vector. Then similarly, a4 equals g of z4. So, that's how you compute your estimated output, y hat. So, just one thing to notice, x here is also equal to a0, because the input feature vector x is also the activations of layer zero. So when I cross out x and put a0 here, then all of these equations basically look the same. The general rule is that zl is equal to wl times a of l minus 1 plus bl. 
And then, the activations for that layer are the activation function applied to the values of z. So, that's the general forward propagation equation. So, we've done all this for a single training example. How about doing it in a vectorized way for the whole training set at the same time? The equations look quite similar to before. For the first layer, you would have capital Z1 equals w1 times capital X plus b1. Then, A1 equals g of Z1. Bear in mind that X is equal to A0. These are just the training examples stacked in different columns. You could take this, let me scratch out X, and put A0 there. Then the next layer looks similar, Z2 equals w2 A1 plus b2 and A2 equals g of Z2. We're just taking these vectors z or a and so on, and stacking them up. This is the z vector for the first training example, the z vector for the second training example, and so on, down to the mth training example, stacking these in columns and calling this capital Z. Similarly for capital A, just as for capital X: all the training examples are column vectors stacked left to right. In this process, you end up with y hat which is equal to g of Z4, this is also equal to A4. That's the predictions on all of your training examples stacked horizontally. So just to summarize on notation, I'm going to modify this up here. Our notation allows us to replace lowercase z and a with their uppercase counterparts. That gives you the vectorized version of forward propagation that you carry out on the entire training set at a time, where A0 is X. Now, if you look at this implementation of vectorization, it looks like there is going to be a For loop here, for l equals 1 through 4, that is, for l equals 1 through capital L. Then you have to compute the activations for layer one, then layer two, then layer three, and then layer four. So, it seems that there is a For loop here. I know that when implementing neural networks, we usually want to get rid of explicit For loops. 
But this is one place where I don't think there's any way to implement this without an explicit For loop. So when implementing forward propagation, it is perfectly okay to have a For loop to compute the activations for layer one, then layer two, then layer three, then layer four. I don't know of, and I don't think there is, any way to do this without a For loop that goes from one to capital L, from one through the total number of layers in the neural network. So, in this place, it's perfectly okay to have an explicit For loop. So, that's it for the notation for deep neural networks, as well as how to do forward propagation in these networks. If the pieces we've seen so far look a little bit familiar to you, that's because what we're doing is taking a piece very similar to what you've seen in the neural network with a single hidden layer and just repeating that more times. Now, it turns out that when we implement a deep neural network, one of the ways to increase your odds of having a bug-free implementation is to think very systematically and carefully about the matrix dimensions you're working with. So, when I'm trying to debug my own code, I'll often pull out a piece of paper, and just think carefully through the dimensions of the matrices I'm working with. Let's see how you could do that in the next video.
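The vectorized forward pass described in this video, with its one explicit loop over layers, might look roughly like the following NumPy sketch. Several things here are assumptions made for illustration rather than the course's own code: the function names, the dictionary layout of the parameters, the choice of ReLU for hidden layers and sigmoid for the output layer, and the small random initialization.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def initialize_parameters(layer_dims, seed=0):
    """Hypothetical helper: W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward_propagation(X, params):
    """Compute A[L] = Y_hat for all m examples stacked as columns of X."""
    L = len(params) // 2           # one W and one b per layer
    A = X                          # A[0] = X
    for l in range(1, L + 1):      # the explicit loop over layers 1..L
        Z = params[f"W{l}"] @ A + params[f"b{l}"]    # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z) if l == L else relu(Z)        # A[l] = g[l](Z[l])
    return A

layer_dims = [3, 5, 5, 3, 1]       # the four-layer example network
params = initialize_parameters(layer_dims)
X = np.random.default_rng(1).standard_normal((3, 7))   # m = 7 made-up examples
Y_hat = forward_propagation(X, params)
print(Y_hat.shape)  # (1, 7)
```

The loop runs over layers only; within each layer, all m training examples are handled at once by the matrix product.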
Getting your matrix dimensions right - 11m
0:00
When implementing a deep neural network, one of the debugging tools I often use to check the correctness of my code is to pull out a piece of paper, and just work through the dimensions of the matrices I'm working with. So let me show you how to do that, since I hope this will make it easier for you to implement your deep nets as well. Capital L is equal to 5, right, counting quickly, not counting the input layer, there are five layers here, so four hidden layers and one output layer. And so if you implement forward propagation, the first step will be z1 = w1x + b1. So let's ignore the bias terms b for now, and focus on the parameters w. Now this first hidden layer has three hidden units, so this is layer 0, layer 1, layer 2, layer 3, layer 4, and layer 5. So using the notation we had from the previous video, we have that n1, which is the number of hidden units in layer 1, is equal to 3. And here we would have that n2 is equal to 5, n3 is equal to 4, n4 is equal to 2, and n5 is equal to 1. And so far we've only seen neural networks with a single output unit, but in later courses, we'll talk about neural networks with multiple output units as well. And finally, for the input layer, we also have n0 = nx = 2. So now, let's think about the dimensions of z, w, and x. z is the vector of activations for this first hidden layer, so z is going to be 3 by 1, it's going to be a 3-dimensional vector. So I'm going to write it as an n1 by 1-dimensional vector, or n1 by 1-dimensional matrix, all right, so 3 by 1 in this case. Now how about the input features x? We have two input features. So x is in this example 2 by 1, but more generally, it would be n0 by 1. So what we need is for the matrix w1 to be something that when we multiply it by an n0 by 1 vector, we get an n1 by 1 vector, right? So you have sort of a three dimensional vector equals something times a two dimensional vector. And so by the rules of matrix multiplication, this has got to be a 3 by 2 matrix. 
Right, because a 3 by 2 matrix times a 2 by 1 matrix, or times a 2 by 1 vector, gives you a 3 by 1 vector. And more generally, this is going to be an n1 by n0 dimensional matrix. So what we figured out here is that the dimensions of w1 have to be n1 by n0. And more generally, the dimensions of wl must be nl by n(l-1). So for example, the dimensions of w2, for this, it would have to be 5 by 3, or it would be n2 by n1. Because we're going to compute z2 as w2 times a1, and again, let's ignore the bias for now. And so this is going to be 3 by 1, and we need this to be 5 by 1, and so this had better be 5 by 3. And similarly, w3 is really the dimension of the next layer, comma, the dimension of the previous layer, so this is going to be 4 by 5, w4
4:22
is going to be 2 by 4, and w5 is going to be 1 by 2, okay? So the general formula to check is that when you're implementing the matrix for layer l, the dimension of that matrix should be nl by n(l-1). Now let's think about the dimension of this vector b. This is going to be a 3 by 1 vector, so you have to add that to another 3 by 1 vector in order to get a 3 by 1 vector as the output. Or in this example, we need to add this, this is going to be 5 by 1, so there's going to be another 5 by 1 vector, in order for the sum of these two things I have in the boxes to be itself a 5 by 1 vector. So the more general rule is that in the example on the left, b1 is n1 by 1, right, that's 3 by 1, and in the second example, this is n2 by 1. And so the more general case is that bl should be nl by 1 dimensional. So hopefully these two equations help you to double check that the dimensions of your matrices w, as well as your vectors b, are the correct dimensions. And of course, if you're implementing back propagation, then the dimensions of dw should be the same as the dimension of w. So dw should be the same dimension as w, and db should be the same dimension as b. Now the other key set of quantities whose dimensions to check are these z, x, as well as a of l, which we didn't talk too much about here. But because a of l is equal to g of z of l, applied element wise, then z and a should have the same dimension in these types of networks. Now let's see what happens when you have a vectorized implementation that looks at multiple examples at a time. Even for a vectorized implementation, of course, the dimensions of w, b, dw, and db will stay the same. But the dimensions of z, a, as well as x will change a bit in your vectorized implementation. So previously, we had z1 = w1x+b1 where this was n1 by 1, this was n1 by n0, x was n0 by 1, and b was n1 by 1. Now, in a vectorized implementation, you would have Z1 = w1X + b1. 
Where now Z1 is obtained by taking the z1 for the individual examples, so there's z1(1), z1(2), up to z1(m), and stacking them as columns, and this gives you Z1. So the dimension of Z1 is that, instead of being n1 by 1, it ends up being n1 by m, and m is the size of your training set. The dimensions of w1 stay the same, so it's still n1 by n0. And X, instead of being n0 by 1, is now all your training examples stacked horizontally. So it's now n0 by m, and so you notice that when you take an n1 by n0 matrix and multiply that by an n0 by m matrix, together they actually give you an n1 by m dimensional matrix, as expected. Now, the final detail is that b1 is still n1 by 1, but when you add b1 to this product, then through Python broadcasting, b1 will get duplicated into an n1 by m matrix, and then added element wise. So on the previous slide, we talked about the dimensions of w, b, dw, and db. Here, what we see is that whereas zl as well as al are of dimension nl by 1, we have now instead that Zl as well as Al are nl by m. And a special case of this is when l is equal to 0, in which case A0, which is equal to just your training set input features X, is going to be n0 by m as expected. And of course when you're implementing this in backpropagation, we'll see later, you end up computing dZ as well as dA. And so these will of course have the same dimension as Z and A. So I hope the little exercise we went through helps clarify the dimensions of the various matrices you'll be working with. When you implement backpropagation for a deep neural network, so long as you work through your code and make sure that all the matrices' dimensions are consistent, that will usually help, it'll go some ways toward eliminating some classes of possible bugs. So I hope that exercise for figuring out the dimensions of various matrices you'll be working with is helpful. 
When you implement a deep neural network, if you keep straight the dimensions of these various matrices and vectors you're working with, hopefully that will help you eliminate some classes of possible bugs, it certainly helps me get my code right. So next, we've now seen some of the mechanics of how to do forward propagation in a neural network. But why are deep neural networks so effective, and why do they do better than shallow representations? Let's spend a few minutes in the next video to discuss that.
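The paper-and-pencil dimension check from this video can also be written as a few assertions. The sketch below uses the layer sizes of the five-layer example (2, 3, 5, 4, 2, 1) with zero-filled arrays, since only the shapes matter here; the training set size `m = 10` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Layer sizes n[0]..n[5] from the example in this video.
layer_dims = [2, 3, 5, 4, 2, 1]
m = 10  # number of training examples; an arbitrary choice for this sketch

w_shapes = []
A = np.zeros((layer_dims[0], m))                      # A[0] = X has shape (n[0], m)
for l in range(1, len(layer_dims)):
    W = np.zeros((layer_dims[l], layer_dims[l - 1]))  # W[l]: (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                  # b[l]: (n[l], 1)
    Z = W @ A + b        # b is broadcast across the m columns
    assert Z.shape == (layer_dims[l], m)              # Z[l]: (n[l], m)
    w_shapes.append(W.shape)
    print(f"layer {l}: W {W.shape}, b {b.shape}, Z {Z.shape}")
    A = Z                # an element-wise activation would keep this shape
```

If a weight matrix were accidentally created with its two dimensions swapped, the matrix product or the assertion would fail at that layer, which is the kind of bug this check is meant to catch early.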
Why deep representations? - 10m
0:00
We've all been hearing that deep neural networks work really well for a lot of problems, and it's not just that they need to be big neural networks, it's that specifically, they need to be deep or to have a lot of hidden layers. So why is that? Let's go through a couple examples and try to gain some intuition for why deep networks might work well. So first, what is a deep network computing? If you're building a system for face recognition or face detection, here's what a deep neural network could be doing. Perhaps you input a picture of a face, then the first layer of the neural network you can think of as maybe being a feature detector or an edge detector. In this example, I'm plotting what a neural network with maybe 20 hidden units might be trying to compute on this image. So the 20 hidden units are visualized by these little square boxes. So for example, this little visualization represents a hidden unit that's trying to figure out where the edges of that orientation are in the image. And maybe this hidden unit might be trying to figure out where the horizontal edges are in this image. And when we talk about convolutional networks in a later course, this particular visualization will make a bit more sense. But for now, you can think of the first layer of the neural network as looking at the picture and trying to figure out where the edges are in this picture. It finds the edges in this picture by grouping together pixels to form edges. It can then take the detected edges and group edges together to form parts of faces. So for example, you might have one neuron trying to see if it's finding an eye, or a different neuron trying to find part of the nose. And so by putting together lots of edges, it can start to detect different parts of faces. And then, finally, by putting together different parts of faces, like an eye or a nose or an ear or a chin, it can then try to recognize or detect different types of faces. 
So intuitively, you can think of the earlier layers of the neural network as detecting simple functions, like edges, and then composing them together in the later layers of a neural network so that it can learn more and more complex functions. These visualizations will make more sense when we talk about convolutional nets. And one technical detail of this visualization: the edge detectors are looking in relatively small areas of an image, maybe very small regions like that. And then the facial detectors can look at maybe much larger areas of the image. But the main intuition you take away from this is just finding simple things like edges and then building them up, composing them together to detect more complex things like an eye or a nose, then composing those together to find even more complex things. And this type of simple to complex hierarchical representation, or compositional representation, applies to other types of data than images and face recognition as well. For example, if you're trying to build a speech recognition system, it's hard to visualize speech, but if you input an audio clip then maybe the first level of a neural network might learn to detect low level audio wave form features, such as: is this tone going up? Is it going down? Is it white noise or a sniffling sound like [SOUND]? And what is the pitch? So it can learn to detect low level wave form features like that. And then by composing low level wave forms, maybe it'll learn to detect basic units of sound, which in linguistics are called phonemes. For example, in the word cat, the C is a phoneme, the A is a phoneme, the T is another phoneme. So it learns to find maybe the basic units of sound, and then by composing those together maybe learns to recognize words in the audio. And then maybe composes those together, in order to recognize entire phrases or sentences. 
So a deep neural network with multiple hidden layers might be able to have the earlier layers learn these lower level simple features and then have the later, deeper layers put together the simpler things it's detected in order to detect more complex things, like recognizing specific words or even the phrases or sentences being uttered, in order to carry out speech recognition. And what we see is that whereas the earlier layers are computing what seem like relatively simple functions of the input, such as where the edges are, by the time you get deep in the network you can actually do surprisingly complex things, such as detect faces or detect words or phrases or sentences. Some people like to make an analogy between deep neural networks and the human brain, where we believe, or neuroscientists believe, that the human brain also starts off detecting simple things like edges in what your eyes see, then builds those up to detect more complex things like the faces that you see. I think analogies between deep learning and the human brain are sometimes a little bit dangerous. But there is a lot of truth to this being how we think the human brain works, and that the human brain probably detects simple things like edges first then puts them together to form more and more complex objects, and so that has served as a loose form of inspiration for some deep learning as well. We'll see a bit more about the human brain or about the biological brain in a later video this week.
5:35
The other piece of intuition about why deep networks seem to work well is the following. So this result comes from circuit theory, which pertains to thinking about what types of functions you can compute with different AND gates, OR gates, NOT gates, basically logic gates. So informally, there are functions you can compute with a relatively small but deep neural network, and by small I mean the number of hidden units is relatively small. But if you try to compute the same function with a shallow network, so if there aren't enough hidden layers, then you might require exponentially more hidden units to compute it. So let me just give you one example and illustrate this a bit informally. Let's say you're trying to compute the exclusive OR, or the parity, of all your input features. So you're trying to compute X1 XOR X2 XOR X3 XOR up to Xn, if you have n input features. So if you build an XOR tree like this, so first it computes the XOR of X1 and X2, then takes X3 and X4 and computes their XOR. And technically, if you're just using AND, OR, and NOT gates, you might need a couple layers to compute the XOR function rather than just one layer, but with a relatively small circuit, you can compute the XOR, and so on. And then you can build, really, an XOR tree like so, until eventually, you have a circuit here that outputs, well, let's call this Y. The output, Y hat, equals Y, the exclusive OR, the parity of all these input bits. So to compute XOR, the depth of the network will be on the order of log n. We'll just have an XOR tree. So the number of nodes or the number of circuit components or the number of gates in this network is not that large. You don't need that many gates in order to compute the exclusive OR. But now, if you are not allowed to use a neural network with multiple hidden layers, with, in this case, order log n hidden layers, if you're forced to compute this function with just one hidden layer, so you have all these things going into the hidden units. 
And then these things then output Y. Then in order to compute this XOR function, this hidden layer will need to be exponentially large, because essentially, you need to exhaustively enumerate all 2 to the N possible configurations. So on the order of 2 to the N possible configurations of the input bits that result in the exclusive OR being either 1 or 0. So you end up needing a hidden layer that is exponentially large in the number of bits. I think technically, you could do this with 2 to the N minus 1 hidden units. But that's still order 2 to the N, so it's going to be exponentially large in the number of bits. So I hope this gives a sense that there are mathematical functions that are much easier to compute with deep networks than with shallow networks. Actually, I personally found the result from circuit theory less useful for gaining intuitions, but this is one of the results that people often cite when explaining the value of having very deep representations. Now, in addition to these reasons for preferring deep neural networks, to be perfectly honest, I think the other reason the term deep learning has taken off is just branding. These things we just used to call neural networks with a lot of hidden layers, but the phrase deep learning is just a great brand, it's just so deep. So I think that once that term caught on, neural networks rebranded, or neural networks with many hidden layers rebranded, helped to capture the popular imagination as well. But regardless of the PR branding, deep networks do work well. Sometimes people go overboard and insist on using tons of hidden layers. But when I'm starting out on a new problem, I'll often really start out with even logistic regression, then try something with one or two hidden layers and use that as a hyper parameter. Use that as a parameter or hyper parameter that you tune in order to try to find the right depth for your neural network. 
But over the last several years there has been a trend toward people finding that for some applications, very, very deep neural networks, with maybe many dozens of layers, can sometimes be the best model for a problem. So that's it for the intuitions for why deep learning seems to work well. Let's now take a look at the mechanics of how to implement not just forward propagation, but also back propagation.
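The parity example from the circuit-theory argument can be made concrete with a few lines of plain Python. This is only an illustration of the counting, not a neural network: `xor_tree` pairs up bits level by level and reports the depth and gate count, while `odd_parity_patterns` counts the input patterns a brute-force enumeration would have to single out, under the simplifying assumption of one unit per pattern. Both function names are made up for this sketch.

```python
from itertools import product

def xor_tree(bits):
    """Parity via a pairwise XOR tree; returns (parity, depth, number of XOR gates)."""
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])   # one XOR gate per pair
            gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                 # an unpaired bit is carried up unchanged
        level = nxt
        depth += 1
    return level[0], depth, gates

def odd_parity_patterns(n):
    """Count the n-bit inputs whose parity is 1 by enumerating all 2**n patterns."""
    return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

parity, depth, gates = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(parity, depth, gates)      # 0 3 7
print(odd_parity_patterns(8))    # 128
```

For 8 inputs the tree needs 7 gates arranged in 3 levels, while the enumeration has to cover 128 patterns; doubling the number of inputs adds one level to the tree but squares the number of patterns.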
Building blocks of deep neural networks - 8m
0:00
In the earlier videos from this week, as well as from the videos from the past several weeks, you've already seen the basic building blocks of forward propagation and back propagation, the key components you need to implement a deep neural network. Let's see how you can put these components together to build your deep net.
0:18
Here's a network of a few layers. Let's pick one layer.
0:22
And look into the computations focusing on just that layer for now. So for layer l, you have some parameters wl and
0:33
bl and for the forward prop, you will input the activations a(l-1) from your previous layer and output al. So the way we did this previously was you compute zl = wl times a(l-1) + bl. And then al = g of zl. All right. So, that's how you go from the input a(l-1) to the output al. And, it turns out that for later use it'll be useful to also cache the value zl. So, let me include this in the cache as well, because storing the value zl would be useful for the backward, for the back propagation step later. And then for the backward step or for the back propagation step, again, focusing on computation for this layer l, you're going to implement a function that inputs da(l).
1:45
And outputs da(l-1), and just to flesh out the details, the input is actually da(l), as well as the cache, so you have available to you the value of zl that you computed, and then in addition to outputting da(l-1), you also output the gradients you want in order to implement gradient descent for learning, okay? So this is the basic structure of how you implement this forward step, what we call the forward function, as well as this backward step, which we'll call the backward function. So just to summarize, in layer l, you're going to have the forward step or the forward prop or the forward function. It inputs a(l-1) and outputs al, and in order to make this computation you need to use wl and bl. And it also outputs a cache, which contains zl, right? And then the backward function, using the back prop step, will be another function that now inputs da(l) and outputs da(l-1). So it tells you, given the derivatives with respect to these activations, that's da(l), what are the derivatives with respect to the activations from the previous layer, that's da(l-1)? Within this box, right, you need to use wl and bl, and it turns out along the way you end up computing dzl, and then this box, this backward function, can also output dwl and dbl. I sometimes use red arrows to denote the backward iteration. So if you prefer, we could draw these arrows in red.
3:51
So if you can implement these two functions then the basic computation of the neural network will be as follows. You're going to take the input features a0, feed that in, and that would compute the activations of the first layer, let's call that a1 and to do that, you need a w1 and b1 and then will also, you know, cache away z1, right?
4:21
Now having done that, you feed that to the second layer and then using w2 and b2, you're going to compute the activations in the next layer, a2, and so on. Until eventually, you end up outputting a capital L, which is equal to y hat. And along the way, we cached all of these values z.
4:52
So that's the forward propagation step. Now, for the back propagation step, what we're going to do will be a backward sequence of iterations
5:05
in which you are going backwards and computing gradients like so.
5:12
So what you're going to feed in here is da(L), and then this box will give us da(L-1), and so on until we get da(2), da(1). You could actually get one more output to compute da(0), but this is the derivative with respect to your input features, which is not useful, at least for training the weights of these supervised neural networks. So you could just stop it there. But along the way, back prop also ends up outputting dwl, dbl. These boxes use the parameters wl and bl. This would output dw3, db3 and so on.
6:10
So you end up computing all the derivatives you need.
6:16
And so just to maybe fill in the structure of this a little bit more, these boxes will use those parameters as well.
6:26
W[l] and b[l], and it turns out, as we'll see later, that inside these boxes we end up computing the dz's as well. So one iteration of training for a neural network involves starting with a[0], which is x, and going through forward prop as follows, computing y hat, and then using that to compute this, and then back prop, right, doing that. Now you have all these derivative terms, and so W[l] gets updated as W[l] minus the learning rate times dW[l], right, for each of the layers, and similarly for b. Back prop has now computed all these derivatives. So that's one iteration of gradient descent for your neural network. Now before moving on, just one more implementational detail. Conceptually, it will be useful to think of the cache here as storing the value of z for the backward functions. But when you implement this, and you'll see this in the programming exercise, you find that the cache may be a convenient way to get the values of the parameters W[l] and b[l] into the backward function as well. So for this exercise, you actually store in your cache z as well as W and b. So this stores z[2], W[2], b[2]. From an implementation standpoint, I just find it a convenient way to get the parameters copied to where you need to use them later, when you're computing back propagation. So that's just an implementational detail that you'll see when you do the programming exercise. So you've now seen the basic building blocks for implementing a deep neural network. In each layer there's a forward propagation step and there's a corresponding backward propagation step, and there is a cache to pass information from one to the other. In the next video, we'll talk about how you can actually implement these building blocks. Let's go on to the next video.
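The update at the end of one iteration can be sketched in a few lines. This is only a minimal illustration: update_parameters, and the layout of params and grads as lists of per-layer pairs, are hypothetical choices, and the numbers are made up.

```python
import numpy as np

def update_parameters(params, grads, learning_rate):
    """One gradient descent step for every layer:
    W[l] := W[l] - alpha * dW[l] and b[l] := b[l] - alpha * db[l].
    """
    return [(W - learning_rate * dW, b - learning_rate * db)
            for (W, b), (dW, db) in zip(params, grads)]

# A single layer with made-up parameters and gradients.
params = [(np.array([[1.0, 2.0]]), np.array([[0.5]]))]
grads = [(np.array([[0.1, -0.2]]), np.array([[1.0]]))]
new_params = update_parameters(params, grads, learning_rate=0.1)
print(new_params[0][0], new_params[0][1])
```

Here the gradients would come from the backward functions, so one full iteration is forward prop, then back prop, then this update.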
Forward and Backward Propagation - 10m
0:00
In the previous video, you saw the basic blocks of implementing a deep neural network: a forward propagation step for each layer, and a corresponding backward propagation step. Let's see how you can actually implement these steps. We'll start with forward propagation. Recall that what this will do is input a[l-1] and output a[l] and the cache z[l]. And we just said that, from an implementational point of view, maybe we'll cache W[l] and b[l] as well, just to make the function calls a bit easier in the programming exercise. And so, the equations for this should already look familiar. The way to implement a forward function is just z[l] = W[l] a[l-1] + b[l], and then a[l] equals the activation function g[l] applied to z[l]. And if you want a vectorized implementation, then it's just Z[l] = W[l] A[l-1] + b[l], with the addition of b[l] using Python broadcasting, and A[l] = g[l] applied element-wise to Z[l]. And you remember, on the diagram for the forward step, we had this chain of boxes going forward, so you initialize that by feeding in a[0], which is equal to X. Really, what is the input to the first one, right? It's a[0], the input features for one training example if you're doing one example at a time, or A[0], the entire training set, if you are processing the entire training set. So that's the input to the first forward function in the chain, and then just repeating this allows you to compute forward propagation from left to right. Next, let's talk about the backward propagation step. Here, your goal is to input da[l], and output da[l-1], dW[l], and db[l]. Let me just write out the steps you need to compute these things: dz[l] = da[l], element-wise product with g[l]'(z[l]). And then, to compute the derivatives, dW[l] = dz[l] a[l-1]_transpose. I didn't explicitly put a[l-1] in the cache, but it turns out you need this as well. And then db[l] = dz[l], and finally da[l-1] = W[l]_transpose dz[l], okay?
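To make the vectorized forward step from the first half of this video concrete, here is a small sketch for a single layer. The function name layer_forward and the example numbers are made up for illustration, and storing A_prev, W and b in the cache next to Z is one possible convention, following the remark that the backward step needs them too.

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    """Vectorized forward step for one layer.

    A_prev: activations from the previous layer, shape (n_prev, m)
    W: shape (n_l, n_prev); b: shape (n_l, 1); g: element-wise activation.
    """
    Z = W @ A_prev + b           # b is broadcast across the m example columns
    A = g(Z)
    cache = (A_prev, Z, W, b)    # everything the backward step will need
    return A, cache

W = np.array([[1.0, -1.0],
              [0.5, 0.5],
              [2.0, 0.0]])                      # 3 units, 2 inputs
b = np.array([[0.0], [1.0], [-1.0]])            # one bias per unit
A_prev = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.0, 1.0, 5.0, 2.0]])       # 2 inputs, 4 examples
A, cache = layer_forward(A_prev, W, b, g=lambda z: np.maximum(0.0, z))
print(A)
```

Note that b has a single column while Z has one column per example. Broadcasting adds the same bias vector to every column, which is why no explicit loop over examples is needed.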
And I don't want to go through the detailed derivation for this, but it turns out that if you take this definition for da and plug it in here, then you get the same formula as we had in the previous week for how you compute dz[l] as a function of dz[l+1]. In fact, if I just plug that in here, you end up with dz[l] = W[l+1]_transpose dz[l+1], element-wise product with g[l]'(z[l]). I know this looks like a lot of algebra. You can actually double check for yourself that this is the equation we wrote down for back propagation last week, when we were doing a neural network with just a single hidden layer. And as a reminder, this times is an element-wise product, and so all you need is those four equations to implement your backward function. And then finally, I'll just write out a vectorized version. So the first line becomes dZ[l] = dA[l], element-wise product with g[l]'(Z[l]). Maybe no surprise there. dW[l] becomes 1/m times dZ[l] A[l-1]_transpose, and then db[l] becomes 1/m times np.sum(dZ[l], axis=1, keepdims=True). We talked about the use of np.sum in the previous week to compute db. And then finally, dA[l-1] is W[l]_transpose dZ[l]. So this allows you to input this quantity, dA[l], over here, and output dW[l] and db[l], the derivatives you need, as well as dA[l-1], right? As follows. So that's how you implement the backward function.
So just to summarize, take the input X. You might have the first layer, which maybe has a ReLU activation function. Then go to the second layer, which maybe uses another ReLU activation function, then go to the third layer, which maybe has a sigmoid activation function if you're doing binary classification, and this outputs y_hat. And then, using y_hat, you can compute the loss, and this allows you to start your backward iteration. I'll draw the arrows first, okay? So I don't have to change pens too much. You will then have back-prop compute the derivatives, dW[3], db[3], dW[2], db[2], dW[1], db[1], and along the way the cache would transfer z[1], z[2], z[3], and here you pass backward da[2] and da[1]. This could compute da[0], but we won't use that, so you can just discard that, right? And so, this is how you implement forward-prop and back-prop for a three layer neural network.
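The four vectorized backward equations can be sketched as a single function. This is a minimal sketch under some assumptions: the name layer_backward and the cache layout (A_prev, Z, W, b) are hypothetical, and the check at the end uses a toy cost (the average over examples of the summed tanh activations) chosen only because it is smooth and easy to differentiate numerically.

```python
import numpy as np

def layer_backward(dA, cache, g_prime):
    """Vectorized backward step for one layer.

    Takes dA[l] and the cache, returns dA[l-1], dW[l], db[l].
    """
    A_prev, Z, W, b = cache
    m = A_prev.shape[1]                           # number of training examples
    dZ = dA * g_prime(Z)                          # element-wise product
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(1)
A_prev = rng.standard_normal((3, 5))
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
Z = W @ A_prev + b

def cost(W_):
    # Toy cost: average over the m examples of the summed tanh activations.
    return np.sum(np.tanh(W_ @ A_prev + b)) / A_prev.shape[1]

dA = np.ones_like(Z)      # derivative of the toy per-example loss w.r.t. A
dA_prev, dW, db = layer_backward(
    dA, (A_prev, Z, W, b), g_prime=lambda z: 1.0 - np.tanh(z) ** 2)

# Compare one entry of dW against a centered finite difference.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (cost(W_plus) - cost(W_minus)) / (2 * eps)
print(abs(numeric - dW[0, 0]))
```

The printed difference should be tiny, which is a quick way to convince yourself that the formula for dW matches the calculus.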
Now, there's just one last detail that I didn't talk about, which is that for the forward recursion, we initialize it with the input data X. How about the backward recursion? Well, it turns out that da[L], when you're using logistic regression, when you're doing binary classification, is equal to -y/a + (1-y)/(1-a). So it turns out that the derivative of the loss function with respect to the output, with respect to y_hat, can be shown to be this. If you're familiar with calculus, if you look up the loss function L and take derivatives with respect to y_hat, or with respect to a, you can show that you get that formula. So this is the formula you should use for da for the final layer, capital L. And of course, if you were to have a vectorized implementation, then you initialize the backward recursion not with this but with dA[L], which is going to be the same thing for the different examples: -y(1)/a(1) + (1-y(1))/(1-a(1)) for the first training example, and so on, down to the m-th training example, -y(m)/a(m) + (1-y(m))/(1-a(m)). So that's how you initialize the vectorized version of back propagation. So you've now seen the basic building blocks of both forward propagation and back propagation. Now, if you implement these equations, you will get a correct implementation of forward-prop and back-prop to get you the derivatives you need. You might be thinking, well, that was a lot of equations, I'm slightly confused, I'm not quite sure I see how this works. And if you're feeling that way, my advice is, when you get to this week's programming assignment, you will be able to implement these for yourself, and they will be much more concrete.
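Here is a small sketch of how the backward recursion could be initialized for binary classification. The function name cross_entropy_dAL and the example values are made up for illustration. The last lines check a standard fact that is not spelled out in this video: with a sigmoid output unit, multiplying dA[L] by the sigmoid derivative a(1-a) gives a - y.

```python
import numpy as np

def cross_entropy_dAL(AL, Y):
    """Derivative of the logistic loss with respect to the output activations.

    Computed element-wise: one column per training example.
    """
    return -(Y / AL) + (1.0 - Y) / (1.0 - AL)

Y = np.array([[1.0, 0.0, 1.0]])       # labels for three examples
AL = np.array([[0.8, 0.25, 0.5]])     # predicted probabilities y_hat
dAL = cross_entropy_dAL(AL, Y)
print(dAL)

# With a sigmoid output layer, g'(z) = a * (1 - a), so dZ[L] simplifies to a - y.
dZL = dAL * AL * (1.0 - AL)
print(dZL)
```

In practice AL must stay strictly between 0 and 1 for this formula, otherwise the divisions blow up.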
And I know there were a lot of equations, and maybe some equations didn't make complete sense. You can work through the calculus and the linear algebra, which is not easy, so feel free to try, but that's actually one of the more difficult derivations in machine learning. It turns out the equations we wrote down are just the calculus equations for computing the derivatives, especially in back-prop. But once again, if this feels a little bit abstract, a little bit mysterious to you, my advice is, when you've done the programming exercise, it will feel a bit more concrete to you. Although I have to say, even today, when I implement a learning algorithm, sometimes even I'm surprised when my learning algorithm implementation works, and it's because a lot of the complexity of machine learning comes from the data rather than from the lines of code. So sometimes you feel like you implement a few lines of code, you're not quite sure what it did, but it almost magically works, because a lot of the magic is actually not in the piece of code you write, which is often not too long. It's not exactly simple, but it's not ten thousand or a hundred thousand lines of code. But you're feeding it so much data that, even though I've worked in machine learning for a long time, it still surprises me a bit when my learning algorithm works, because a lot of the complexity of your learning algorithm comes from the data rather than necessarily from your writing thousands and thousands of lines of code. All right. So, that's how you implement deep neural networks. And again, this will become more concrete when you've done the programming exercise. Before moving on, in the next video, I want to discuss hyperparameters and parameters. It turns out that when you're training deep nets, being able to organize your hyperparameters well will help you be more efficient in developing your networks. In the next video, let's talk about exactly what that means.
Parameters vs Hyperparameters - 7m
0:00
Being effective in developing your deep neural nets requires that you not only organize your parameters well but also your hyperparameters. So what are hyperparameters? Let's take a look. The parameters of your model are W and b, and there are other things you need to tell your learning algorithm, such as the learning rate alpha, because we need to set alpha, and that in turn will determine how your parameters evolve. Or maybe the number of iterations of gradient descent you carry out. Your learning algorithm has other numbers that you need to set, such as the number of hidden layers, which we call capital L, or the number of hidden units, such as n[1], n[2], and so on. And then you also have the choice of activation function: do you want to use a ReLU, or tanh, or a sigmoid function, especially in the hidden layers? All of these things are things that you need to tell your learning algorithm, and so these are parameters that control the ultimate parameters W and b. So we call all of these things hyperparameters, because things like alpha, the learning rate, the number of iterations, the number of hidden layers, and so on are all parameters that control W and b. It is the hyperparameters that somehow determine the final value of the parameters W and b that you end up with. In fact, deep learning has a lot of different hyperparameters. In a later course, we'll see other hyperparameters as well, such as the momentum term, the mini-batch size, various forms of regularization parameters, and so on. And if none of these terms at the bottom make sense yet, don't worry about it. We'll talk about them in the second course.
Because deep learning has so many hyperparameters, in contrast to earlier eras of machine learning, I'm going to try to be very consistent in calling the learning rate alpha a hyperparameter rather than calling it a parameter. I think in earlier eras of machine learning, when we didn't have so many hyperparameters, most of us used to be a bit sloppier and just call alpha a parameter. And technically alpha is a parameter, but it is a parameter that determines the real parameters. I'll try to be consistent in calling things like alpha, the number of iterations, and so on hyperparameters.
So when you're training a deep net for your own application, you find that there may be a lot of possible settings for the hyperparameters that you need to just try out. Applied deep learning today is a very empirical process, where often you might have an idea. For example, you might have an idea for the best value for the learning rate. You might say, well, maybe alpha equals 0.01, I want to try that. Then you implement it, try it out, and see how that works. And then based on that outcome you might say, you know what, I've changed my mind, I want to increase the learning rate to 0.05. So if you're not sure what's the best value for the learning rate to use, you might try one value of the learning rate alpha and see the cost function J go down like this. Then you might try a larger value for the learning rate alpha and see the cost function blow up and diverge. Then you might try another version and see it go down really fast but converge to a higher value. You might try another version and see the cost function J do that. Then, after trying a set of values, you might say, okay, it looks like this value of alpha gives me pretty fast learning and allows me to converge to a lower cost function J, so I'm going to use this value of alpha. You saw in the previous slide that there are a lot of different hyperparameters, and it turns out that when you're starting on a new application, I find it very difficult to know in advance exactly what's the best value of the hyperparameters. So what often happens is you just have to try out many different values and go around this cycle: try some values, say five hidden layers with this many hidden units, implement that, see if it works, and then iterate. So the title of this slide is that applied deep learning is a very empirical process, and empirical process is maybe a fancy way of saying you just have to try a lot of things and see what works.
Another effect I've seen is that deep learning today is applied to so many problems, ranging from computer vision, to speech recognition, to natural language processing, to a lot of structured data applications such as online advertising, or web search, or product recommendations, and so on. And what I've seen is that, first, researchers from one discipline, any one of these, try to go to a different one, and sometimes the intuitions about hyperparameters carry over and sometimes they don't. So I often advise people, especially when starting on a new problem, to just try out a range of values and see what works. In the next course, we'll see some systematic ways for trying out a range of values. And second, even if you're working on one application for a long time, you know, maybe you're working on online advertising, as you make progress on the problem it is quite possible that the best value for the learning rate, the number of hidden units, and so on might change. So even if you tune your system to the best value of the hyperparameters today, it's possible you'll find that the best value might change a year from now, maybe because the computer infrastructure, be it the CPUs or the type of GPU you're running on or something, has changed. So maybe one rule of thumb is, every now and then, maybe every few months, if you're working on a problem for an extended period of time, for many years, just try a few values for the hyperparameters and double check if there's a better value. And as you do so, you slowly gain intuition as well about the hyperparameters that work best for your problems.
And I know that this might seem like an unsatisfying part of deep learning, that you just have to try out all the values for these hyperparameters, but maybe this is one area where deep learning research is still advancing, and maybe over time we'll be able to give better guidance for the best hyperparameters to use. But it's also possible that, because CPUs and GPUs and networks and datasets are all changing, the guidance won't converge for some time, and you just need to keep trying out different values, evaluate them on a hold-out cross-validation set or something, and pick the value that works for your problems. So that was a brief discussion of hyperparameters. In the second course, we'll also give some suggestions for how to systematically explore the space of hyperparameters. But by now you actually have pretty much all the tools you need to do the programming exercise. Before you do that, I just want to share with you one more set of ideas, which is that I'm often asked, what does deep learning have to do with the human brain?
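The try-a-few-learning-rates cycle described in this video can be sketched on a toy problem. Everything here is an assumption made for illustration: the cost J(w) = w**2 stands in for a real network's cost function, and the function name, starting point, iteration count and candidate values of alpha are made up.

```python
def run_gradient_descent(alpha, w0=5.0, num_iterations=50):
    """Minimize the toy cost J(w) = w**2 and return the cost after each step."""
    w = w0
    costs = []
    for _ in range(num_iterations):
        dw = 2.0 * w            # dJ/dw for J(w) = w**2
        w = w - alpha * dw      # gradient descent update
        costs.append(w ** 2)
    return costs

# Try a range of learning rates and compare where the cost ends up.
candidate_alphas = [0.001, 0.01, 0.1, 1.1]
final_cost = {alpha: run_gradient_descent(alpha)[-1] for alpha in candidate_alphas}
best_alpha = min(final_cost, key=final_cost.get)

for alpha in candidate_alphas:
    print(f"alpha={alpha}: final J = {final_cost[alpha]:.3g}")
print("best alpha:", best_alpha)
```

On this toy cost, the smallest learning rates go down slowly, a moderate one converges quickly, and the largest one makes the cost blow up, which mirrors the curves sketched on the slide. For a real network you would compare the candidates on a hold-out set rather than on the training cost alone.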
What does this have to do with the brain? - 3m
0:00
So, what does deep learning have to do with the brain? At the risk of giving away the punchline, I would say not a whole lot. But let's take a quick look at why people keep making the analogy between deep learning and the human brain. When you implement a neural network, this is what you do, forward prop and back prop. I think because it's been difficult to convey intuitions about what these equations are doing, which is really gradient descent on a very complex function, the analogy that it is like the brain has become really an oversimplified explanation for what this is doing. But the simplicity of this makes it seductive for people to just say it publicly, as well as for media to report it, and it certainly caught the popular imagination. There is a very loose analogy between, let's say, a logistic regression unit with a sigmoid activation function, and here's a cartoon of a single neuron in the brain. In this picture of a biological neuron, this neuron, which is a cell in your brain, receives electric signals from other neurons, X_1, X_2, X_3, or maybe from other neurons A_1, A_2, A_3, does a simple thresholding computation, and then if this neuron fires, it sends a pulse of electricity down the axon, down this long wire, perhaps to other neurons. So, there is a very simplistic analogy between a single neuron in a neural network and a biological neuron like that shown on the right, but I think that today even neuroscientists have almost no idea what even a single neuron is doing. A single neuron appears to be much more complex than we are able to characterize with neuroscience, and while some of what it is doing is a little bit like logistic regression, there's still a lot about what even a single neuron does that no human today understands. For example, exactly how neurons in the human brain learn is still a very mysterious process.
It's completely unclear today whether the human brain uses an algorithm that does anything like back propagation or gradient descent, or if there's some fundamentally different learning principle that the human brain uses. So, when I think of deep learning, I think of it as being very good at learning very flexible functions, very complex functions, to learn X to Y mappings, to learn input-output mappings in supervised learning. Whereas the analogy that this is like the brain, maybe that was useful once. I think the field has moved to the point where that analogy is breaking down, and I tend not to use that analogy much anymore. So, that's it for neural networks and the brain. I do think that maybe the field of computer vision has taken a bit more inspiration from the human brain than other disciplines that also apply deep learning, but I personally use the analogy to the human brain less than I used to. So, that's it for this video. You now know how to implement forward prop and back prop and gradient descent, even for deep neural networks. Best of luck with the programming exercise, and I look forward to sharing more of these ideas with you in the second course.