In the last chapter, we learned how to design trading strategies, create trading signals, and implement advanced concepts, such as seasonality in trading instruments. Understanding those concepts in greater detail is a vast field comprising stochastic processes, random walks, martingales, and time series analysis, which we leave to you to explore at your own pace.
So what's next? Let's look at an even more advanced method of prediction and forecasting: statistical inference and prediction. This is known as machine learning, the fundamentals of which were developed in the 1800s and early 1900s and have been worked on ever since. Recently, there has been a resurgence in interest in machine learning algorithms and applications owing to the availability of extremely cost-effective processing power and the easy availability of large datasets. Understanding machine learning techniques in great detail is a massive field at the intersection of linear algebra, multivariate calculus, probability theory, frequentist and Bayesian statistics, and an in-depth analysis of machine learning is beyond the scope of a single book. Machine learning methods, however, are surprisingly easily accessible in Python and quite intuitive to understand, so we will explain the intuition behind the methods and see how they find applications in algorithmic trading. But first, let's introduce some basic concepts and notation that we will need for the rest of this chapter.
This chapter will cover the following topics:
To develop ideas quickly and build an intuition regarding supply and demand, we have a simple and completely hypothetical dataset of height, weight, and race of a few random samples obtained from a survey. Let's have a look at the dataset:
Let's examine the individual fields:
Now, given this dataset, say our task is to build a mathematical model that can learn from the data we provide it with. The task or objective we are trying to learn in this example is to find the relationship between the weight of a person as it relates to their height and race. Intuitively, it should be obvious that height will have a major role to play (taller people are much more likely to be heavier), and race should have very little impact. Race may have some impact on the height of an individual, but once the height is known, knowing their race also provides very little additional information in guessing/predicting a person's weight. In this particular problem, note that in the dataset, we are also provided the weight of the samples in addition to their height and race.
Since the variable we are trying to learn how to predict is known, this is known as a supervised learning problem. If, on the other hand, we were not provided with the weight variable and were asked to predict whether, based on height and race, someone is more likely to be heavier than someone else, that would be an unsupervised learning problem. For the scope of this chapter, we will focus on supervised learning problems only, since that is the most typical use case of machine learning in algorithmic trading.
Another thing to address in this example is the fact that, in this case, we are trying to predict weight as a function of height and race. So we are trying to predict a continuous variable. This is known as a regression problem, since the output of such a model is a continuous value. If, on the other hand, say our task was to predict the race of a person as a function of their height and weight, in that case, we would be trying to predict a categorical variable type. This is known as a classification problem, since the output of such a model will be one value from a set of finite discrete values.
When we start addressing this problem, we will begin with a dataset that is already available to us and will train our model of choice on this dataset. This process (as you've already guessed) is known as training your model. We will use the data provided to us to guess the parameters of the learning model of our choice (we will elaborate more on what this means later). This is known as statistical inference of these parametric learning models. There are also non-parametric learning models, where we try to remember the data we've seen so far to make a guess as regards new data.
Once we are done training our model, we will use it to predict weight for datasets we haven't seen yet. Obviously, this is the part we are interested in. Based on data in the future that we haven't seen yet, can we predict the weight? This is known as testing your model and the datasets used for that are known as test data. The task of using a model where the parameters were learned by statistical inference to actually make predictions on previously unseen data is known as statistical prediction or forecasting.
We need to understand how to differentiate between a good model and a bad model. There are several well-known and well-understood performance metrics for different models. For regression prediction problems, we should try to minimize the differences between the predicted value and the actual value of the target variable. These differences are known as residual errors; larger errors mean worse models and, in regression, we try to minimize the sum of the absolute values of these residual errors, or the sum of the squares of these residual errors (squaring has the effect of penalizing large outliers more strongly, but more on that later). The most common metric for regression problems is R^2, which measures the proportion of the variance in the target variable that the model explains, but we save that for more advanced texts.
In the simple hypothetical prediction problem of guessing weight based on height and race, let's say the model predicts the weight to be 170 and the actual weight is 160. In this case, the error is 160 - 170 = -10, the absolute error is |-10| = 10, and the squared error is (-10)^2 = 100. In classification problems, we want to make sure our predictions are the same discrete value as the actual value. When we predict a label that is different from the actual label, that is a misclassification or error. Obviously, the higher the number of accurate predictions, the better the model, but it gets more complicated than that. There are metrics such as a confusion matrix (https://blog.csdn.net/Linli522362242/article/details/120093948), a Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR), and the area under the curve (AUC, see https://blog.csdn.net/Linli522362242/article/details/103786116): a perfect classifier has a ROC AUC equal to 1, whereas a purely random classifier has a ROC AUC equal to 0.5. We save those for more advanced texts. Let's say, in the modified hypothetical problem of guessing race based on height and weight, that we guess the race to be Caucasian while the correct race is African. That is then considered an error, and we can aggregate all such errors to find the aggregate errors across all predictions, but we will talk more on this in the later parts of the book.
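The error arithmetic above can be sketched in a few lines. This is only an illustration: the first actual/predicted pair is the 160 vs. 170 example from the text, while the remaining values and the labels "A", "B", "C" are made up.

```python
import numpy as np

# Hypothetical actual and predicted weights; the first pair is the example from the text
actual = np.array([160.0, 150.0, 182.0])
predicted = np.array([170.0, 148.0, 185.0])

errors = actual - predicted        # residual errors
abs_errors = np.abs(errors)        # absolute errors
squared_errors = errors ** 2       # squared errors penalize large misses more strongly

print(errors[0], abs_errors[0], squared_errors[0])   # -10.0 10.0 100.0
print("Mean absolute error:", abs_errors.mean())
print("Mean squared error:", squared_errors.mean())

# Classification: an error is any prediction whose label differs from the actual label
actual_labels = np.array(["A", "B", "A", "C"])
predicted_labels = np.array(["A", "B", "C", "C"])
misclassified = int(np.sum(actual_labels != predicted_labels))
accuracy = float(np.mean(actual_labels == predicted_labels))
print("Misclassified:", misclassified, "Accuracy:", accuracy)
```

Aggregating the per-sample errors (mean absolute error, mean squared error, accuracy) is what turns individual misses into a single number that can be compared across models.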
So far, we have been speaking in terms of a hypothetical example, but let's tie the terms we've encountered so far to financial datasets. As we mentioned, supervised learning methods are most common here because, in historical financial data, we are able to measure the price movements from the data. If we are simply trying to predict whether the price moves up or down from the current price, then that is a classification problem with two prediction labels: Price goes up and Price goes down. There can also be three prediction labels: Price goes up, Price goes down, and Price remains the same. If, however, we want to predict the magnitude and direction of price moves, then this is a regression problem where an example of the output could be Price moves +10.2 dollars, meaning the prediction is that the price will move up by $10.2. The training dataset is generated from historical data, and the test dataset can be historical data that was not used in training the model, as well as the live market data during live trading. We measure the accuracy of such models with the metrics we listed above, in addition to the PnL (profit and loss) generated from the trading strategies. With this introduction complete, let's now look into these methods in greater detail, starting with regression methods.
Before we start applying machine learning techniques to build predictive models, we need to perform some exploratory data wrangling on our dataset with the help of the steps listed here. This is often a large and underestimated prerequisite when it comes to applying advanced methods to financial datasets.
- import pandas as pd
- from pandas_datareader import data
- def load_financial_data( start_date, end_date, output_file='', stock_symbol='GOOG' ):
-     if len(output_file) == 0:
-         output_file = stock_symbol+'_data_large.pkl'
-     try:
-         # reuse the locally cached copy if it exists
-         df = pd.read_pickle( output_file )
-         print( "File data found. . . reading {} data".format(stock_symbol) )
-     except FileNotFoundError:
-         # otherwise download the data and cache it for the next run
-         print( "File not found. . . downloading the {} data".format(stock_symbol) )
-         df = data.DataReader( stock_symbol, 'yahoo', start_date, end_date)
-         df.to_pickle( output_file )
-     return df
In the code, we revisited how to download the data and implemented a method, load_financial_data, which we can use moving forward. It can also be invoked, as shown in the following code, to download 17 years of daily Google data:
- goog_data = load_financial_data( start_date='2001-01-01',
-                                   end_date='2018-01-01'
-                                 )
- goog_data.head()
The code will download financial data over a period of 17 years from GOOG stock data. Now, let's move on to the next step.
- import numpy as np
- def create_classification_trading_condition( df ):
-     df['Open-Close'] = df.Open - df.Close
-     df['High-Low'] = df.High - df.Low
-     df = df.dropna( axis=0)
-     X = df[ ['Open-Close', 'High-Low'] ]
-     # +1 if the close price tomorrow > the close price today, otherwise -1
-     Y = np.where( df['Close'].shift(-1) > df['Close'],
-                   1, -1
-                 )
-     return (X,Y)
The regression response variable is Close price tomorrow - Close price today for each day.
- def create_regression_trading_condition( df ):
-     df['Open-Close'] = df.Open - df.Close
-     df['High-Low'] = df.High - df.Low
-     # the difference between the close price tomorrow and the close price today
-     df['Target'] = df['Close'].shift(-1) - df['Close']
-     df = df.dropna( axis=0 ) # the last item after doing shift(-1) will be nan
-     X = df[ ['Open-Close', 'High-Low'] ]
-     Y = df[['Target']]
-     return (df, X,Y)
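To see what shift(-1) does to the two response variables, here is a minimal sketch on four made-up closing prices (the values and the column name Label are hypothetical and only serve the illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for four consecutive days
df = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 12.0]})

# shift(-1) aligns tomorrow's close with today's row
next_close = df["Close"].shift(-1)

# Regression target: tomorrow's close minus today's close
df["Target"] = next_close - df["Close"]

# Classification label: +1 if tomorrow's close is higher, otherwise -1
df["Label"] = np.where(next_close > df["Close"], 1, -1)

print(df)
```

The last row has no "tomorrow": its regression target is NaN and must be dropped, whereas the classification label silently becomes -1 because a comparison with NaN evaluates to False.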
- from sklearn.model_selection import train_test_split
- def create_train_split_group( X, Y, split_ratio=0.8 ):
-     # shuffle (bool, default=True): whether or not to shuffle the data before splitting.
-     # If shuffle=False, then stratify must be None.
-     # For stratify, see https://blog.csdn.net/Linli522362242/article/details/103387527
-     # Stock data is time-series data, so we keep the chronological order (shuffle=False).
-     return train_test_split( X, Y,
-                              shuffle=False,
-                              train_size=split_ratio
-                            )
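As a quick check of why shuffle=False matters for time series, the following sketch splits ten made-up daily observations (the dates, the column name feature, and the values are assumptions for illustration) and confirms that every training row precedes every test row:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical daily observations indexed by date
dates = pd.date_range("2020-01-01", periods=10, freq="D")
X = pd.DataFrame({"feature": np.arange(10.0)}, index=dates)
Y = pd.Series(np.arange(10.0) * 2, index=dates, name="target")

# shuffle=False keeps the rows in chronological order:
# the first 80% become the training set, the last 20% the test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, shuffle=False, train_size=0.8
)

print("last training date:", X_train.index[-1])
print("first test date:   ", X_test.index[0])
```

With the default shuffle=True, future observations could end up in the training set, which would leak information that is not available when trading live.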
Now that we know how to get the datasets that we need, how to quantify what we are trying to predict (objectives), and how to split data into training and testing datasets to evaluate our trained models on, let's dive into applying some basic machine learning techniques to our datasets:
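Given m × 1 observations of the target variable, m rows of feature values, and each row of dimension 1 × n, OLS seeks to find the weights of dimension n × 1 that minimize the residual sum of squares of differences between the target variable and the value predicted by the linear approximation:

$$\min_{W} \lVert XW - y \rVert_2^2 = \min_{W} \sum_{i=1}^{m} (x_i W - y_i)^2$$

Here, X is the m × n matrix of feature values, W is the n × 1 vector of weights, y is the m × 1 vector of observed target values, and x_i is the i-th row of X.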
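There are many underlying assumptions for OLS in addition to the assumption that the target variable is a linear combination of the feature values, such as the independence of the feature values themselves and normally distributed error terms.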
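The minimization of the residual sum of squares can be sketched directly with NumPy and compared with scikit-learn. The data below is synthetic, and the true weights (1.5, -2.0) and the intercept 0.5 are assumptions chosen for the illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: m = 200 observations, n = 2 features, known weights plus noise
m, n = 200, 2
X = rng.normal(size=(m, n))
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.5 + rng.normal(scale=0.1, size=m)

# Append a column of ones so that the intercept is learned as an extra weight
X1 = np.column_stack([X, np.ones(m)])

# Least-squares solution: the weights that minimize the residual sum of squares
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The same fit with scikit-learn
ols = LinearRegression().fit(X, y)

print("lstsq weights:  ", w_lstsq[:n], "intercept:", w_lstsq[n])
print("sklearn weights:", ols.coef_, "intercept:", ols.intercept_)
```

Both approaches solve the same least-squares problem, so the weights agree up to numerical precision and land close to the weights used to generate the data.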
The following diagram is a very simple example showing a relatively close linear relationship between two arbitrary variables. Note that it is not a perfect linear relationship; in other words, not all data points lie perfectly on the line, and we have left out the X and Y labels because these can be any arbitrary variables. The point here is to demonstrate an example of what a linear relationship visualization looks like. Let's have a look at the following diagram:
1. Start by loading up the Google data in the code, using the same method that we introduced in the previous section:
- goog_data = load_financial_data( start_date='2001-01-01',
- end_date='2018-01-01',
- output_file='goog_data_large.pkl'
- )
- goog_data.head()
2. Now, we create and populate the target variable vector, Y, for regression in the following code. Remember that what we are trying to predict in regression is magnitude and the direction of the price change from one day to the next:
- goog_data, X, Y = create_regression_trading_condition( goog_data )
3. With the help of the code, let's quickly create a scatter plot for the two features we have: High-Low price of the day and Open-Close price of the day against the target variable, which is Price-Of-Next-Day - Price-Of-Today (future price):
- import matplotlib.pyplot as plt
- # diagonal='kde' draws a kernel density estimate on the diagonal: an estimate of the
- # continuous probability distribution that might have generated the observed data
- pd.plotting.scatter_matrix( goog_data[['Open-Close', 'High-Low', 'Target']],
-                             grid=True,
-                             figsize=(10,6),
-                             diagonal='kde'
-                           )
- plt.show()
Using this scatter matrix, we can now quickly eyeball how the data is distributed and whether it contains outliers. For example, we can see in the KDE plot (the lower-right subplot in the scatter plot matrix) that the Target variable seems to be normally distributed but contains several outliers. Besides, the relationship between the features (Open-Close and High-Low) and the target variable (Target) is not linear (see also https://blog.csdn.net/Linli522362242/article/details/111307026).
- # the same scatter matrix drawn with seaborn
- import seaborn as sns
- import matplotlib.pyplot as plt
- cols = ['Open-Close', 'High-Low', 'Target']
- # PairGrid gives finer control than sns.pairplot( goog_data[cols], diag_kind='kde' )
- g = sns.PairGrid( goog_data[cols],
-                   height=1.5,
-                   aspect=2 # aspect * height gives the width (in inches) of each facet
-                 )
- g.map_diag( sns.kdeplot )
- g.map_offdiag( plt.scatter ) # g.map_offdiag( sns.scatterplot ) can be used instead
- # remove the axes in the upper triangle while keeping all x and y labels
- n = len(cols)
- for row in range( n ):
-     for col in range( row+1, n ):
-         g.axes[row,col].remove()
- # copy the x labels of the last row and the y labels of the first column
- # to every remaining subplot
- xlabels = [ ax.xaxis.get_label_text() for ax in g.axes[-1,:] ]
- ylabels = [ ax.yaxis.get_label_text() for ax in g.axes[:,0] ]
- for row in range( n ):
-     for col in range( row+1 ):
-         g.axes[row,col].xaxis.set_label_text( xlabels[col] )
-         g.axes[row,col].yaxis.set_label_text( ylabels[row] )
- plt.subplots_adjust( top=1.5 )
- plt.show()
4. Finally, as shown in the code, let's split 80% of the available data into the training feature value and target variable set (X_train, Y_train), and the remaining 20% of the dataset into the out-of-sample testing feature value and target variable set (X_test, Y_test):
- X_train, X_test, Y_train, Y_test = create_train_split_group( X,Y, split_ratio=0.8 )
- X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
5. Now, let's fit the OLS model as shown here and observe the model we obtain:
- # if scikit-learn is not installed yet (the conda package is named scikit-learn, not sklearn):
- #   conda install -c anaconda scikit-learn
- from sklearn import linear_model
- # Fit the model
- ols = linear_model.LinearRegression()
- ols.fit( X_train, Y_train )
6. The coefficients are the optimal weights assigned to the two features by the fit method. We will print the coefficients as shown in the code:
- print( 'Intercept: \n', ols.intercept_,
-        '\nCoefficients: \n', ols.coef_
-      )
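To make the role of the intercept and the coefficients concrete, this sketch (on synthetic features that merely stand in for Open-Close and High-Low; the weights 0.3, -0.7 and the intercept 2.0 are assumptions) shows that a prediction is the weighted sum of the features plus the intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Two synthetic features standing in for Open-Close and High-Low (assumed data)
X = rng.normal(size=(100, 2))
y = 0.3 * X[:, 0] - 0.7 * X[:, 1] + 2.0 + rng.normal(scale=0.05, size=100)

model = LinearRegression().fit(X, y)

# A prediction is just the weighted sum of the features plus the intercept
manual_pred = X @ model.coef_ + model.intercept_

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print(np.allclose(manual_pred, model.predict(X)))  # True
```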
7. The next block of code quantifies two very common metrics that test goodness of fit for the linear model we just built. Goodness of fit means how well a given model fits the data points observed in training and testing data. A good model is able to closely fit most of the data points, and errors/deviations between observed and predicted values are very low. Two of the most popular metrics for linear regression models are mean_squared_error, which is what we explored as our objective to minimize when we introduced OLS, and R-squared (R^2), which is another very popular metric that measures how well the fitted model predicts the target variable when compared to a baseline model whose prediction output is always the mean of the target variable based on training data, that is, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean of the target variable.
Let's compute the MSE of the training and test predictions (here, y_train_pred and y_test_pred denote the model's predictions on the training and test features):
- from sklearn.metrics import mean_squared_error
- print('MSE train: %.3f, test: %.3f' % ( mean_squared_error(y_train, y_train_pred),
-                                         mean_squared_error(y_test, y_test_pred)
-                                       ) )
In a house-price regression example, the MSE on the training dataset is 19.96, and the MSE on the test dataset is much larger, with a value of 27.20, which is an indicator that the model is overfitting the training data in that case. However, please be aware that the MSE is unbounded, in contrast to classification accuracy, for example. In other words, the interpretation of the MSE depends on the dataset and feature scaling. For example, if the house prices were presented as multiples of 1,000 (with the K suffix), the same model would yield a lower MSE compared to a model that worked with unscaled features. To further illustrate this point, ($10K − $15K)^2 < ($10,000 − $15,000)^2.
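The scale dependence of the MSE is easy to reproduce. In this sketch the house prices and predictions are made-up numbers; the only point is that expressing the same quantities in thousands shrinks the MSE by a factor of 1,000^2:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical house prices in dollars and a model's predictions for them
y_true_dollars = np.array([10_000.0, 20_000.0, 30_000.0])
y_pred_dollars = np.array([15_000.0, 18_000.0, 33_000.0])

# The same quantities expressed in thousands of dollars (the "K" scale)
y_true_k = y_true_dollars / 1_000
y_pred_k = y_pred_dollars / 1_000

mse_dollars = mean_squared_error(y_true_dollars, y_pred_dollars)
mse_k = mean_squared_error(y_true_k, y_pred_k)

print("MSE in dollars^2:", mse_dollars)
print("MSE in (K dollars)^2:", mse_k)
print("ratio:", mse_dollars / mse_k)
```

The model is equally good or bad in both cases; only the unit changed, which is why an MSE value cannot be judged without knowing the scale of the target.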
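Thus, it may sometimes be more useful to report the coefficient of determination (R^2), which can be understood as a standardized version of the MSE, for better interpretability of the model's performance. In other words, R^2 is the fraction of response variance that is captured by the model. The R^2 value is defined as:

$$R^2 = 1 - \frac{SSE}{SST}$$

Here, SSE is the sum of squared errors, $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, and SST is the total sum of squares, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of the target variable. In other words, SST is the variance of the response multiplied by n.

Let's quickly show that R^2 is indeed just a version of the MSE rescaled by the variance of y:

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{MSE}{Var(y)}$$

For the training dataset, R^2 is bounded between 0 and 1, but it can become negative for the test dataset. If R^2 = 1, the model fits the data perfectly with a corresponding MSE = 0 (since Var(y) > 0).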
Evaluated on the training data, the R^2 of the house-price model is 0.765, which doesn't sound too bad. However, the R^2 on the test dataset is only 0.673, which we can compute by executing the following code:
- from sklearn.metrics import r2_score
- print('R^2 train: %.3f, test: %.3f' % ( r2_score(y_train, y_train_pred),
-                                         r2_score(y_test, y_test_pred)
-                                       ) )
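The relationship between R^2, the MSE, and the variance of y can be checked numerically. The targets and predictions below are synthetic (an assumption for illustration), and np.var is used with its default population form so that it matches the total sum of squares divided by n:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(2)

# Synthetic targets and imperfect predictions (assumed data)
y = rng.normal(loc=5.0, scale=2.0, size=500)
y_pred = y + rng.normal(scale=1.0, size=500)

mse = mean_squared_error(y, y_pred)
var_y = np.var(y)   # population variance (divides by n)

# R^2 as the MSE rescaled by the variance of y
r2_manual = 1.0 - mse / var_y
r2_sklearn = r2_score(y, y_pred)
print(r2_manual, r2_sklearn)

# A baseline that always predicts the mean of y has an R^2 of 0
baseline_pred = np.full_like(y, y.mean())
print(r2_score(y, baseline_pred))
```

A model scoring below that baseline has an MSE larger than the variance of y, which is exactly the case in which R^2 turns negative.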
The coefficient of determination (R^2, or r^2) is, in statistics, a measure that assesses the ability of a model to predict or explain an outcome in the linear regression setting. More specifically, R^2 indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by linear regression and the predictor variable (X, also known as the independent variable).
The coefficient of determination shows only association. As with linear regression, it is impossible to use R^2 to determine whether one variable causes the other. In addition, the coefficient of determination shows only the magnitude of the association, not whether that association is statistically significant.
In general, a high R^2 value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of analysis.
R^2 increases (or at least never decreases) when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R^2 (typically denoted with a bar over the R in R^2) incorporates the same information as the usual R^2 but then also penalizes for the number (k) of predictor variables included in the model. As a result, R^2 increases as new predictors are added to a multiple linear regression model, but the adjusted R^2 increases only if the increase in R^2 is greater than one would expect from chance alone. It decreases when a predictor improves the model by less than expected by chance. In such a model, the adjusted R^2 is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
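A minimal sketch of the adjusted R^2, using the common formula 1 - (1 - R^2)(n - 1)/(n - k - 1) for n samples and k predictors. The helper name adjusted_r2 and the synthetic data are assumptions made for this illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

rng = np.random.default_rng(3)
n = 100

# One informative predictor and a target that depends only on it (assumed data)
x_useful = rng.normal(size=(n, 1))
y = 2.0 * x_useful[:, 0] + rng.normal(scale=1.0, size=n)

# Add a predictor of pure noise that has nothing to do with y
x_noise = rng.normal(size=(n, 1))
X_small = x_useful
X_big = np.hstack([x_useful, x_noise])

# score() returns the R^2 of the fitted model on the given data
r2_small = LinearRegression().fit(X_small, y).score(X_small, y)
r2_big = LinearRegression().fit(X_big, y).score(X_big, y)

print("R^2:         ", r2_small, r2_big)
print("adjusted R^2:", adjusted_r2(r2_small, n, 1), adjusted_r2(r2_big, n, 2))
```

On the training data, the plain R^2 of the larger model is never lower than that of the smaller one, even though the extra predictor is noise; the adjusted R^2 applies a penalty for the extra predictor, so it typically does not reward such an addition.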
Setting the exact formulas aside, intuitively, the closer the R^2 value is to 1, the better the fit, and the closer the value is to 0, the worse the fit. Negative values mean that the model fits worse than the baseline model. Models with negative R^2 values usually indicate issues in the training data or process and should not be used:
- from sklearn.metrics import mean_squared_error, r2_score
- # Training set: the mean squared error and R^2 (an R^2 of 1 is a perfect prediction)
- print( "Mean squared error: %.2f" % mean_squared_error( Y_train, ols.predict(X_train) ) )
- print( "Variance score: %.2f" % r2_score( Y_train, ols.predict(X_train) ) )
- # Test set: the mean squared error and R^2
- print( "Mean squared error: %.2f" % mean_squared_error( Y_test, ols.predict(X_test) ) )
- print( "Variance score: %.2f" % r2_score( Y_test, ols.predict(X_test) ) )
If the NaN that shift(-1) leaves in the last row of Target is not dropped, scoring the test set fails with the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
We can confirm that a NaN is the cause:
- import numpy as np
- error_list = Y_test - ols.predict( X_test )
- np.any( np.isnan( error_list ) )
The fix is to call dropna after shift(-1), as create_regression_trading_condition does. Without dropna in the data processing, the last row has to be excluded when scoring the test set:
- print( "Mean squared error: %.2f" % mean_squared_error( Y_test[:-1], ols.predict(X_test[:-1]) ) )
- print( "Variance score: %.2f" % r2_score( Y_test[:-1], ols.predict(X_test[:-1]) ) )
8. Finally, as shown in the code, let's use it to predict prices and calculate strategy returns:
The regression response variable is Close price tomorrow-Close price today for each day.
---It is a positive value if the price goes up tomorrow, a negative value if the price goes down tomorrow, and zero if the price does not change.
---The sign of the value indicates the direction, and the magnitude of the response variable captures the magnitude of the price move.
- import numpy as np
- # Target (defined earlier): df['Close'].shift(-1) - df['Close']
- goog_data['Predicted_Signal'] = ols.predict(X)
- # Normal log returns: log( Close price today ) - log( Close price yesterday )
- goog_data['GOOG_Returns'] = np.log( goog_data['Close']/goog_data['Close'].shift(1) )
- def calculate_return( df, split_value, symbol ):
-     # cumulative log return (in %) of the symbol over the test period
-     cum_goog_return = df[split_value:][ "%s_Returns" % symbol ].cumsum() * 100
-     # Log returns of the trading strategy: today's actual log return multiplied by
-     # the prediction made yesterday for today
-     # (side effect: adds the Strategy_Returns column to df)
-     df['Strategy_Returns'] = df["%s_Returns" % symbol] * df['Predicted_Signal'].shift(1)
-     # for classification:
-     # df['Strategy_Returns'] = df["%s_Returns" % symbol] * np.sign( df['Predicted_Signal'].shift(1) )
-     return cum_goog_return
- def calculate_strategy_return( df, split_value, symbol ):
-     cum_strategy_return = df[split_value:]['Strategy_Returns'].cumsum() * 100
-     return cum_strategy_return
- cum_goog_return = calculate_return( goog_data,
-                                     split_value=len(X_train), symbol='GOOG' )
- cum_strategy_return = calculate_strategy_return( goog_data,
-                                                  split_value=len(X_train), symbol='GOOG' )
- def plot_chart( cum_symbol_return, cum_strategy_return, symbol ):
-     plt.figure( figsize=(15,6) )
-     plt.plot( cum_symbol_return, label='%s Returns' % symbol )
-     plt.plot( cum_strategy_return, label='Strategy Returns' )
-     plt.legend()
-     plt.show()
- plot_chart( cum_goog_return, cum_strategy_return, symbol='GOOG' )
The simplified approach taken here does not account for transaction costs.
Here, we can observe that the simple linear regression model using only the two features, Open-Close and High-Low, produces positive returns. However, it does not outperform the Google stock's return because the stock has been increasing in value since inception. But since that cannot be known ahead of time, the linear regression model, which does not assume/expect increasing stock prices, is still a reasonable investment strategy.
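The return calculation can be traced by hand on a tiny example. The prices and predictions below are made up, the column names Returns and Position are introduced only for this sketch, and it uses the sign-only variant of the strategy (long when the prediction is positive, short when it is negative):

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices and model predictions of the next day's price change
df = pd.DataFrame({
    "Close": [100.0, 102.0, 101.0, 104.0, 103.0],
    "Predicted_Signal": [1.5, -0.8, 2.0, -0.5, 0.7],
})

# Daily log returns; the first row has no previous close, so it is NaN
df["Returns"] = np.log(df["Close"] / df["Close"].shift(1))

# Log returns add up: their sum equals the log of the overall price ratio
total_log_return = df["Returns"].sum()
print(total_log_return, np.log(103.0 / 100.0))

# Trade on the sign of yesterday's prediction: +1 = long, -1 = short
df["Position"] = np.sign(df["Predicted_Signal"].shift(1))
df["Strategy_Returns"] = df["Returns"] * df["Position"]

# Cumulative strategy return in percent
print(df["Strategy_Returns"].cumsum() * 100)
```

The shift(1) on the signal is what prevents look-ahead: the position held today is decided by the prediction made yesterday. In this made-up example every prediction has the right sign, so every strategy return is positive; real predictions will not be this accurate.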
- # Sharpe ratio: the risk-adjusted return. This ratio is important because it compares
- # the return of the strategy with a benchmark (normally a risk-free strategy; here,
- # the return of the symbol itself)
- def sharpe_ratio( symbol_returns, strategy_returns ):
-     strategy_std = strategy_returns.std()
-     sharpe = (strategy_returns-symbol_returns) / strategy_std
-     return sharpe.mean()
- # pass the arguments in the order of the signature: symbol first, then strategy
- print( sharpe_ratio(cum_goog_return, cum_strategy_return) )
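For comparison, a more conventional formulation of the Sharpe ratio works on per-period (not cumulative) returns: the mean excess return divided by its standard deviation, scaled to a yearly figure. The sketch below is not the method used above; the function name annualized_sharpe, the zero risk-free rate, and the 252 trading days per year are assumptions:

```python
import numpy as np

def annualized_sharpe(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    """Mean excess return divided by its standard deviation, scaled to a yearly figure."""
    excess = np.asarray(daily_returns, dtype=float) - daily_risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# Hypothetical daily returns of a strategy
daily_returns = [0.01, -0.005, 0.007, 0.002, -0.003, 0.004]
print(annualized_sharpe(daily_returns))
```

Because the ratio divides by the volatility of the returns, two strategies with the same average return are ranked by how smoothly they earn it.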
- # Variant: trade only on the sign (direction) of the prediction
- def calculate_return( df, split_value, symbol ):
-     cum_goog_return = df[split_value:][ "%s_Returns" % symbol ].cumsum() * 100
-     # Log returns of the trading strategy: today's actual log return multiplied by
-     # the sign of the prediction made yesterday for today
-     df['Strategy_Returns'] = df["%s_Returns" % symbol] * np.sign( df['Predicted_Signal'].shift(1) )
-     return cum_goog_return
- cum_goog_return = calculate_return( goog_data,
-                                     split_value=len(X_train), symbol='GOOG' )
- cum_strategy_return = calculate_strategy_return( goog_data,
-                                                  split_value=len(X_train), symbol='GOOG' )
- plot_chart( cum_goog_return, cum_strategy_return, symbol='GOOG' )
- print( sharpe_ratio(cum_goog_return, cum_strategy_return) )
- # the last ten positions taken by this variant
- np.sign( goog_data['Predicted_Signal'].shift(1) )[-10:]
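Now that we have covered OLS, we will try to improve on that by using regularization and coefficient shrinkage using LASSO and Ridge regression. One of the problems with OLS is that occasionally, for some datasets, the coefficients assigned to the predictor variables can grow to be very large. OLS can also end up assigning non-zero weights to all predictors, so the total number of predictors in the final model can be very large.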
Regularization tries to address both problems, that is, the problem of too many predictors and the problem of predictors with very large coefficients. Too many predictors in the final model is disadvantageous because it leads to overfitting, in addition to requiring more computations to predict. Predictors with large coefficients are disadvantageous because a few predictors with large coefficients can overpower the entire model's prediction, and small changes in predictor values can cause large swings in predicted output. We address this by introducing the concepts of regularization and shrinkage.
Regularization is the technique of introducing a penalty term on the coefficient weights and making that a part of the mean squared error, which regression tries to minimize. Intuitively, what this does is that it will let coefficient values grow, but only if there is a comparable decrease in MSE values. Conversely, if reducing the coefficient weights doesn't increase the MSE values too much, then it will shrink those coefficients. The extra penalty term is known as the regularization term, and since it results in a reduction of the magnitudes of coefficients, it is known as shrinkage.
Depending on the type of penalty term involving magnitudes of coefficients, it is either L1 regularization or L2 regularization. When the penalty term is the sum of the absolute values of all coefficients, this is known as L1 regularization (LASSO), and, when the penalty term is the sum of the squared values of the coefficients, this is known as L2 regularization (Ridge). It is also possible to combine both L1 and L2 regularization, and that is known as elastic net regression. We control how much penalty these regularization terms add by tuning the regularization hyperparameter. In the case of elastic net regression, there are two regularization hyperparameters, one for the L1 penalty and the other one for the L2 penalty.
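The shrinkage effect can be observed with scikit-learn's Lasso, Ridge, and ElasticNet estimators. This is a sketch on synthetic data: the ten features, the two true weights (3.0 and -2.0), and the penalty strengths are assumptions chosen for illustration. Note that scikit-learn's ElasticNet expresses its two hyperparameters as an overall strength (alpha) and a mixing ratio between the L1 and L2 penalties (l1_ratio):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(4)

# Synthetic data: 10 features, only the first two actually drive the target
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)                      # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)                     # L2 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

for name, model in [("OLS", ols), ("LASSO", lasso), ("Ridge", ridge), ("ElasticNet", enet)]:
    print(name, np.round(model.coef_, 3))

# The penalized fits have smaller coefficient magnitudes than the OLS fit
print("sum |coef|: OLS vs LASSO:", np.abs(ols.coef_).sum(), np.abs(lasso.coef_).sum())
print("sum coef^2: OLS vs Ridge:", (ols.coef_ ** 2).sum(), (ridge.coef_ ** 2).sum())
```

Larger penalty strengths shrink the coefficients further; with an L1 penalty, coefficients of uninformative features tend to be driven all the way to zero, which also addresses the problem of too many predictors.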