Both covariance and correlation measure the linear relationship and the dependency between two variables.
Correlation values are standardized (e.g. the Pearson correlation coefficient lies between -1 and 1).
Covariance values are not standardized.
Correlation only measures the linear relationship between two variables.
If the correlation is zero, the two variables have no linear relationship, but they may still have a non-linear relationship.
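As a quick illustration (a minimal sketch with NumPy; the toy arrays are made up), covariance changes with the scale of the inputs while the Pearson correlation does not:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance is not standardized: rescaling x rescales the covariance
print(np.cov(x, y)[0, 1])
print(np.cov(10 * x, y)[0, 1])   # 10 times larger

# Correlation is standardized to [-1, 1]: rescaling x does not change it
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(10 * x, y)[0, 1])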
Basic statistics and machine learning concepts
Do not bring up other concepts that will lead the interviewer to ask follow-up questions; however, if you know an introduced concept very well, you can steer the interviewer toward it.
You can use the elbow method, a popular way to determine the optimal value of k. Essentially, you plot the sum of squared errors for each value of k (k on the x-axis, squared error on the y-axis). The point where the decline in distortion slows down sharply, forming an "elbow", is the elbow point (see the sketch after the link below).
https://medium.com/analytics-vidhya/elbow-method-of-k-means-clustering-algorithm-a0c916adc540
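A minimal sketch of the elbow plot with scikit-learn's KMeans (the synthetic dataset and the range of k are assumptions for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared errors

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("sum of squared errors (inertia)")
plt.show()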
Naive Bayes is naive because it makes a strong assumption: the features are assumed to be conditionally independent of one another given the class, which is almost never the case in practice.
Naive Bayes is better in the sense that it is easy to train and understand the process and results. A random forest can seem like a black box. Therefore, a Naive Bayes algorithm may be better in terms of implementation and understanding.
However, in terms of performance, a random forest is typically stronger because it is an ensemble technique.
There are a couple of ways to identify outliers:
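The specific methods are not listed here; as a hedged illustration, two common approaches are the z-score rule and the IQR rule (an assumption about which methods were intended; the data below is synthetic):

import numpy as np

data = np.random.default_rng(0).normal(loc=50, scale=5, size=1000)
data = np.append(data, [95.0, 4.0])  # two artificial outliers

# z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(z_outliers, iqr_outliers)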
There are two types of methods for feature selection: filter methods and wrapper methods.
Filter methods include the following:
Wrapper methods evaluate models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. They include the following:
Business sense
https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
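A hedged sketch of both families with scikit-learn (the dataset and the number of features to keep are illustrative assumptions): a chi-square filter with SelectKBest, and a wrapper-style Recursive Feature Elimination around a logistic regression:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Filter method: score each feature against the target, keep the top 10
X_filter = SelectKBest(score_func=chi2, k=10).fit_transform(X_pos, y)

# Wrapper method: repeatedly fit a model and drop the weakest features
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)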
Mean Squared Error (MSE) gives a relatively high weight to large errors, so MSE tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).
Principal component analysis (PCA) projects data to a lower-dimensional linear subspace, in a way that preserves the axes of highest variance in the data.
An important pre-processing step before applying PCA is to standardize features to have zero mean and unit variance.
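A minimal sketch of this preprocessing step plus PCA with scikit-learn (the dataset and the number of components are assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize to zero mean and unit variance before PCA
X_std = StandardScaler().fit_transform(X)

# Project onto the two directions of highest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)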
1. The Relationship is Non-Linear
How to determine a linear/non-linear relationship (see the residual-plot sketch after this list)
2. Our sample is non-random
Thus, it is good practice to do some EDA prior to building a regression model to confirm that the two groups are not drastically different
For serially correlated Y values, the estimates of the slope and intercept will be unbiased, but the estimates of their variances will not be reliable.
If you are unsure whether your Y values are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.
Do some EDA prior to building a regression model to confirm that the observations are independent.
3. We Have Perfect Collinearity
How to solve it
4. Our Error Term is Correlated with One of Our Independent Variables
How to solve it
5. We Violate the Homoscedasticity Assumption
How to solve it
6. Our Errors are Non-Normal
How to solve it
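A hedged sketch of the residual check mentioned in point 1 (the data is synthetic and purely illustrative): plot residuals against fitted values; a clear curve suggests a non-linear relationship, and a funnel shape suggests heteroscedasticity (point 5).

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 2.0 * x.ravel() ** 2 + rng.normal(0, 5, 200)  # the true relationship is quadratic

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)   # a curved pattern indicates non-linearity
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()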
What is flexibility in machine learning?
https://stackoverflow.com/questions/26437372/what-is-the-definition-of-flexibility-of-a-method-in-machine-learning
https://stats.stackexchange.com/questions/338009/how-is-model-flexibility-measured-or-quantified-what-units-is-it-measured-in/338896
https://stats.stackexchange.com/questions/193538/how-to-choose-alpha-in-cost-complexity-pruning
Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.g. correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.g. RFE, Recursive Feature Elimination).
They are often used for visualization, although the dimensionality reduction nature of the techniques may also make them useful as a data transform to reduce the number of predictors. This might include techniques from linear algebra, such as SVD and PCA.
This will penalize models based on the number of features used or weighting of features, encouraging the model to perform well and minimize the number of predictors used in the model.
This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.g. regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO (see the sketch below).
There is no best method and it is recommended to use controlled experiments to test a suite of different methods.
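As a hedged sketch of the regularization-based approach mentioned above, a LASSO model drives the coefficients of uninformative features to exactly zero (the synthetic dataset and alpha are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 features, but only 10 of them actually carry signal
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print(len(selected), "features kept out of", X.shape[1])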
These points are basically the same for all models.
The solutions are basically the same.
The support vectors are the data points that lie exactly on the boundaries of the maximum margin.
Would training the soft-margin SVMs either using primal or the dual formulation yield the same results on a test dataset? Why?
Yes, because the optimization problem is convex, strong duality holds, so the primal and dual formulations yield the same solution and therefore the same results on a test dataset.
Note that this is not saying that the number of features is larger than the number of samples. If there are more features than samples, the main approaches are still the ones described earlier.
https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.
Indeed, to be able to "solve" a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but we also want it not to have too many degrees of freedom, to avoid high variance and be more robust. This is the well-known bias-variance tradeoff.
Weak learners: High bias or high variance.
Bagging (reduces variance): often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.
Boosting (reduces bias): often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones), and combines them following a deterministic strategy.
Stacking (reduces bias): often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
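A hedged sketch of the three flavors with scikit-learn (the base learners, dataset and hyperparameters are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: homogeneous trees trained in parallel on bootstrap samples, then averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

# Boosting: homogeneous weak trees trained sequentially, each correcting the previous ones
boosting = AdaBoostClassifier(n_estimators=100)

# Stacking: heterogeneous learners combined by a meta-model trained on their predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("lr", LogisticRegression(max_iter=5000))],
    final_estimator=LogisticRegression())

for name, clf in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())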
We can use linear regression, but it is not common in practice.
There are a couple of reasons why a random forest is a better choice of an algorithm than a support vector machine:
Technically speaking, the bootstrap sampling method is a resampling method that uses random sampling with replacement.
It’s an essential part of the random forest algorithm, as well as other ensemble learning algorithms.
Bootstrap Aggregation (bagging) is a general procedure that can be used to reduce the variance of algorithms that have high variance.
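A minimal sketch of bootstrap sampling with NumPy (the toy dataset is made up): each resample has the same size as the original and is drawn with replacement, so some points repeat and others are left out.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy dataset

# One bootstrap sample: same size as the original, drawn with replacement
sample = rng.choice(data, size=len(data), replace=True)
out_of_bag = np.setdiff1d(data, sample)  # points not drawn in this resample

print(sample)        # some values appear more than once
print(out_of_bag)    # on average about a third of the points are left out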
Pros:
Cons:
https://zhuanlan.zhihu.com/p/27160995
The main differences are that Gradient Boosting is a generic algorithm to find approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function. Hence, Gradient Boosting is much more flexible.
AdaBoost:
Gradient Boosting:
Every tree is fit to the residuals from the previous trees.
The previous trees' residuals are nothing but the negative gradient of the loss function (exactly the residuals in the case of squared-error loss).
GradientBoostingClassifier
HistGradientBoostingClassifier
● Orders of magnitude faster than GradientBoostingClassifier on large datasets
● Inspired by LightGBM implementation
● Histogram-based split finding in tree learning
● Does not support sparse data
● Supports both binary & multi-class classification
● Natively supports categorical features
● Does not support monotonicity constraints
XGBoost
● One of most popular implementations of gradient boosting
● Fast approximate split finding based on histograms
● Supports GPU training, sparse data & missing values
● Adds L1 and L2 penalties on leaf weights
● Monotonicity & feature interaction constraints
● Works well with pipelines in sklearn due to a compatible interface
● Does not support categorical variables natively
LightGBM
● Supports GPU training, sparse data & missing values
● Histogram-based node splitting
● Uses Gradient-based One-Sided Sampling (GOSS) for tree learning
● Exclusive feature bundling to handle sparse features
● Generally faster than XGBoost on CPUs
● Supports distributed training on different frameworks like Ray, Spark, Dask etc.
● CLI version
CatBoost
● Optimized for categorical features
● Uses target encoding to handle categorical features
● Uses ordered boosting to build “symmetric” trees
● Overfitting detector
● Tooling support (Jupyter notebook & Tensorboard visualization)
● Supports GPU training, sparse data & missing values
● Monotonicity constraints
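A hedged usage sketch of one of these implementations, scikit-learn's HistGradientBoostingClassifier (the dataset and hyperparameters are illustrative assumptions); note that missing values are handled natively during split finding:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X[::20, 0] = np.nan  # inject some missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))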
● Several training samples (of the same size) are created by sampling the dataset with replacement
● Each training sample is then used to train a model
● The outputs from each of the models are averaged to make the final prediction.
Disadvantage:
All calculation steps:
Python Code
import math

def sigmoid(x, a, b):
    # Predicted probability p = 1 / (1 + exp(-(a*x + b)))
    return 1.0 / (1.0 + math.exp(-a * x - b))

def GradientDescent(x, y, a, b, lr=0.1):
    # Log-loss gradients for a single sample: d/da = (p - y) * x, d/db = (p - y)
    p = sigmoid(x, a, b)
    a = a - lr * (p - y) * x
    b = b - lr * (p - y)
    return a, b
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection or corpus.
Term Frequency (TF): It is a measure of the frequency of a word (w) in a document (d). TF is defined as the ratio of a word’s occurrence in a document to the total number of words in a document. The denominator term in the formula is to normalize since all the corpus documents are of different lengths.
Term Frequency — Inverse Document Frequency (TFIDF)
TFIDF gives more weight to words that are rare in the corpus (across all the documents).
TFIDF gives more importance to words that are frequent within a document.
Advantages: Bag of words (BoW) converts text into a feature vector by counting the occurrences of words in a document; it does not consider the importance of words. TFIDF builds on the BoW model and captures which words in a document are more relevant and which are less relevant. The importance of a word in a text is of great significance in information retrieval.
Disadvantage of TFIDF: it is unable to capture semantics. For example, "funny" and "humorous" are synonyms, but TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the vocabulary is vast.
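A minimal sketch with scikit-learn's TfidfVectorizer (the toy corpus is made up): words that appear in many documents get a low IDF, while rarer words get a higher weight.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are funny",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(tfidf.toarray().round(2))            # TF-IDF weight of each word in each document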
4995 Lecture2
Fit on Training data. Transform on Validation and Testing.
Train final model on Development data.
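A hedged sketch of this fit/transform discipline with a StandardScaler (the split sizes are illustrative): the scaler's statistics are learned from the training data only and then applied to the validation and test sets, which avoids data leakage.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_val = scaler.transform(X_val)          # transform validation ...
X_test = scaler.transform(X_test)        # ... and test with the same statistics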
Skewed data degrades the model's ability (especially for regression-based models) to describe typical cases, because it has to deal with rare cases at extreme values; i.e., a model trained on right-skewed data will predict better on data points with lower values than on those with higher values. Skewed data also does not work well with many statistical methods. However, tree-based models are not affected.
Log transformation
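A minimal sketch of the log transformation with NumPy (the synthetic right-skewed data is an assumption); np.log1p is often preferred because it also handles zeros:

import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10000)  # right-skewed data

transformed = np.log1p(skewed)  # log(1 + x), safe when x == 0

print(np.mean(skewed), np.median(skewed))            # mean pulled far above the median
print(np.mean(transformed), np.median(transformed))  # much closer to symmetric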
Linear model:
Multicollinearity will harm the interpretability greatly but hurt the performance slightly.
We could understand it from the meaning of the coefficients (Weights).
The coefficient (weight) of a feature means that if we fix the values of the other features, then when this feature increases by 1, the prediction increases by 1 * coefficient. However, if there is multicollinearity in our data, we cannot assume that one feature changes while the other features stay fixed. Therefore, these coefficients become very confusing and essentially uninterpretable.
In the case of linear regression, multicollinearity leads to weights/coefficients that have very high variance (since (X^T X)^{-1} becomes huge), so it is difficult to say anything precise about the coefficients. In some cases it therefore hurts model performance as well as model interpretability.
Yes, one-hot encoding leads to features that have dependencies (i.e. multicollinearity). One way to reduce it is to drop the variables that you believe would not necessarily impact your model's interpretability. Another way to drop variables is to use the Variance Inflation Factor (VIF) to remove features with high multicollinearity:
https://www.statisticshowto.com/variance-inflation-factor/
In this case, you could make this part of your model selection process, where you iteratively drop features using VIF, then estimating the performance of the model on a validation set and finally choosing the model with the highest performance.
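A hedged sketch of VIF-based screening with statsmodels (the synthetic dataframe and the VIF > 5-10 rule of thumb are illustrative assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
df["x3"] = rng.normal(size=200)                              # independent feature

X = sm.add_constant(df)  # VIF is usually computed with an intercept included
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)  # x1 and x2 show large VIFs; features above roughly 5-10 are candidates to drop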
Tree model:
How to fix multicollinearity?
https://towardsdatascience.com/how-to-build-a-baseline-model-be6ce42389fc
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset.
statistics and randomness
Classification baselines:
Regression baselines:
Machine Learning
Baseline model should be simple. Simple models are less likely to overfit. If you see that your baseline is already overfitting, it makes no sense to go for more complex modeling, as the complexity will kill the performance.
Baseline model should be interpretable. Interpretability will help you to get a better understanding of your data and will show you a direction for the feature engineering.
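A hedged sketch of simple statistical baselines with scikit-learn's dummy estimators (the datasets and strategies are illustrative choices):

from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import cross_val_score

# Classification baseline: always predict the most frequent class
X, y = load_breast_cancer(return_X_y=True)
clf = DummyClassifier(strategy="most_frequent")
print(cross_val_score(clf, X, y, cv=5).mean())

# Regression baseline: always predict the mean of the training targets
X, y = load_diabetes(return_X_y=True)
reg = DummyRegressor(strategy="mean")
print(cross_val_score(reg, X, y, cv=5).mean())  # R^2 near 0 for a mean predictor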
https://towardsdatascience.com/interperable-vs-explainable-machine-learning-1fa525e12f48
● Threshold-based metrics
○ Classification Accuracy
○ Precision, Recall & F1-score
● Ranking-based metrics
○ Average Precision (AP)
○ Area Under Curve (AUC)
Accuracy:
Precision, Recall & F1-score
Normalization
Minority class is considered positive
Macro averaging is a more balanced metric.
Choosing the Right Metric
● Problem-specific
● Balanced accuracy is better than accuracy (most of the time)
● Cost associated with misclassification
● Predicting that an individual has no cancer when he/she has cancer (a false negative) is far more costly than the other way round
● Predicting an email as spam when it is not (a false positive) has a higher cost than predicting an email as not spam
● Choose recall when the cost of false negatives is high (Type II error)
● Choose precision when the cost of false positives is high (Type I error)
Precision-Recall (PR) Curve
Receiver Operating Characteristic (ROC) Curve
Another useful tool to visualize the performance of a classification model
ROC depicts the relationship between False Positive Rate (FPR) and True Positive Rate/Recall (TPR)
FPR = FP / (TN + FP)
Area Under ROC (AUROC)
Area Under ROC (AUROC) provides an aggregate measure of model performance across all possible classification thresholds.
AUROC varies between 0 and 1, and a model that makes random or constant predictions has a value of 0.5.
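A hedged sketch computing these metrics with scikit-learn (the labels and scores are made up):

from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                        # thresholded predictions
y_score = [0.1, 0.2, 0.2, 0.3, 0.6, 0.4, 0.9, 0.8, 0.45, 0.7]   # predicted probabilities

# Threshold-based metrics
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))

# Ranking-based metrics
print(average_precision_score(y_true, y_score))  # area under the PR curve (AP)
print(roc_auc_score(y_true, y_score))            # area under the ROC curve (AUROC)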
Tradeoff between Type I and Type II errors:
Setting a lower significance level decreases a Type I error risk, but increases a Type II error risk.
Increasing the power of a test decreases a Type II error risk, but increases a Type I error risk.
The null hypothesis distribution shows all possible results you’d obtain if the null hypothesis is true. The correct conclusion for any point on this distribution means not rejecting the null hypothesis.
The alternative hypothesis distribution shows all possible results you’d obtain if the alternative hypothesis is true. The correct conclusion for any point on this distribution means rejecting the null hypothesis.
Type I and Type II errors occur where these two distributions overlap: one overlapping area represents alpha, the Type I error rate, and the other represents beta, the Type II error rate.
By setting the Type I error rate, you indirectly influence the size of the Type II error rate as well.
For statisticians, a Type I error is usually worse. In practical terms, however, either type of error could be worse depending on your research context.
A Type I error means mistakenly going against the main statistical assumption of a null hypothesis. This may lead to new policies, practices or treatments that are inadequate or a waste of resources.
In contrast, a Type II error means failing to reject a null hypothesis that is actually false. It may only result in missed opportunities to innovate, but these can also have important practical consequences.
Example:
Consequences of a Type I error
Based on the incorrect conclusion that the new drug intervention is effective, over a million patients are prescribed the medication, despite risks of severe side effects and inadequate research on the outcomes. The consequences of this Type I error also mean that other treatment options are rejected in favor of this intervention.
Consequences of a Type II error
If a Type II error is made, the drug intervention is considered ineffective when it can actually improve symptoms of the disease. This means that a medication with important clinical significance doesn’t reach a large number of patients who could tangibly benefit from it.
Example:
You decide to get tested for COVID-19 based on mild symptoms. There are two errors that could potentially occur:
Type I error (false positive): the test result says you have coronavirus, but you actually don’t.
Type II error (false negative): the test result says you don’t have coronavirus, but you actually do.
What is the difference between the Adam optimizer and SGD?
Initializing all the weights with zeros leads the neurons to learn the same features during training.
Add non-linearity into a neural network.
How to avoid it?
https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6
how to select optimizer?
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding
Reduce overfitting in NN model?
https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a
https://towardsdatascience.com/exploit-your-hyperparameters-batch-size-and-learning-rate-as-regularization-9094c1c99b55
https://www.quora.com/Why-does-decreasing-the-learning-rate-also-increases-over-fitting-rate-in-a-neural-network
https://www.1point3acres.com/bbs/thread-660530-1-1.html
high-cardinality
Cluster:
https://www.analyticsvidhya.com/blog/2017/02/test-data-scientist-clustering/