Grid Search: Searching for estimator parameters
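scikit-learn provides pipeline (for chaining estimators) and grid_search (for finding the best parameters), which together allow parameters to be tuned in parallel.
For example, when doing text classification with scikit-learn: how many words should the vectorizer keep? Preprocessing filters out words whose document frequency exceeds max_df, so what should max_df be? Should the TfidfTransformer use tf alone or tf with idf? How many iterations should the classifier run, and what should the learning rate be? “Loop over the options and try them one by one”: that is the basic job grid search does.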
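The text-classification questions above can be sketched as a single grid search over a pipeline. This is only an illustration under assumptions: the corpus and labels are made up, SGDClassifier is an assumed choice of classifier, and the candidate values are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical data, only here to make the sketch runnable).
docs = [
    "the cat sat on the mat", "dogs chase the cat", "a cat and a dog play",
    "my dog barks at the cat", "cats and dogs are pets", "the puppy is a young dog",
    "stocks fell on the market", "the bank raised interest rates",
    "investors buy stocks and bonds", "the market rallied today",
    "interest rates affect the bank", "bonds and stocks are investments",
]
labels = [0] * 6 + [1] * 6  # 0 = pets, 1 = finance

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # A constant learning rate so that eta0 is actually used by the classifier.
    ("clf", SGDClassifier(learning_rate="constant", eta0=0.1, random_state=0)),
])

# Keys follow the "<step name>__<parameter>" convention of Pipeline.
param_grid = {
    "vect__max_features": [None, 20],   # how many words to keep
    "vect__max_df": [0.5, 1.0],         # document-frequency cut-off
    "tfidf__use_idf": [True, False],    # tf only, or tf with idf
    "clf__max_iter": [50, 200],         # number of iterations
    "clf__eta0": [0.01, 0.1],           # learning rate
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```

Every combination of the listed values is tried, so the number of candidates grows multiplicatively with each parameter that is added.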
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.
It is possible and recommended to search the hyper-parameter space for the best cross-validation score (see Cross-validation: evaluating estimator performance).
Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use estimator.get_params().
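As a small illustration of inspecting an estimator's parameters, the sketch below calls get_params() on an SVC. The choice of SVC and its constructor values are assumptions made only for this example.

```python
from sklearn.svm import SVC

# Any estimator would do; SVC is just an illustrative choice.
estimator = SVC(C=10, kernel="rbf")

# get_params() returns a dict mapping parameter names to their current values.
params = estimator.get_params()
for name in sorted(params):
    print(name, "=", params[name])
```

The names printed here are exactly the keys that can be used in a parameter grid for this estimator.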
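A search consists of: an estimator (regressor or classifier, such as sklearn.svm.SVC()); a parameter space; a method for searching or sampling candidates; a cross-validation scheme; and a score function.
Two generic approaches to sampling search candidates are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.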
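Grid Search: concretely, choose several values to try for each parameter, then traverse every combination of parameter values like a grid. The advantage is that it is simple and brute-force to implement, and if every combination can be traversed the result is fairly reliable. The drawback is that it is very time-consuming; for models such as neural networks in particular, usually only a few parameter combinations can be tried.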
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
The best example: see Nested versus non-nested cross-validation for an example of Grid Search within a cross validation loop on the iris dataset.
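A minimal sketch of running this param_grid through GridSearchCV follows. The iris dataset, the SVC estimator, and cv=5 are assumptions chosen to keep the example small; ParameterGrid is used only to count the candidates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# 4 linear candidates + 4 * 2 rbf candidates = 12 combinations in total.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(n_candidates)
print(search.best_params_, search.best_score_)
```

Using a list of two dicts keeps gamma out of the linear-kernel candidates, where it would have no effect.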
Random Search: first lay out all the candidate parameter values as in Grid Search, then randomly pick a combination from them for each training run.
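The idea of drawing random candidates from a grid can be sketched with ParameterGrid and ParameterSampler from sklearn.model_selection. The grid values and n_iter=3 are arbitrary assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001]}

# Grid search would visit every one of these candidates.
all_candidates = list(ParameterGrid(grid))

# Random search draws only n_iter of them (without replacement when
# every parameter is given as a list).
sampled = list(ParameterSampler(grid, n_iter=3, random_state=0))

print(len(all_candidates), "candidates in the full grid")
for candidate in sampled:
    print(candidate)
```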
sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise', return_train_score=True)
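Randomized search has two main benefits over an exhaustive search: a budget can be chosen independently of the number of parameters and possible values, and adding parameters that do not influence the performance does not decrease efficiency. The parameters to sample are specified with a dictionary: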
{'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1), 'kernel': ['rbf'], 'class_weight': ['balanced', None]}
In principle, any function can be passed that provides a rvs (random variate sample) method to sample a value.
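The sampling dictionary above can be passed to RandomizedSearchCV roughly as follows. The iris dataset, n_iter=8, cv=5, and the fixed random_state are assumptions made for this sketch.

```python
import scipy.stats
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Distributions are sampled through their rvs method; lists are sampled uniformly.
param_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
    'class_weight': ['balanced', None],
}

# n_iter is the budget: the number of sampled candidates to evaluate.
search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=8, cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Raising n_iter trades run time for a better chance of landing near a good setting.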
Example: Comparing randomized search and grid search for hyperparameter estimation compares the usage and efficiency of randomized search and grid search.
An estimator class must provide these methods: get_params, set_params(**params), fit(x, y), predict(new_samples), score(x, y_true). Some of them (get_params and set_params) can be inherited directly via from sklearn.base import BaseEstimator.
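To make the required interface concrete, here is a sketch of a custom estimator used inside GridSearchCV. The class NearestCentroidSketch and its shrink parameter are hypothetical, invented only for this example; get_params and set_params come from BaseEstimator, and score is taken from ClassifierMixin rather than written by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV


class NearestCentroidSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predict the class with the closest centroid."""

    def __init__(self, shrink=0.0):
        # Constructor arguments are stored unchanged so get_params can find them.
        self.shrink = shrink

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        overall = X.mean(axis=0)
        centroids = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Pull each class centroid toward the overall mean by `shrink`.
        self.centroids_ = (1 - self.shrink) * centroids + self.shrink * overall
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every sample to every centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]


X, y = load_iris(return_X_y=True)
search = GridSearchCV(NearestCentroidSketch(), {"shrink": [0.0, 0.5, 0.9]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the hyper-parameter is a plain constructor argument, the search can clone the estimator and set shrink without any extra code.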
Use the validation set (that is, the development set) for model selection by feeding it to the grid search: the development set is the part of the data to be fed to the GridSearchCV instance.
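One way to sketch this split is shown below: part of the data is the development set given to GridSearchCV, and the rest is held out for the final evaluation. The 70/30 split, the iris dataset, and the candidate values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Development set: seen by the grid search. Evaluation set: never seen by it.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

search = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}, cv=5)
search.fit(X_dev, y_dev)

# With refit=True (the default) the best model is retrained on the whole
# development set, so it can be scored directly on the held-out data.
eval_score = search.score(X_eval, y_eval)
print(search.best_params_, eval_score)
```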
Computations can be run in parallel by using the keyword n_jobs=-1.
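If a parameter setting makes the model raise an error, the whole grid search can fail, but this can be avoided by setting error_score=0 (or error_score=np.nan): a failed fit then issues a warning and sets the score for that fold to 0 (or NaN).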
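The effect of error_score can be sketched with a grid that contains one deliberately invalid value. The use of SVC with C=-1.0 as the failing candidate is an assumption chosen because fitting with a non-positive C raises an error.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C=-1.0 is deliberately invalid so that fitting this candidate raises an error.
param_grid = {"C": [-1.0, 1.0]}

search = GridSearchCV(SVC(), param_grid, cv=3, error_score=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the failed-fit warnings in this demo
    search.fit(X, y)

# Map each candidate's C to its mean cross-validated score.
scores = {
    params["C"]: score
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
}
print(scores)
```

The failing candidate ends up with a score of 0 on every fold, so the search completes and the valid candidate is selected.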
Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization parameter by computing a single regularization path (instead of several when using cross-validation).
Models that benefit from the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for automated model selection include linear_model.LassoLarsIC.
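As an illustration of information-criterion model selection, the sketch below fits scikit-learn's LassoLarsIC with both criteria. The diabetes dataset is an assumed choice for the example; no cross-validation loop is involved.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLarsIC

X, y = load_diabetes(return_X_y=True)

# A single regularization path is computed, and the alpha that minimizes
# the chosen information criterion is kept.
model_aic = LassoLarsIC(criterion="aic").fit(X, y)
model_bic = LassoLarsIC(criterion="bic").fit(X, y)

print("alpha chosen by AIC:", model_aic.alpha_)
print("alpha chosen by BIC:", model_bic.alpha_)
```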
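Bayesian optimization takes into account the results already observed for different parameter settings, so it saves more time. Compared with grid search, the difference is like that between an old ox and a sports car. For the underlying principle, see the paper Practical Bayesian Optimization of Machine Learning Algorithms. Two Python libraries that implement Bayesian hyperparameter tuning are also recommended here and can be used out of the box: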
