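SKlearn Study Notes

Published : 2020-01-23 Category : Data Science

This post collects the scikit-learn functions I noted down while studying, each with a few extra notes.

SKlearn Methods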

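train_test_split() splits arrays or matrices into random training and test subsets

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
  • random_state : seeds the random number generator, so the same split is produced on every run.
  • stratify : if not None, the data is split in a stratified fashion, using this array as the class labels.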
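To see what stratify changes, here is a minimal sketch on made-up data (the arrays X and y below are assumptions for illustration, not the housing data used above). With stratify=y, both subsets should keep the 80/20 class ratio of the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 80 of class 0 and 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of class 1 in each subset; both match the 20% of the full set
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Without stratify, the class ratio in a small test set can drift away from the ratio in the full data purely by chance.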

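StratifiedShuffleSplit() generates several randomly shuffled, stratified train/test splits

from sklearn.model_selection import StratifiedShuffleSplit
StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
  • n_splits : the number of train/test pairs to generate; set it as needed, the default is 10.
  • test_size and train_size : the share of samples (a float), or the absolute number of samples (an int), that goes into the test and training part of each pair.
  • random_state : seeds the random number generator.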
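The constructor alone does not split anything; the splits come from its split() method. A small sketch on made-up, balanced data (X and y are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 20 samples, 10 per class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

fold_sizes = []
for train_idx, test_idx in splitter.split(X, y):
    # Record train size, test size, and the number of class-1 samples in the test part
    fold_sizes.append((len(train_idx), len(test_idx), int(y[test_idx].sum())))

print(fold_sizes)
```

Each of the three splits holds out 4 samples, 2 from each class, because the class ratio is preserved in every split.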

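CategoricalEncoder class encodes an array of categories as one-hot or ordinal values

It returns a sparse array by default; call toarray() to convert it to a dense array, or set the encoding to onehot-dense to get a dense matrix directly. Note that CategoricalEncoder only ever existed in a development version of scikit-learn; released versions (0.20 and later) provide OneHotEncoder and OrdinalEncoder instead.

# never part of a released scikit-learn; on 0.20+ use OneHotEncoder or OrdinalEncoder
from sklearn.preprocessing import CategoricalEncoder

cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
  • encoding : str, 'onehot', 'onehot-dense' or 'ordinal'; the type of encoding, default 'onehot'.
  • categories : 'auto' or a list of lists/arrays of values.
  • dtype : number type, default np.float64
  • handle_unknown : 'error' (default) or 'ignore'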
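On a released scikit-learn (0.20 or later), the same two encodings can be sketched with OneHotEncoder and OrdinalEncoder. The category values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (hypothetical values), shaped (n_samples, 1)
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# One-hot: sparse by default, so convert to a dense array for display
onehot = OneHotEncoder()
onehot_dense = onehot.fit_transform(cats).toarray()

# Ordinal: one integer code per category, in sorted category order
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(cats)

print(onehot.categories_)
print(onehot_dense)
print(codes.ravel())
```

The learned categories are sorted (INLAND, ISLAND, NEAR BAY), which determines both the one-hot column order and the ordinal codes.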

MinMaxScaler() min-max scaling (normalization)

>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[ 0. 0. ]
[ 0.25 0.25]
[ 0.5 0.5 ]
[ 1. 1. ]]
>>> print(scaler.transform([[2, 2]]))
[[ 1.5 0. ]]
  • feature_range : tuple (min, max), default=(0, 1); the desired range of the scaled values.
  • copy : boolean, optional, default True; if True, the scaling is applied to a copy of the data instead of in place.

StandardScaler() standardization to zero mean and unit variance

>>> from sklearn.preprocessing import StandardScaler
>>>
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[ 0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
[-1. -1.]
[ 1. 1.]
[ 1. 1.]]
>>> print(scaler.transform([[2, 2]]))
[[ 3. 3.]]
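  • copy : boolean, optional, default True; if True, the scaling is applied to a copy of the data instead of in place.
  • with_mean : boolean, True by default; if True, center the data before scaling. This does not work on sparse matrices, because centering them would make them dense.
  • with_std : boolean, True by default; if True, scale the data to unit variance (equivalently, unit standard deviation).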

mean_squared_error() mean squared error (MSE), from which the RMSE is derived

import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # the RMSE is the square root of the MSE
lin_rmse
  • y_true : array-like of shape = (n_samples) or (n_samples, n_outputs); the ground-truth target values.
  • y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs); the predicted target values.

Returns: loss : float or ndarray of floats

mean_absolute_error() mean absolute error (MAE)

from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
  • y_true : array-like of shape = (n_samples) or (n_samples, n_outputs); the ground-truth target values.
  • y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs); the predicted target values.

Returns: loss : float or ndarray of floats

LinearRegression() linear regression model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
# Out:LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# let's try the full pipeline on a few training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
# Out:Predictions: [ 210644.60459286 317768.80697211 210956.43331178 59218.98886849 189747.55849879]
Methods description
fit(X, y[, sample_weight]) Fit linear model.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict using the linear model
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.

DecisionTreeRegressor() decision tree model

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
# Out:
# DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
# max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, min_samples_leaf=1,
# min_samples_split=2, min_weight_fraction_leaf=0.0,
# presort=False, random_state=42, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)
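# compute the RMSE on the training set (a result of 0.0 means the tree has memorized, i.e. overfit, the training data)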
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
# Out: 0.0
Methods description
apply(X[, check_input]) Returns the index of the leaf that each sample is predicted as.
decision_path(X[, check_input]) Return the decision path in the tree
fit(X, y[, sample_weight, check_input, …]) Build a decision tree regressor from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X[, check_input]) Predict class or regression value for X.
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.

RandomForestRegressor() random forest regression

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
Methods description
apply(X) Apply trees in the forest to X, return leaf indices.
decision_path(X) Return the decision path in the forest
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X) Predict regression target for X.
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.

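cross_val_score() K-fold cross-validation

Cross-validation expects a utility function (greater is better) rather than a cost function (lower is better), so the "neg_mean_squared_error" scorer returns the negative of the MSE. That is why the scores are negated (-scores) before taking the square root.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
  • estimator : estimator object implementing 'fit'; the object used to fit the data, here the decision tree tree_reg.
  • X : array-like; the data to fit, e.g. a list or an array.
  • y : array-like, optional, default: None; the target variable to predict in supervised learning.
  • scoring : string, callable or None, optional, default: None; a scorer name (see the model evaluation documentation) or a scorer callable.
  • cv : int, cross-validation generator or an iterable, optional; determines the cross-validation splitting strategy. An integer sets the number of folds K.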
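A self-contained sketch of the same pattern on synthetic data (the data and the choice of 5 folds are assumptions for illustration, not the housing setup above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical): a linear target plus a little noise
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)

scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X, y,
    scoring="neg_mean_squared_error", cv=5,
)

# The scorer returns negative MSE values, so flip the sign before the square root
rmse_scores = np.sqrt(-scores)
print("mean RMSE:", rmse_scores.mean(), "std:", rmse_scores.std())
```

Reporting the mean together with the standard deviation gives a rough idea of how precise the performance estimate is.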

joblib saving models

# sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23; import joblib directly
import joblib

joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
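A runnable round trip, as a minimal sketch: it assumes the standalone joblib package (installed alongside scikit-learn) and uses a tiny made-up model and a temporary directory so that nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Tiny model fitted on toy data (hypothetical): y = 2 * x
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.pkl")
    joblib.dump(model, path)      # serialize the fitted estimator
    loaded = joblib.load(path)    # restore it, fitted weights included

prediction = loaded.predict([[3.0]])[0]
print(prediction)
```

The loaded object is a fitted estimator, so it can predict right away without calling fit() again.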

SVR() epsilon-support vector regression

from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
  • C : float, optional (default=1.0); penalty parameter of the error term.
  • kernel : string, optional (default='rbf'); must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. If a callable is given, it is used to precompute the kernel matrix.
    Methods description
    fit(X, y[, sample_weight]) Fit the SVM model according to the given training data.
    get_params([deep]) Get parameters for this estimator.
    predict(X) Perform regression on samples in X.
    score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
    set_params(**params) Set the parameters of this estimator.

GridSearchCV() exhaustive search over specified parameter values for an estimator

from sklearn.model_selection import GridSearchCV

param_grid = [
# try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_
# Out:{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_
# Out:
# RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
# max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, min_samples_leaf=1,
# min_samples_split=2, min_weight_fraction_leaf=0.0,
# n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
# verbose=0, warm_start=False)
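  • estimator : estimator object; it must either provide a score function, or scoring must be passed.
  • param_grid : dict or list of dictionaries; a dictionary with parameter names as keys and lists of values to try as values, or a list of such dictionaries.
  • scoring : string, callable, list/tuple, dict or None, default: None
  • cv : int, cross-validation generator or an iterable, optional; an integer specifies the number of folds of a K-fold split.
  • refit : boolean, or string, default=True; refit an estimator on the whole dataset using the best parameters found.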
    Methods description
    decision_function(X) Call decision_function on the estimator with the best found parameters.
    fit(X[, y, groups]) Run fit with all sets of parameters.
    get_params([deep]) Get parameters for this estimator.
    inverse_transform(Xt) Call inverse_transform on the estimator with the best found params.
    predict(X) Call predict on the estimator with the best found parameters.
    predict_log_proba(X) Call predict_log_proba on the estimator with the best found parameters.
    predict_proba(X) Call predict_proba on the estimator with the best found parameters.
    score(X[, y]) Returns the score on the given data, if the estimator has been refit.
    set_params(**params) Set the parameters of this estimator.
    transform(X) Call transform on the estimator with the best found parameters.
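Besides best_params_ and best_estimator_, the score of every tried combination is kept in cv_results_. A small sketch on synthetic data (the data and the reduced grid are assumptions chosen to keep the run short):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.RandomState(42)
X = rng.rand(60, 4)
y = X[:, 0] + 2 * X[:, 1] + 0.05 * rng.randn(60)

param_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}  # 2 x 2 = 4 combinations
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid,
    cv=3, scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# One entry per parameter combination; convert the negative MSE to an RMSE
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```

With refit=True (the default), best_estimator_ has already been retrained on the full data with the winning combination.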

RandomizedSearchCV()

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
'n_estimators': randint(low=1, high=200),
'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
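  • estimator : estimator object; the estimator to tune.
  • param_distributions : dict; a dictionary with parameter names as keys and, as values, either lists of values to try or distributions to sample from. A distribution must provide an rvs method for sampling, such as those in scipy.stats.distributions.
  • n_iter : int, default=10; the number of parameter settings that are sampled.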
  • scoring : string, callable, list/tuple, dict or None, default: None
  • cv : int, cross-validation generator or an iterable, optional
  • refit : boolean, or string default=True
  • random_state : int, RandomState instance or None, optional, default=None

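Imputer() handling missing values

All attributes must be numerical. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer replaces it and accepts the same strategy argument.

from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer
# choose the value that replaces missing entries, here the median of each attribute
imputer = SimpleImputer(strategy="median")

# fit the imputer to the data
imputer.fit(housing_num)

# the computed medians are stored in the statistics_ attribute
imputer.statistics_

# replace the missing values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=list(housing.index.values))

# preview the rows that originally had missing values
housing_tr.loc[sample_incomplete_rows.index.values]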
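A minimal sketch of median imputation with SimpleImputer on a made-up array (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (hypothetical) with one missing value per column
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 50.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# Column medians computed on the non-missing values: 2.0 and 30.0
print(imputer.statistics_)
print(X_filled)
```

Fitting on the training data and reusing the same imputer on new data keeps the replacement values consistent.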

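fetch_mldata() downloading popular datasets

# e.g. download the MNIST dataset
# fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22;
# newer versions offer fetch_openml instead, e.g. fetch_openml('mnist_784', version=1)
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
  • dataname : str; the name of the dataset on mldata.org. The raw name is automatically converted to an mldata.org URL.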

cross_val_predict() cross-validated predictions

Generates a cross-validated estimate for each input data point.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
  • estimator : estimator object implementing ‘fit’ and ‘predict’
  • X : array-like
  • y : array-like, optional, default: None
  • cv : int, cross-validation generator or an iterable, optional

Getting the decision scores through cross-validation

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

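confusion_matrix() computes the confusion matrix

Computes the confusion matrix to evaluate the accuracy of a classification.

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
  • y_true : array, shape = [n_samples]; the ground-truth target values.
  • y_pred : array, shape = [n_samples]; the estimated targets returned by the classifier.
  • labels : array, shape = [n_classes], optional; the list of labels used to index the matrix.
  • sample_weight : array-like of shape = [n_samples], optional; sample weights.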
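To make the layout of the matrix concrete, here is a small sketch with made-up binary labels. Rows are the actual classes and columns the predicted classes, so for labels 0/1 the matrix reads [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): four negatives followed by four positives
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# Flattening the 2 x 2 matrix row by row yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```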

f1_score() computes the F1 score

The F1 score is the harmonic mean of precision and recall, so a high F1 requires both precision and recall to be high.

  • F1 score formula

    F1 = 2 * (precision * recall) / (precision + recall)

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
  • y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
  • y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.
  • average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']; required for multiclass/multilabel targets. If None, the score of each class is returned. Otherwise, it determines the type of averaging performed:
    keys description
    ‘binary’ Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
    ‘micro’ Calculate metrics globally by counting the total true positives, false negatives and false positives.
    ‘macro’ Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    ‘weighted’ Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
    ‘samples’ Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
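The formula F1 = 2 * (precision * recall) / (precision + recall) can be checked against f1_score directly. The labels below are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary labels (hypothetical): TP = 2, FP = 1, FN = 2
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2 / 4
f1 = f1_score(y_true, y_pred)

# Same value computed by hand from the formula
manual_f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, manual_f1)
```

Because it is a harmonic mean, the F1 score (4/7 here) is pulled toward the lower of the two values.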

decision_function() returns the decision score of each instance (so that a custom threshold can be applied)

y_scores = sgd_clf.decision_function([some_digit])

threshold = 0  # set the threshold
y_some_digit_pred = (y_scores > threshold)

threshold = 200000  # raise the threshold
y_some_digit_pred = (y_scores > threshold)

Getting the scores through cross-validation

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
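A self-contained sketch of thresholding decision scores, using a classifier trained on made-up two-cluster data (the data and the stricter threshold of 1.0 are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy binary data (hypothetical): two Gaussian clusters
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)

clf = SGDClassifier(random_state=42).fit(X, y)
scores = clf.decision_function(X)

# A threshold of 0 reproduces what predict() does for a binary linear classifier
default_pred = scores > 0

# A higher threshold flags fewer instances as positive
strict_threshold = 1.0
strict_pred = scores > strict_threshold
print(default_pred.sum(), strict_pred.sum())
```

Raising the threshold can only remove positive predictions, which tends to trade recall for precision.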

precision_recall_curve() computes precision-recall pairs for different probability thresholds

Precision is the ratio tp / (tp + fp); recall is the ratio tp / (tp + fn).

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Parameters
  • y_true : array, shape = [n_samples]; the true binary labels, in {-1, 1} or {0, 1}.
  • probas_pred : array, shape = [n_samples]; estimated probabilities or decision function values.

Returns
  • precision : array, shape = [n_thresholds + 1]
  • recall : array, shape = [n_thresholds + 1]
  • thresholds : array, shape = [n_thresholds <= len(np.unique(probas_pred))]
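One common use of the returned arrays is to pick the lowest threshold that reaches a target precision. A sketch with made-up scores (the 90% target is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores (hypothetical)
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# precisions and recalls have one more entry than thresholds:
# the last pair (precision 1, recall 0) has no associated threshold
idx = np.argmax(precisions[:-1] >= 0.9)  # first index reaching 90% precision
threshold_90 = thresholds[idx]

y_pred_90 = y_scores >= threshold_90
print(threshold_90, y_pred_90)
```

np.argmax returns the first True, and thresholds are sorted in increasing order, so this is the lowest qualifying threshold.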

precision_score()

The precision is the ratio tp / (tp + fp).

from sklearn.metrics import precision_score
precision_score(y_train_5, y_train_pred_90)

Parameters
  • y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
  • y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.

Returns
  • precision : float (if average is not None) or array of float, shape = [n_unique_labels]

recall_score()

The recall is the ratio tp / (tp + fn).

from sklearn.metrics import recall_score
recall_score(y_train_5, y_train_pred_90)

Parameters
  • y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
  • y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.

Returns
  • recall : float (if average is not None) or array of float, shape = [n_unique_labels]

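roc_curve() computes the receiver operating characteristic (ROC) curve

Note: this implementation is restricted to binary classification tasks.

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Parameters
  • y_true : array, shape = [n_samples]; the true binary labels, in {0, 1} or {-1, 1}. If the labels are not binary, pos_label should be given explicitly.
  • y_score : array, shape = [n_samples]; the target scores. They can be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by "decision_function" on some classifiers).
  • pos_label : int or str, default=None; the label considered positive. All other labels are considered negative.

Returns
  • fpr : array, shape = [>2]
  • tpr : array, shape = [>2]
  • thresholds : array, shape = [n_thresholds]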

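roc_auc_score() computes the area under the ROC curve (ROC AUC) from prediction scores

Note: this implementation is restricted to binary classification tasks, or to multilabel classification tasks in label indicator format.

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

Parameters
  • y_true : array, shape = [n_samples] or [n_samples, n_classes]; the true binary labels, or binary label indicators.
  • y_score : array, shape = [n_samples] or [n_samples, n_classes]; the target scores. They can be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by "decision_function" on some classifiers).

Returns
  • auc : float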
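The two functions are linked: integrating the curve returned by roc_curve gives the same number as roc_auc_score. A sketch with made-up scores, using sklearn.metrics.auc for the integration:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (hypothetical)
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

area_from_curve = auc(fpr, tpr)                 # trapezoidal area under (fpr, tpr)
area_direct = roc_auc_score(y_true, y_scores)   # same quantity in one call
print(area_from_curve, area_direct)
```

The value 0.75 matches the ranking interpretation of the AUC: of the 4 (positive, negative) pairs, 3 are ranked correctly by the scores.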

classes_ attribute (array)

When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. For example:

sgd_clf.classes_
# array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

sgd_clf.classes_[5]
# 5.0

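Forcing SKlearn to use the OvO or OvA strategy

When a binary classification algorithm is used for a multiclass task, Scikit-learn automatically runs OvA, except for SVM classifiers, for which it uses OvO.

To force Scikit-learn to use one-versus-one or one-versus-all, use the OneVsOneClassifier or OneVsRestClassifier class.

For example, to force the OvO strategy:

from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
# Out:
# array([ 5.])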
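The difference between the two strategies shows up in the number of binary classifiers that get trained. A sketch on the small digits dataset bundled with scikit-learn (an assumption made to keep the example self-contained; the notes above use MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Bundled 8x8 digits dataset with 10 classes (stand-in for MNIST)
X, y = load_digits(return_X_y=True)

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)
ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X, y)

n_ovo = len(ovo_clf.estimators_)  # one classifier per pair of classes: 10 * 9 / 2
n_ovr = len(ovr_clf.estimators_)  # one classifier per class
print(n_ovo, n_ovr)
```

OvO trains more classifiers, but each one only sees the samples of its two classes, which helps algorithms that scale poorly with the training set size.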

KNeighborsClassifier() k-nearest neighbors classifier

Classifier implementing the k-nearest neighbors vote.

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1, weights='distance', n_neighbors=4)
knn_clf.fit(X_train, y_train)

y_knn_pred = knn_clf.predict(X_test)
  • n_jobs : int, optional (default = 1); the number of parallel jobs to run for the neighbors search. If -1, the number of jobs is set to the number of CPU cores. Does not affect the fit method.
  • weights : str or callable, optional (default = 'uniform'); the weight function used in prediction. Possible values:
    keys description
    ‘uniform’ uniform weights. All points in each neighborhood are weighted equally.
    ‘distance’ weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
    [callable] a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
  • n_neighbors : int, optional (default = 5); the number of neighbors to use by default for kneighbors queries.

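DummyClassifier() a classifier that makes predictions using simple rules

This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.

# a purely random baseline classifier
from sklearn.dummy import DummyClassifier
# "stratified" stopped being the default strategy in scikit-learn 0.24; passing it explicitly keeps the predictions random
dmy_clf = DummyClassifier(strategy="stratified")
y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_dmy = y_probas_dmy[:, 1]
  • strategy : str, default="stratified" (changed to "prior" in scikit-learn 0.24); the strategy used to generate predictions. The "prior" strategy has been supported since version 0.17.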
    keys description
    “stratified” generates predictions by respecting the training set’s class distribution.
    “most_frequent” always predicts the most frequent label in the training set.
    “prior” always predicts the class that maximizes the class prior (like “most_frequent”) and predict_proba returns the class prior.
    “uniform” generates predictions uniformly at random.
    “constant” always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class
  • random_state : int, RandomState instance or None, optional, default=None
  • constant : int or str or array of shape = [n_outputs]; the explicit constant predicted by the "constant" strategy. This parameter is only used with that strategy.
    Methods description
    fit(X, y[, sample_weight]) Fit the random classifier.
    get_params([deep]) Get parameters for this estimator.
    predict(X) Perform classification on test vectors X.
    predict_log_proba(X) Return log probability estimates for the test vectors X.
    predict_proba(X) Return probability estimates for the test vectors X.
    score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
    set_params(**params) Set the parameters of this estimator.

accuracy_score() accuracy classification score

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_knn_pred)

Parameters
  • y_true : 1d array-like, or label indicator array / sparse matrix
  • y_pred : 1d array-like, or label indicator array / sparse matrix
  • normalize : bool, optional (default=True); if False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
  • sample_weight : array-like of shape = [n_samples], optional; sample weights.

Returns
  • score : float; if normalize == True, the fraction of correctly classified samples (float), otherwise the number of correctly classified samples (int).

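SGDRegressor() regression with SGD

A linear model fitted by minimizing a regularized empirical loss with SGD. SGD stands for stochastic gradient descent.

from sklearn.linear_model import SGDRegressor
# n_iter was removed in scikit-learn 0.21; max_iter=50 with tol=None runs the same 50 epochs
sgd_reg = SGDRegressor(max_iter=50, tol=None, penalty=None, eta0=0.1, random_state=42)
sgd_reg.fit(X, y.ravel())

Parameters
  • n_iter : int, optional; the number of passes over the training data (aka epochs). Defaults to None. Deprecated, and removed in 0.21.
  • max_iter : int, optional; the maximum number of passes over the training data (aka epochs). Replaces the n_iter parameter.
  • penalty : str, 'none', 'l2', 'l1', or 'elasticnet'; the penalty (aka regularization term) to use, default 'l2'.
  • eta0 : double, optional; the initial learning rate, default 0.01.
  • warm_start : bool, optional; when set to True, reuse the solution of the previous call to fit() as initialization, otherwise just erase the previous solution.
  • random_state : int, RandomState instance or None, optional (default=None); the random seed.

Attributes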
keys description
coef_ : array, shape (n_features,) Weights assigned to the features.
intercept_ : array, shape (1,) The intercept term.
average_coef_ : array, shape (n_features,) Averaged weights assigned to the features.
average_intercept_ : array, shape (1,) The averaged intercept term.
n_iter_ : int The actual number of iterations to reach the stopping criterion.
Methods description
densify() Convert coefficient matrix to dense array format.
fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
get_params([deep]) Get parameters for this estimator.
partial_fit(X, y[, sample_weight]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.
sparsify() Convert coefficient matrix to sparse format.

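PolynomialFeatures() generates polynomial and interaction features

Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
# array([-0.75275929])

Parameters
  • degree : integer; the degree of the polynomial features, default 2.
  • include_bias : boolean; if True (default), include a bias column, i.e. the feature in which all polynomial powers are zero (it acts as the intercept term in a linear model).
  • interaction_only : boolean, default = False; if True, only interaction features are produced.

Attributes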
keys description
powers_ : array, shape (n_output_features, n_input_features) powers_[i, j] is the exponent of the jth input in the ith output.
n_input_features_ : int The total number of input features.
n_output_features_ : int The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
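The [a, b] example from the description can be reproduced with concrete numbers. A sketch with a = 2 and b = 3 (values chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample [a, b] = [2, 3]
X = np.array([[2.0, 3.0]])

# Degree 2 with the bias column: [1, a, b, a^2, ab, b^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# interaction_only=True drops the pure powers a^2 and b^2: [1, a, b, ab]
inter = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = inter.fit_transform(X)

print(X_poly)
print(X_inter)
```

The number of output features grows quickly with the degree and the number of input features, so high degrees are usually combined with regularization.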

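Ridge() linear least squares with L2 regularization

This model solves a regression problem in which the loss function is the linear least squares function and the regularization is given by the l2-norm. The estimator has built-in support for multivariate regression (i.e. when y is a 2D array of shape [n_samples, n_targets]).

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
# array([[ 1.55071465]])

ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
# array([[ 1.5507201]])

Parameters
  • alpha : {float, array-like}, shape (n_targets); the regularization strength, must be a positive float.
  • random_state : int, RandomState instance or None, optional, default None
  • fit_intercept : boolean; whether to calculate the intercept for this model. If set to False, no intercept is used in the calculations (e.g. the data is expected to be already centered).
  • solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'}; the solver to use in the computational routines. In detail: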
    keys description
    ‘auto’ chooses the solver automatically based on the type of data.
    ‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than ‘cholesky’.
    ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.
    ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter).
    ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest but may not be available in old scipy versions. It also uses an iterative procedure.
    ‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

Attributes
keys description
coef_ : array, shape (n_features,) or (n_targets, n_features) Weight vector(s).
intercept_ : float array, shape = (n_targets,) Independent term in decision function. Set to 0.0 if fit_intercept = False.
n_iter_ : array or None, shape (n_targets,) Actual number of iterations for each target. Available only for sag and lsqr solvers. Other solvers will return None.New in version 0.17.

Lasso() linear model trained with an L1 prior as regularizer

Lasso optimization objective:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
  • alpha : float, optional; the constant that multiplies the L1 term.
  • fit_intercept : boolean; whether to calculate the intercept for this model. If set to False, no intercept is calculated (e.g. the data is expected to be already centered).
  • random_state : int, RandomState instance or None, optional, default None

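ElasticNet() linear regression with combined L1 and L2 priors as regularizer

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
  • alpha : float, optional; the constant that multiplies the penalty terms.
  • l1_ratio : float; the ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio = 0 gives an L2 penalty and l1_ratio = 1 an L1 penalty.
  • random_state : int, RandomState instance or None, optional, default None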

clone() constructs a new estimator with the same parameters

from sklearn.base import clone
# n_iter was removed in scikit-learn 0.21; max_iter=1 with tol=None runs exactly one epoch per fit() call
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # note: clone() copies the hyperparameters only, not the fitted weights
        best_model = clone(sgd_reg)
  • estimator : estimator object, or list, tuple or set of objects; the estimator (or group of estimators) to clone.
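It is worth checking what clone() actually copies. In the sketch below (toy data and a LinearRegression stand-in are assumptions for illustration), the clone has the same hyperparameters but no fitted attributes; copy.deepcopy from the standard library is one way to snapshot a fitted model instead.

```python
import copy

from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Fit a tiny model on toy data (hypothetical): y = 2 * x
model = LinearRegression(fit_intercept=False).fit([[1.0], [2.0]], [2.0, 4.0])

fresh = clone(model)              # same hyperparameters, but unfitted
snapshot = copy.deepcopy(model)   # full copy, fitted weights included

print(fresh.get_params() == model.get_params())  # hyperparameters match
print(hasattr(fresh, "coef_"))                   # the clone has no learned weights
print(snapshot.coef_)                            # the deep copy keeps them
```

So for an early-stopping loop that needs the weights of the best epoch, a deep copy is the safer choice; the clone would have to be fitted again before it can predict.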

datasets loading popular datasets

Load the iris dataset:

from sklearn import datasets
iris = datasets.load_iris()

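LogisticRegression() logistic regression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)
  • penalty : str, 'l1' or 'l2', default: 'l2'
  • dual : bool, default: False; dual or primal formulation. The dual formulation is only implemented for the l2 penalty with the liblinear solver. Prefer dual=False when n_samples > n_features.
  • tol : float, default: 1e-4; the tolerance for the stopping criteria.
  • C : float, default: 1.0; the inverse of the regularization strength, must be a positive float. As in support vector machines, smaller values specify stronger regularization.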
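Author : HeoLis
Original link : http://ishero.net/SKlearn%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0.html
Copyright notice : Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting!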