
Linear Regression

Applications of Linear Regression

  • Housing price prediction
  • Sales volume forecasting
  • Finance: loan amount prediction; using linear regression and coefficient analysis to study contributing factors

What Is Linear Regression

Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (feature values) and a dependent variable (target value).

Characteristics: the case with only one independent variable is called univariate (simple) regression; the case with more than one independent variable is called multiple regression.

Linear Regression Formula

General formula:

h(w) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + b = w^T x + b

where w and x can be understood as matrices (column vectors):

w = \begin{pmatrix} b \\ w_1 \\ w_2 \end{pmatrix}, \quad
x = \begin{pmatrix} 1 \\ x_1 \\ x_2 \end{pmatrix}

So how should we interpret this? Let's look at a few examples:

  • Final grade: 0.7 × exam score + 0.3 × coursework score
  • House price = 0.02 × distance to the central district + 0.04 × urban nitric oxide concentration + (-0.12 × average price of owner-occupied homes) + 0.254 × town crime rate

In both examples a relationship is established between the feature values and the target value; this relationship can be understood as the regression equation.
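
To make this concrete, here is a minimal sketch of evaluating such a regression equation with NumPy; all numbers are made up for illustration:

import numpy as np

# Hypothetical weights, bias, and one sample's feature values
w = np.array([0.02, 0.04, -0.12, 0.254])  # one weight per feature
b = 1.0                                   # bias (intercept)
x = np.array([3.2, 0.5, 20.0, 0.1])       # feature values of one sample

# h(w) = w^T x + b
prediction = np.dot(w, x) + b
print(prediction)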

Loss Function

J(w) = (h_w(x_1) - y_1)^2 + (h_w(x_2) - y_2)^2 + \dots + (h_w(x_m) - y_m)^2 = \sum_{i=1}^{m} (h_w(x_i) - y_i)^2
  • y_i is the true value of the i-th training sample
  • h_w(x_i) is the prediction from the i-th training sample's feature combination
  • This loss is also known as the least-squares criterion
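
A minimal sketch of computing this least-squares loss directly (the values are hypothetical):

import numpy as np

# Hypothetical true values and predictions h_w(x_i)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.3, 6.9])

# Sum of squared errors over all m samples
loss = np.sum((y_pred - y_true) ** 2)
print(loss)  # 0.14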

How can we reduce this loss and make our predictions more accurate? We keep saying that machine learning can learn automatically, and linear regression is where this really shows: since the loss exists, we can apply optimization methods (which, at bottom, are just differentiation from calculus) to minimize the total loss of the regression.

Optimization Algorithms

How do we find the weights W of the model so that the loss is minimized? (The goal is to find the W corresponding to the minimum loss.)

Linear regression commonly uses two optimization algorithms: the normal equation and gradient descent.

Normal Equation
w = (X^T X)^{-1} X^T y

Interpretation: X is the matrix of feature values and y is the vector of target values; the best result is obtained directly.
Drawback: when there are many features or the problem is complex, solving is too slow, and a result may not exist at all (X^T X may not be invertible).
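
A minimal NumPy sketch of the normal equation on synthetic data (no intercept term, for brevity). Using np.linalg.solve avoids explicitly inverting X^T X:

import numpy as np

# Synthetic data: y = X @ true_w plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# w = (X^T X)^{-1} X^T y, computed by solving the linear system
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to [1.5, -2.0, 0.7]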

Gradient Descent
w_1 := w_1 - \alpha \frac{\partial \, cost(w_0 + w_1 x_1)}{\partial w_1}
w_0 := w_0 - \alpha \frac{\partial \, cost(w_0 + w_1 x_1)}{\partial w_0}

Interpretation: α is the learning rate, which must be specified manually (a hyperparameter); the factor next to α gives the direction of the step.
By following the direction in which the function decreases, we eventually reach the lowest point of the valley, updating the W values along the way.
Use: for tasks with very large training sets, gradient descent can find a reasonably good result.
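
A minimal sketch of these update rules for a single-feature model on synthetic data (the learning rate and iteration count are arbitrary choices):

import numpy as np

# Synthetic data around y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)

w1, w0 = 0.0, 0.0
alpha = 0.01  # learning rate: a manually chosen hyperparameter
for _ in range(1000):
    y_pred = w1 * x + w0
    # Partial derivatives of the mean-squared-error cost
    grad_w1 = 2 * np.mean((y_pred - y) * x)
    grad_w0 = 2 * np.mean(y_pred - y)
    w1 -= alpha * grad_w1
    w0 -= alpha * grad_w0
print(w1, w0)  # should approach 3.0 and 2.0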

Linear Regression API

  • sklearn.linear_model.LinearRegression(fit_intercept=True)
    • Optimized via the normal equation
    • fit_intercept: whether to compute the intercept
    • LinearRegression.coef_: regression coefficients
    • LinearRegression.intercept_: intercept
  • sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)
    • The SGDRegressor class implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models.
    • loss: the loss type
      • loss="squared_loss": ordinary least squares
    • fit_intercept: whether to compute the intercept
    • learning_rate: string, optional — the learning-rate schedule
      • 'constant': eta = eta0
      • 'optimal': eta = 1.0 / (alpha * (t + t0))
      • 'invscaling': eta = eta0 / pow(t, power_t) [default]
        • power_t=0.25: defined in the parent class
      • For a constant learning rate, use learning_rate='constant' and specify the rate with eta0.
    • SGDRegressor.coef_: regression coefficients
    • SGDRegressor.intercept_: intercept

    sklearn gives us these two implementations; choose whichever fits your needs.
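
A minimal usage sketch of LinearRegression on made-up toy data:

from sklearn.linear_model import LinearRegression

# Toy data following y = 2x (illustration only)
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

est = LinearRegression(fit_intercept=True)  # solved in closed form
est.fit(X, y)
print(est.coef_, est.intercept_)  # approximately [2.0] and 0.0
print(est.predict([[5]]))         # approximately [10.0]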

Regression Performance Evaluation

The Mean Squared Error (MSE) evaluation metric:

MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2

Note: \hat{y}_i is the predicted value and y_i is the true value.

  • sklearn.metrics.mean_squared_error(y_true, y_pred)
    • Mean squared error regression loss
    • y_true: true values
    • y_pred: predicted values
    • return: a float
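
A minimal usage sketch, reusing the hypothetical values from the loss-function example above:

from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 7.0]  # hypothetical true values
y_pred = [2.8, 5.3, 6.9]  # hypothetical predictions

print(mean_squared_error(y_true, y_pred))  # 0.14 / 3 ≈ 0.0467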

Gradient-Descent Optimization Methods: GD, SGD, SAG

GD

Gradient descent (GD): the original form of gradient descent must compute over all samples to obtain the gradient, which is computationally expensive; hence the series of improvements that followed.

SGD

Stochastic gradient descent (SGD) is an optimization method that considers only a single training sample per iteration (a per-sample update is sketched after the list below).

Advantages of SGD:

  • It is efficient.
  • It is easy to implement.

Disadvantages of SGD:

  • It requires a number of hyperparameters, such as the regularization parameter and the number of iterations.
  • It is sensitive to feature scaling.
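
A minimal sketch of that per-sample update for a single-feature model, in contrast to the full-batch version shown earlier (learning rate and iteration count are arbitrary choices):

import numpy as np

# Synthetic data around y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)

w1, w0 = 0.0, 0.0
alpha = 0.005
for _ in range(5000):
    i = rng.integers(len(x))          # pick ONE training sample per iteration
    err = (w1 * x[i] + w0) - y[i]     # prediction error on that sample
    w1 -= alpha * 2 * err * x[i]      # step along that sample's gradient
    w0 -= alpha * 2 * err
print(w1, w0)  # noisy, but should end up near 3.0 and 2.0
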
SAG

Stochastic Average Gradient (SAG): because plain stochastic gradient descent converges slowly, SAG and other gradient-descent-based algorithms were proposed.

Scikit-learn offers SAG optimization in SGDRegressor, ridge regression, logistic regression, and others.
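
As a concrete illustration, ridge regression in scikit-learn exposes a solver parameter that can be set to "sag"; a minimal sketch on synthetic data (SAG converges best on standardized features):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # scale features so SAG converges quickly

estimator = Ridge(alpha=0.5, solver="sag")  # stochastic average gradient solver
estimator.fit(X, y)
print(estimator.coef_[:3])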

Example Code

# A linear model is linear in the parameters (first power of the parameters)
# and/or linear in the inputs (first power of the independent variables),
# so it covers both linear and some non-linear relationships.
# A linear relationship is always a linear model, but the converse does not hold.
# There are two optimization methods: the normal equation and gradient descent.

# This script trains models to predict Boston house prices.
# Note: load_boston was removed in scikit-learn 1.2; this example targets older versions.
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, RidgeCV
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error  # mean squared error


def load_data():
    boston_data = load_boston()
    print("Data shape: (n_samples, n_features)", boston_data.data.shape)
    x_train, x_test, y_train, y_test = train_test_split(
        boston_data.data, boston_data.target, random_state=22)
    return x_train, x_test, y_train, y_test


# Normal equation
def linear_Regression():
    """
    Normal-equation optimization:
    solves in one shot,
    cannot address overfitting,
    suited to small datasets.
    :return:
    """
    x_train, x_test, y_train, y_test = load_data()
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = LinearRegression()
    estimator.fit(x_train, y_train)

    print("Normal equation coefficients:", estimator.coef_)
    print("Normal equation intercept:", estimator.intercept_)

    y_predict = estimator.predict(x_test)
    error = mean_squared_error(y_test, y_predict)
    print("Normal equation price predictions:", y_predict)
    print("Normal equation MSE:", error)
    return None


# Gradient descent
def linear_SGDRegressor():
    """
    Gradient-descent optimization:
    solves iteratively,
    suited to large datasets.
    :return:
    """
    x_train, x_test, y_train, y_test = load_data()
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # See the API docs for this class; the values below are the defaults:
    # estimator = SGDRegressor(loss="squared_loss", fit_intercept=True, eta0=0.01,
    #                          power_t=0.25)

    estimator = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=10000)
    # estimator = SGDRegressor(penalty='l2', loss="squared_loss")  # equivalent to ridge regression, but prefer Ridge
    estimator.fit(x_train, y_train)

    print("Gradient descent coefficients:", estimator.coef_)
    print("Gradient descent intercept:", estimator.intercept_)

    y_predict = estimator.predict(x_test)
    error = mean_squared_error(y_test, y_predict)
    print("Gradient descent price predictions:", y_predict)
    print("Gradient descent MSE:", error)

    return None


def linear_Ridge():
    """
    Ridge: ridge regression
    :return:
    """
    x_train, x_test, y_train, y_test = load_data()
    transfer = StandardScaler()  # standardizing the data is recommended
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = Ridge(max_iter=10000, alpha=0.5)  # ridge regression
    # estimator = RidgeCV(alphas=[0.1, 0.2, 0.3, 0.5])  # ridge regression with built-in cross-validation
    estimator.fit(x_train, y_train)

    print("Ridge regression coefficients:", estimator.coef_)
    print("Ridge regression intercept:", estimator.intercept_)

    y_predict = estimator.predict(x_test)
    error = mean_squared_error(y_test, y_predict)
    print("Ridge regression price predictions:", y_predict)
    print("Ridge regression MSE:", error)

    return None


if __name__ == '__main__':
    linear_Regression()
    linear_SGDRegressor()
    linear_Ridge()

Results

Data shape: (n_samples, n_features) (506, 13)
Normal equation coefficients: [-0.64817766 1.14673408 -0.05949444 0.74216553 -1.95515269 2.70902585
-0.07737374 -3.29889391 2.50267196 -1.85679269 -1.75044624 0.87341624
-3.91336869]
Normal equation intercept: 22.62137203166228
Normal equation price predictions: [28.22944896 31.5122308 21.11612841 32.6663189 20.0023467 19.07315705
21.09772798 19.61400153 19.61907059 32.87611987 20.97911561 27.52898011
15.54701758 19.78630176 36.88641203 18.81202132 9.35912225 18.49452615
30.66499315 24.30184448 19.08220837 34.11391208 29.81386585 17.51775647
34.91026707 26.54967053 34.71035391 27.4268996 19.09095832 14.92742976
30.86877936 15.88271775 37.17548808 7.72101675 16.24074861 17.19211608
7.42140081 20.0098852 40.58481466 28.93190595 25.25404307 17.74970308
38.76446932 6.87996052 21.80450956 25.29110265 20.427491 20.4698034
17.25330064 26.12442519 8.48268143 27.50871869 30.58284841 16.56039764
9.38919181 35.54434377 32.29801978 21.81298945 17.60263689 22.0804256
23.49262401 24.10617033 20.1346492 38.5268066 24.58319594 19.78072415
13.93429891 6.75507808 42.03759064 21.9215625 16.91352899 22.58327744
40.76440704 21.3998946 36.89912238 27.19273661 20.97945544 20.37925063
25.3536439 22.18729123 31.13342301 20.39451125 23.99224334 31.54729547
26.74581308 20.90199941 29.08225233 21.98331503 26.29101202 20.17329401
25.49225305 24.09171045 19.90739221 16.35154974 15.25184758 18.40766132
24.83797801 16.61703662 20.89470344 26.70854061 20.7591883 17.88403312
24.28656105 23.37651493 21.64202047 36.81476219 15.86570054 21.42338732
32.81366203 33.74086414 20.61688336 26.88191023 22.65739323 17.35731771
21.67699248 21.65034728 27.66728556 25.04691687 23.73976625 14.6649641
15.17700342 3.81620663 29.18194848 20.68544417 22.32934783 28.01568563
28.58237108]
Normal equation MSE: 20.6275137630954
Data shape: (n_samples, n_features) (506, 13)
Gradient descent coefficients: [-1.04364677 1.04012133 -0.32795148 1.50810169 -1.80799894 3.09581791
-0.2448096 -3.53801925 2.30450936 -1.97822406 -2.02534199 1.16418724
-4.57273239]
Gradient descent intercept: [22.84757248]
Gradient descent price predictions: [29.68343132 33.4649779 21.65534113 37.2744436 20.2168242 18.04268748
21.59750332 20.06731514 20.00246026 34.58427777 21.31221344 27.30835118
14.69640661 19.56967973 39.18122401 18.80184486 8.2635644 18.38495869
32.70417996 25.2530655 17.99073665 36.43375433 34.00286866 15.41240004
37.28675323 27.76783217 36.7588099 28.86921782 17.23043897 14.94613895
32.72206925 14.58519033 39.74531945 1.17929187 15.79689483 15.24682753
2.88288234 18.79504169 46.57756401 30.9978196 26.38109308 16.25789249
43.50267386 3.27378129 21.02810353 26.12907329 24.02543035 20.40637932
19.42557732 25.20037357 6.40268287 28.92483902 34.9999772 13.97527427
5.69462171 38.11385304 33.62010227 22.88758356 17.41771121 22.68927359
23.99646882 24.89873183 20.66010263 41.27511165 26.10099596 18.57889113
11.33387277 2.76703102 47.97552732 22.43924355 14.26382243 23.72700453
44.14770686 22.07638935 39.70975664 28.56167544 21.86372232 19.92235624
26.82787882 23.7074095 33.27924587 20.85367214 27.4932459 33.15910854
27.85949469 20.03189974 30.58283679 23.02442435 27.39497594 19.64573775
26.01073143 27.23507187 18.82268992 11.08717524 12.69571468 17.36183044
25.82223584 14.28948517 19.69879807 27.95863231 19.37527804 16.43066936
24.77348089 23.95527634 21.54063409 42.00723682 15.49339583 22.4190624
34.7342756 37.27889691 20.99417894 27.31720563 26.4511438 17.42328726
22.43170768 22.11483073 28.86388862 25.99491102 24.4952229 12.63038333
11.431302 -0.59333438 30.86738312 19.96951551 23.05014807 29.56912465
29.60285661]
Gradient descent MSE: 24.38311211436002
Data shape: (n_samples, n_features) (506, 13)
Ridge regression coefficients: [-0.64193209 1.13369189 -0.07675643 0.74427624 -1.93681163 2.71424838
-0.08171268 -3.27871121 2.45697934 -1.81200596 -1.74659067 0.87272606
-3.90544403]
Ridge regression intercept: 22.62137203166228
Ridge regression price predictions: [28.22536271 31.50554479 21.13191715 32.65799504 20.02127243 19.07245621
21.10832868 19.61646071 19.63294981 32.85629282 20.99521805 27.5039205
15.55295503 19.79534148 36.87534254 18.80312973 9.39151837 18.50769876
30.66823994 24.3042416 19.08011554 34.10075629 29.79356171 17.51074566
34.89376386 26.53739131 34.68266415 27.42811508 19.08866098 14.98888119
30.85920064 15.82430706 37.18223651 7.77072879 16.25978968 17.17327251
7.44393003 19.99708381 40.57013125 28.94670553 25.25487557 17.75476957
38.77349313 6.87948646 21.78603146 25.27475292 20.4507104 20.47911411
17.25121804 26.12109499 8.54773286 27.48936704 30.58050833 16.56570322
9.40627771 35.52573005 32.2505845 21.8734037 17.61137983 22.08222631
23.49713296 24.09419259 20.15174912 38.49803353 24.63926151 19.77214318
13.95001219 6.7578343 42.03931243 21.92262496 16.89673286 22.59476215
40.75560357 21.42352637 36.88420001 27.18201696 21.03801678 20.39349944
25.35646095 22.27374662 31.142768 20.39361408 23.99587493 31.54490413
26.76213545 20.8977756 29.0705695 21.99584672 26.30581808 20.10938421
25.47834262 24.08620166 19.90788343 16.41215513 15.26575844 18.40106165
24.82285704 16.61995784 20.87907604 26.70640134 20.75218143 17.88976552
24.27287641 23.36686439 21.57861455 36.78815164 15.88447635 21.47747831
32.80013402 33.71367379 20.61690009 26.83175792 22.69265611 17.38149366
21.67395385 21.67101719 27.6669245 25.06785897 23.73251233 14.65355067
15.19441045 3.81755887 29.1743764 20.68219692 22.33163756 28.01411044
28.55668351]
Ridge regression MSE: 20.641771606180907