Curve Fitting and Regression Analysis

TensorME · Published 2019-06-11 21:11:53

I. Curve Fitting

Given the following data:

Time	10:00	11:00	12:00	13:00	14:00	15:00
Temperature	12℃	15℃	17℃	20℃	25℃	18℃

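To predict the temperature at some time, we first need to model the time-temperature relationship from the known data, that is, build a function that gives temperature in terms of time. There are usually two ways to build such a model: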

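Interpolation: use a function (usually a polynomial) to stand in for the data table, and require that its curve pass through every given data point.
Fitting: only require that the function describing the data minimize the error under some chosen measure; it need not pass exactly through the data points.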
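
As a sketch of what "minimize the error under some chosen measure" can mean, one common choice (an assumption here; the text does not fix a particular measure) is the sum of squared errors, i.e. least squares. The symbols $(x_i, y_i)$ for the data points, $f(x;\theta)$ for the model with parameters $\theta$, and $p$ for the interpolating function are notation introduced only for this illustration.

```latex
\begin{aligned}
\text{Fitting:} \quad & \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \bigl( y_i - f(x_i; \theta) \bigr)^2 \\
\text{Interpolation:} \quad & p(x_i) = y_i, \qquad i = 1, \dots, n
\end{aligned}
```

Fitting leaves a nonzero residual at most points, while interpolation forces every residual to zero.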

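Geometrically, fitting takes some points in space and finds a continuous surface of known form but unknown parameters that approximates them as closely as possible, whereas interpolation finds a continuous surface (or several piecewise-smooth ones) that passes through them. Fitting can therefore be used for extrapolation (for example, predicting the temperature at 17:00), while interpolation is generally used to obtain unknown function values inside the interpolation range (any time between 10:00 and 15:00).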
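
The contrast can be tried on the temperature table above. The following is a minimal sketch, assuming NumPy is available; the choice of a degree-5 polynomial for interpolation and a straight line for fitting is an illustrative assumption, not something the text prescribes.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Data from the table above: hour of day and temperature in degrees Celsius.
hours = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
temps = np.array([12.0, 15.0, 17.0, 20.0, 25.0, 18.0])

# Interpolation: a degree-5 polynomial has 6 coefficients, so it can pass
# through all 6 data points.
interp_poly = Polynomial.fit(hours, temps, deg=5)

# Fitting: a straight line cannot pass through every point; least squares
# picks the line with the smallest sum of squared errors.
fit_line = Polynomial.fit(hours, temps, deg=1)

interp_error = np.max(np.abs(interp_poly(hours) - temps))
fit_error = np.max(np.abs(fit_line(hours) - temps))
print(f"max error at data points: interpolation={interp_error:.2e}, fit={fit_error:.2f}")

# Interpolation answers questions inside the observed range ...
print(f"interpolated value at 12:30: {interp_poly(12.5):.2f} C")
# ... while the fitted line can also be evaluated outside it (extrapolation).
temp_17 = fit_line(17.0)
print(f"fitted line at 17:00: {temp_17:.2f} C")
```

The extrapolated value depends heavily on the chosen model form (a line here), so it should be read as a rough estimate rather than a reliable forecast.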

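(A related concept is function approximation: using a simple function to approximately represent a complicated one. It presupposes that the expression of the complicated function is known, rather than only discrete points on it.)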
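
To illustrate function approximation, here is a small sketch that replaces the known function exp(x) with a degree-4 Taylor polynomial. The choice of exp, the degree, and the interval [-0.5, 0.5] are assumptions made for this example; `taylor_exp` is a hypothetical helper name.

```python
import math
import numpy as np

def taylor_exp(x, degree=4):
    """Taylor polynomial of exp(x) around 0, truncated at the given degree."""
    return sum(x**k / math.factorial(k) for k in range(degree + 1))

# The "complicated" function is known in closed form, so the approximation
# error can be measured anywhere on the interval, not only at data points.
xs = np.linspace(-0.5, 0.5, 101)
max_error = np.max(np.abs(np.exp(xs) - taylor_exp(xs)))
print(f"max |exp(x) - p4(x)| on [-0.5, 0.5] = {max_error:.2e}")
```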

As the above shows, fitting is a data modeling method.

Below is the definition of curve fitting from Wikipedia.

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a “smooth” function is constructed that approximately fits the data.

A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.
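
The "how much uncertainty" question can be made concrete on the temperature table. The sketch below fits a straight line by least squares and computes the usual standard error of the slope. It assumes a linear model with independent, equal-variance errors, which is an assumption of this example and not something the data are claimed to satisfy.

```python
import numpy as np

# Data from the table in section I.
hours = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
temps = np.array([12.0, 15.0, 17.0, 20.0, 25.0, 18.0])
n = len(hours)

# Closed-form least-squares estimates for the line temp = intercept + slope * hour.
x_centered = hours - hours.mean()
sxx = np.sum(x_centered**2)
slope = np.sum(x_centered * (temps - temps.mean())) / sxx
intercept = temps.mean() - slope * hours.mean()

# Residual variance with n - 2 degrees of freedom (two estimated parameters),
# then the standard error of the slope.
residuals = temps - (intercept + slope * hours)
residual_var = np.sum(residuals**2) / (n - 2)
slope_se = np.sqrt(residual_var / sxx)

print(f"slope = {slope:.3f} C per hour, standard error = {slope_se:.3f}")
```

The standard error is large relative to the slope here, which reflects how few points there are and how poorly a line describes the drop at 15:00.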

II. Regression Analysis

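In statistics, regression analysis is a statistical method for determining the quantitative relationship of mutual dependence between two or more variables. By the number of independent variables involved, it is divided into simple (one-variable) regression and multiple regression; by the number of dependent variables, into univariate and multivariate regression; by the type of relationship between the independent and dependent variables, into linear and nonlinear regression.
In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (the target) and independent variables (the predictors). It is commonly used for predictive analysis, time series modeling, and exploring causal relationships between variables. For example, the relationship between reckless driving and the number of road traffic accidents is best studied with regression.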
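
The categories above can be written out as model equations. This is only a sketch of typical forms; the symbols $\beta_j$ for coefficients, $p$ for the number of independent variables, and $\varepsilon$ for the random error are notation assumed here, and the exponential model is just one example of a nonlinear relationship.

```latex
\begin{aligned}
\text{Simple linear regression:} \quad & y = \beta_0 + \beta_1 x + \varepsilon \\
\text{Multiple linear regression:} \quad & y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \\
\text{A nonlinear regression model:} \quad & y = \beta_0 \, e^{\beta_1 x} + \varepsilon
\end{aligned}
```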

Definitions of regression analysis from Wikipedia and related English-language sources:
Regression is a statistical measurement used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or ‘criterion variable’) changes when any one of the independent variables is varied, while the other independent variables are held fixed.
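
The "one variable varied, the others held fixed" reading corresponds to the coefficients of a multiple linear regression. Below is a minimal sketch using `numpy.linalg.lstsq`; the data are made up and noise-free (generated from y = 2 + 3*x1 - x2) purely so that the recovered coefficients are easy to check.

```python
import numpy as np

# Made-up, noise-free data generated from y = 2 + 3*x1 - 1*x2.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 0.0])
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix with a column of ones for the intercept.
design = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coef

# Raising x1 by one unit while x2 stays fixed changes the prediction by b1.
base = intercept + b1 * 2.0 + b2 * 1.0
shifted = intercept + b1 * 3.0 + b2 * 1.0
print(f"coefficients: {coef}, effect of +1 in x1: {shifted - base:.3f}")
```

With real, noisy data the coefficients would only be estimates, and their interpretation as "effects" depends on the model being appropriate.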

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable.

In a narrower sense, regression may refer specifically to the estimation of continuous response (dependent) variables, as opposed to the discrete response variables used in classification. The case of a continuous dependent variable may be more specifically referred to as metric regression to distinguish it from related problems.

III. The Difference Between Fitting and Regression

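Regression is a data analysis method; fitting is a data modeling method. Fitting focuses on adjusting the parameters of a curve so that it agrees with the data. Regression focuses on studying the relationship between two or more variables; it can use fitting as a tool to study that relationship and the errors that arise.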

IV. Regression in Machine Learning

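Regression in this setting is closer to the notion of fitting as modeling. What distinguishes a regression problem from a classification problem is whether the output is a continuous variable or one of several categories; in machine learning both regression and classification can be used for prediction.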
A regression problem is when the output variable is a real or continuous value, such as “salary” or “weight”. Many different models can be used, the simplest is the linear regression. It tries to fit data with the best hyper-plane which goes through the points.

A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”. A classification model attempts to draw some conclusion from observed values. Given one or more inputs a classification model will try to predict the value of one or more outcomes.

For example, when filtering emails “spam” or “not spam”, when looking at transaction data, “fraudulent”, or “authorized”. In short Classification either predicts categorical class labels or classifies data (construct a model) based on the training set and the values (class labels) in classifying attributes and uses it in classifying new data. There are a number of classification models. Classification models include logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.

Appendix: Origin of the Term "Regression"

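Quoted from 《经济应用数学》 (Economic Applied Mathematics), edited by 汪荣伟 (Wang Rongwei).
Galton (Francis Galton, 1822-1911) studied medicine at Cambridge University in his early years, but the profession of physician did not appeal to him. He later received an inheritance, which allowed him to give up a medical career, and from 1850 to 1852 he traveled to Africa on an expedition. His achievements earned him the gold medal of the Royal Geographical Society in 1853. He then studied many subjects (meteorology, psychology, sociology, education, fingerprinting and others), and after 1865 his main interest turned to heredity, perhaps under the influence of his cousin Darwin.
From the 1880s Galton began to think about the resemblance between parents and offspring in height, temperament and various other traits. He chose the relationship between the parents' average height X and the height Y of one of their sons as his object of study. He observed 1074 pairs of parents and one son of each pair, plotted the results as a scatter plot, and found that the trend was close to a straight line. Overall, as the parents' average height X increases, the son's height Y also tends to increase, which is the expected result. What is interesting is that the mean of the parents' average heights over these 1074 pairs was 68 inches (an imperial unit; 1 inch = 2.54 cm), while the mean height of the 1074 sons was 69 inches, 1 inch more than the parents. He therefore conjectured that when the parents' average height is 64 inches, their sons' mean height should be 64 + 1 = 65 inches, and when the parents' average height is 72 inches, their sons' mean height should be 72 + 1 = 73 inches. The observations, however, did not agree with this. In the first case the sons' mean height was 67 inches, 3 inches above the parents' average; in the second it was 71 inches, 1 inch below the parents' average.
Galton's explanation after studying this was that nature exerts a kind of constraint that keeps human height relatively stable over a given period. If tall (or short) parents had children even taller (or shorter) than themselves, human stature would split toward the two extremes. Nature does not do this; it makes height regress toward the center. For example, parents with an average height of 72 inches exceed the mean of 68 inches, so they belong to the tall group, and their sons also tend to be tall (their mean of 71 inches is above the sons' overall mean of 69 inches), but not as far from the center as the parents (71 - 69 < 72 - 68). Conversely, parents with an average height of 64 inches belong to the short group, and their sons also tend to be short (their mean of 67 inches is below the sons' overall mean of 69 inches), but not as far from the center as the parents (69 - 67 < 68 - 64).
Height therefore tends to regress toward the center. Because of this property Galton introduced the word "regression" into the discussion, which is the origin of the name, and later generations gradually adopted it by custom. The statisticians who named it meant that the points all cluster around an unseen straight line, and points that stray far from that line seem to have a tendency to return toward it.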
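
The arithmetic in the passage can be checked with a short sketch. It fits a line through the three (parents' average, sons' mean) pairs quoted above; using only these three group averages is a simplification made for this example, not Galton's actual computation.

```python
import numpy as np

# Group averages quoted in the passage, in inches.
parent_mean = np.array([64.0, 68.0, 72.0])
son_mean = np.array([67.0, 69.0, 71.0])

# Least-squares line son = intercept + slope * parent.
slope, intercept = np.polyfit(parent_mean, son_mean, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# A slope below 1 is the "regression toward the center": the sons' deviation
# from their mean (69) is smaller than the parents' deviation from theirs (68).
parent_dev = parent_mean - 68.0
son_dev = son_mean - 69.0
print("parent deviations:", parent_dev)
print("son deviations:   ", son_dev)
```

Galton's naive guess of "parents' height plus 1 inch" corresponds to a slope of 1; the quoted averages instead imply a slope of about one half.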

References:
1. https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
2. https://en.wikipedia.org/wiki/Curve_fitting
3. https://en.wikipedia.org/wiki/Regression_analysis
4. https://blog.csdn.net/denghecsdn/article/details/77334160