Mar 2, 2016

Difference between SVM and Logistic Regression -- actually not that much different

It is always a hot topic -- what is the difference between SVM and logistic regression? The answer seems too easy to be worth listing, doesn't it?

The two algorithms come from very different intuitions: SVM aims to separate the positive and negative observations with the maximal margin in a (possibly high-dimensional) feature space, while logistic regression tries to estimate the underlying probability of a positive observation given the independent variables.

The approaches to estimating the coefficients are also very different: SVM is solved by quadratic programming, while logistic regression is typically fitted by maximum likelihood estimation (MLE).

SVM typically works with the kernel trick, which enables it to fit non-linear decision boundaries. Logistic regression can also be kernelized, but that is much less popular.
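As a quick illustration -- a minimal sketch assuming scikit-learn and its make_moons toy data, with purely illustrative parameters -- an RBF-kernel SVM can fit a curved boundary that a plain (linear) logistic regression cannot:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

linear_lr = LogisticRegression().fit(X, y)        # linear decision boundary
kernel_svm = SVC(kernel="rbf", C=1.0).fit(X, y)   # non-linear boundary via the kernel trick

print("logistic regression accuracy:", linear_lr.score(X, y))
print("RBF-kernel SVM accuracy:     ", kernel_svm.score(X, y))
```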

On the other hand, logistic regression produces a probabilistic output, i.e., the probability of a positive observation given the independent variables. This can be very meaningful in later analysis in many business cases. Moreover, because of its deep roots in statistics, significance testing and feature selection methods for logistic regression are well studied. Many of those ideas and formulas can actually be adapted to SVM as well: for example, forward/backward selection carries over to SVM easily, and both models can be regularized via an \(L_1\)-norm penalty. So when the results require more statistical soundness, I would prefer logistic regression. That does not mean no such results exist for SVM; they are just less well known.
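A hedged sketch of these two points, again assuming scikit-learn (the synthetic dataset and the regularization strength C are arbitrary choices): logistic regression exposes class probabilities directly, and an \(L_1\) penalty on either model drives some coefficients to exactly zero, which doubles as feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# L1-penalized logistic regression and linear SVM.
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1).fit(X, y)

# Probabilistic output is only available from the logistic model.
print("P(y=1 | x) for first 3 rows:", lr_l1.predict_proba(X[:3])[:, 1])
print("non-zero coefficients (LR): ", np.sum(lr_l1.coef_ != 0))
print("non-zero coefficients (SVM):", np.sum(svm_l1.coef_ != 0))
```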


Similarities between SVM and logistic regression

But now I would like to talk about the similarity between them: they actually belong to the same family of optimization problems, just with different loss functions.

For SVM (with soft margin), the optimization problem is to minimize the following:
\begin{equation}J_{\mathrm{SVM}}=\frac{1}{n}\sum_{i=1}^n\bigl[1-y_i(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)\bigr]_{+}+\lambda\|\boldsymbol\beta\|^2, \text{where } y_i\in\{1, -1\}.\end{equation}
The part in the summation is the hinge loss (\([\ ]_+\) means taking the positive part, i.e., anything below 0 is truncated to 0). So SVM is actually minimizing the total hinge loss, plus an \(L_2\) penalty on the coefficients.
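A small numerical sketch of this objective (plain numpy; the coefficients and \(\lambda\) below are made-up values, not fitted ones):

```python
import numpy as np

def svm_objective(beta, beta0, X, y, lam):
    """Average hinge loss plus L2 penalty; labels y must be coded as +1/-1."""
    margins = y * (X @ beta + beta0)
    hinge = np.maximum(0.0, 1.0 - margins)   # [.]_+ : negative values are cut to 0
    return hinge.mean() + lam * np.dot(beta, beta)

X = np.array([[1.0, 2.0], [-1.0, -0.5]])
y = np.array([1, -1])
print(svm_objective(np.array([0.5, -0.2]), 0.1, X, y, lam=0.01))
```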

For logistic regression, the optimization problem is to minimize the following:
\begin{equation}\label{opt-lr-1}J_{\mathrm{LR}}=\frac{1}{n}\sum_{i=1}^n\ln\left[1+e^{-y_i(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)}\right], \text{where } y_i\in\{1, -1\}.\end{equation}
The loss function here is known as the (negative) binomial log-likelihood or simply logistic loss.
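To make the family resemblance concrete, here is the analogous sketch for the logistic loss, evaluated side by side with the hinge loss as a function of the margin \(m = y_i(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)\) (numpy only):

```python
import numpy as np

def logistic_loss(margins):
    """Logistic loss ln(1 + exp(-m)) as a function of the margin m."""
    return np.log1p(np.exp(-margins))

def hinge_loss(margins):
    """Hinge loss [1 - m]_+ as a function of the margin m."""
    return np.maximum(0.0, 1.0 - margins)

m = np.linspace(-2.0, 2.0, 5)
print("margin        :", m)
print("hinge loss    :", hinge_loss(m))
print("logistic loss :", logistic_loss(m))
# Both losses are large for badly misclassified points (m << 0) and small for
# confidently correct ones (m >> 0); they differ mainly in shape.
```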

Remark: the coding of the label \(y_i\) is critical to the form of the loss function, though it does not change the underlying model. In the formulas above, as in most cases, I used \(\{1, -1\}\) to indicate the positive and negative observations. Using \(\{1, 0\}\) is also popular, especially for logistic regression. With \(y_i\in\{1, 0\}\), the minimization in \eqref{opt-lr-1} becomes the maximization of the following:
\begin{equation}\label{opt-lr-2}\frac{1}{n}\sum_{i=1}^n\left[y_i\ln\left(\frac{1}{1+e^{-(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)}}\right)+(1-y_i)\ln\left(\frac{1}{1+e^{\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0}}\right)\right], \text{where } y_i\in\{1, 0\}.\end{equation}
Maximizing \eqref{opt-lr-2} is exactly MLE for logistic regression.
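For completeness, a quick check of this equivalence: write \(p_i = 1/\bigl(1+e^{-(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)}\bigr)\) for the modeled positive probability. A positive observation contributes \(\ln\bigl(1+e^{-(\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0)}\bigr)=-\ln p_i\) to \eqref{opt-lr-1}, which is exactly minus the \(y_i=1\) term of \eqref{opt-lr-2}; a negative observation (coded \(-1\) or \(0\)) contributes \(\ln\bigl(1+e^{\boldsymbol\beta\cdot \boldsymbol x_i+\beta_0}\bigr)=-\ln(1-p_i)\), matching the \(y_i=0\) term. So minimizing \eqref{opt-lr-1} and maximizing \eqref{opt-lr-2} are the same problem.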