Logistic Regression Formulas

Sigmoid, Log-Loss (Binary Cross-Entropy) & Gradient Descent
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Sigmoid Function

What it is: Transforms any number into a probability between 0 and 1.

Why it's useful:

  • Converts unbounded linear output to the probability range (0, 1).
  • Smooth and differentiable everywhere (gradient descent works).
  • Output can be interpreted as P(y=1|x).

Note: When z=0, output is exactly 0.5 (decision boundary).
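A minimal Python sketch of the sigmoid, confirming the z=0 boundary value:

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 exactly: the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0, and sigmoid(-z) = 1 - sigmoid(z)
```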

$$ z = \beta_0 + \beta_1 x $$
Linear Combination (z)
What it is: The input to the sigmoid function, a weighted sum of features.
  • \(x\): Input feature (e.g., hours studied)
  • \(\beta_0\) (or \(b\)): Intercept / Bias
  • \(\beta_1\) (or \(m\), \(w\)): Slope / Weight

Note: Same as linear regression, but fed into sigmoid.
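Putting the two pieces together, a sketch of a fitted model predicting P(y=1|x). The coefficients are made-up values for an "hours studied" example, not from any real fit:

```python
import math

# Hypothetical coefficients: hours studied -> probability of passing
beta_0 = -4.0   # intercept / bias
beta_1 = 1.5    # weight on hours studied

def predict_proba(x):
    z = beta_0 + beta_1 * x            # linear combination, same as linear regression
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

for hours in (1, 3, 5):
    print(hours, predict_proba(hours))  # probability rises with hours studied
```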

$$ L = -[y \cdot \ln(\hat{y}) + (1-y) \cdot \ln(1-\hat{y})] $$
Log-Loss (Single Point)

What it is: Loss for one prediction. Penalizes confident wrong predictions heavily.

  • If \(y=1\): Loss = \(-\ln(\hat{y})\). Small when \(\hat{y}\) is high.
  • If \(y=0\): Loss = \(-\ln(1-\hat{y})\). Small when \(\hat{y}\) is low.

Note: Uses \(\ln\) (natural log). Loss approaches infinity as prediction approaches wrong extreme.
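A short sketch of the single-point loss. The `eps` clipping is a standard numerical guard (an implementation choice, not part of the formula) so a prediction of exactly 0 or 1 cannot produce `log(0)`:

```python
import math

def log_loss_point(y, y_hat, eps=1e-15):
    """Log-Loss for one prediction; y is 0 or 1, y_hat is in (0, 1)."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # avoid log(0) at the extremes
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(log_loss_point(1, 0.9))   # confident and right: small loss (~0.105)
print(log_loss_point(1, 0.1))   # confident and wrong: large loss (~2.303)
```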

$$ J = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i) \right] $$
Binary Cross-Entropy (BCE)

What it is: Average Log-Loss across all training examples. The cost function to minimize.

Why it's useful:

  • Creates a convex surface (single global minimum).
  • With a suitable learning rate, gradient descent converges to the global minimum.
  • Derived from Maximum Likelihood Estimation (MLE).

Also known as: Log-Loss, Negative Log-Likelihood.
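The cost function is just the per-point loss averaged over the dataset. A minimal sketch (the example labels and predictions are illustrative):

```python
import math

def bce(y_true, y_pred, eps=1e-15):
    """Binary Cross-Entropy: average Log-Loss over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(bce([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # small: all predictions are good
print(bce([1, 0, 1, 0], [0.1, 0.9, 0.2, 0.8]))  # large: all predictions are wrong
```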

$$ J_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
MSE with Sigmoid (Avoid!)

What it is: Mean Squared Error applied to sigmoid output.

Why it's BAD:

  • Creates a non-convex surface with local minima.
  • Gradient vanishes when sigmoid saturates (very slow learning).
  • Gradient descent may get stuck in wrong place.

Rule: Use MSE for regression, Log-Loss for classification.
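The vanishing-gradient problem can be seen directly in the per-example gradients with respect to z: for Log-Loss it is \((\hat{y} - y)\), while for MSE the chain rule adds a factor of \(\hat{y}(1-\hat{y})\), which is near zero when the sigmoid saturates. A sketch comparing the two on a confidently wrong prediction (z = -8 is an arbitrary illustrative value):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y, z = 1, -8.0        # true label is 1, but z is deep in the wrong saturation region
y_hat = sigmoid(z)    # ~0.0003: confidently wrong

# Single-point gradients with respect to z:
grad_bce = y_hat - y                               # Log-Loss: stays large
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # MSE: damped by the sigmoid slope

print(grad_bce)  # magnitude near 1: strong learning signal
print(grad_mse)  # magnitude near 0: almost no learning signal
```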

$$ \frac{\partial J}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i) \cdot x_i $$
Gradient for Log-Loss

What it is: Direction to update weights to minimize loss.

Why it's useful:

  • Surprisingly simple: error \(\times\) input.
  • Same form as linear regression gradient!
  • No vanishing gradient problem (unlike MSE+Sigmoid).

Update rule: \(w := w - \alpha \cdot \frac{\partial J}{\partial w}\)
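The gradient and update rule together give a complete training loop. A sketch on a made-up 1-D dataset (hours studied vs. pass/fail); the learning rate and iteration count are arbitrary choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: hours studied -> fail (0) / pass (1); values are illustrative
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [0,   0,   0,   1,   1]

w, b, alpha = 0.0, 0.0, 0.1   # weight, bias, learning rate
n = len(X)

for _ in range(5000):
    # Gradients: average of (y_hat - y) * x for w, and (y_hat - y) for b
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(X, Y)) / n
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(X, Y)) / n
    w -= alpha * grad_w       # update rule: w := w - alpha * dJ/dw
    b -= alpha * grad_b

print(w, b)
print([round(sigmoid(w * x + b)) for x in X])  # thresholded predictions
```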

$$ \hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \geq 0.5 \\ 0 & \text{if } \sigma(z) < 0.5 \end{cases} $$
Classification Decision

What it is: Converting probability to class label using threshold.

Why it's useful:

  • Default threshold 0.5 = equal treatment of classes.
  • Threshold can be adjusted for imbalanced datasets.
  • Probability output allows confidence interpretation.

Note: Decision boundary is where \(z = 0\).
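The thresholding step as a one-liner, with the threshold exposed as a parameter so it can be raised or lowered for imbalanced data:

```python
def classify(p, threshold=0.5):
    """Convert a predicted probability into a class label."""
    return 1 if p >= threshold else 0

print(classify(0.73))                  # 1 under the default 0.5 threshold
print(classify(0.73, threshold=0.8))   # 0: stricter bar for the positive class
print(classify(0.5))                   # 1: the boundary case sigma(z) = 0.5 maps to 1
```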