Logistic Regression Formulas

Sigmoid, Log-Loss (Binary Cross-Entropy) & Gradient Descent
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Sigmoid Function

What it is: Transforms any number into a probability between 0 and 1.

Why it's useful:

  • Converts unbounded linear output to the probability range (0, 1).
  • Smooth and differentiable everywhere (gradient descent works).
  • Output can be interpreted as P(y=1|x).

Note: When z=0, output is exactly 0.5 (decision boundary).
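A minimal Python sketch of the sigmoid, confirming the z=0 boundary value:

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 exactly: the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0, and sigmoid(-z) = 1 - sigmoid(z)
```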

$$ z = \beta_0 + \beta_1 x $$
Linear Combination (z)
What it is: The input to the sigmoid function, a weighted sum of features.
  • \(x\): Input feature (e.g., hours studied)
  • \(\beta_0\) (or \(b\)): Intercept / Bias
  • \(\beta_1\) (or \(m\), \(w\)): Slope / Weight

Note: Same as linear regression, but fed into sigmoid.
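Putting the two pieces together, a sketch of a fitted model predicting P(y=1|x). The coefficients are made-up values for an "hours studied" example, not from any real fit:

```python
import math

# Hypothetical coefficients: hours studied -> probability of passing
beta_0 = -4.0   # intercept / bias
beta_1 = 1.5    # weight on hours studied

def predict_proba(x):
    z = beta_0 + beta_1 * x            # linear combination, same as linear regression
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

for hours in (1, 3, 5):
    print(hours, predict_proba(hours))  # probability rises with hours studied
```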

$$ L = -[y \cdot \ln(\hat{y}) + (1-y) \cdot \ln(1-\hat{y})] $$
Log-Loss (Single Point)

What it is: Loss for one prediction. Penalizes confident wrong predictions heavily.

  • If \(y=1\): Loss = \(-\ln(\hat{y})\). Small when \(\hat{y}\) is high.
  • If \(y=0\): Loss = \(-\ln(1-\hat{y})\). Small when \(\hat{y}\) is low.

Note: Uses \(\ln\) (natural log). Loss approaches infinity as prediction approaches wrong extreme.
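A short sketch of the single-point loss. The `eps` clipping is a standard numerical guard (an implementation choice, not part of the formula) so a prediction of exactly 0 or 1 cannot produce `log(0)`:

```python
import math

def log_loss_point(y, y_hat, eps=1e-15):
    """Log-Loss for one prediction; y is 0 or 1, y_hat is in (0, 1)."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # avoid log(0) at the extremes
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(log_loss_point(1, 0.9))   # confident and right: small loss (~0.105)
print(log_loss_point(1, 0.1))   # confident and wrong: large loss (~2.303)
```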

$$ J = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i) \right] $$
Binary Cross-Entropy (BCE)

What it is: Average Log-Loss across all training examples. The cost function to minimize.

Why it's useful:

  • Creates a convex surface (single global minimum).
  • With a suitable learning rate, gradient descent converges to the global minimum.
  • Derived from Maximum Likelihood Estimation (MLE).

Also known as: Log-Loss, Negative Log-Likelihood.
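The cost function is just the per-point loss averaged over the dataset. A minimal sketch (the example labels and predictions are illustrative):

```python
import math

def bce(y_true, y_pred, eps=1e-15):
    """Binary Cross-Entropy: average Log-Loss over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(bce([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # small: all predictions are good
print(bce([1, 0, 1, 0], [0.1, 0.9, 0.2, 0.8]))  # large: all predictions are wrong
```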

$$ J_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
MSE with Sigmoid (Avoid!)

What it is: Mean Squared Error applied to sigmoid output.

Why it's BAD:

  • Creates a non-convex surface with local minima.
  • Gradient vanishes when sigmoid saturates (very slow learning).
  • Gradient descent may get stuck in wrong place.

Rule: Use MSE for regression, Log-Loss for classification.
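The vanishing-gradient problem can be seen directly in the per-example gradients with respect to z: for Log-Loss it is \((\hat{y} - y)\), while for MSE the chain rule adds a factor of \(\hat{y}(1-\hat{y})\), which is near zero when the sigmoid saturates. A sketch comparing the two on a confidently wrong prediction (z = -8 is an arbitrary illustrative value):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y, z = 1, -8.0        # true label is 1, but z is deep in the wrong saturation region
y_hat = sigmoid(z)    # ~0.0003: confidently wrong

# Single-point gradients with respect to z:
grad_bce = y_hat - y                               # Log-Loss: stays large
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # MSE: damped by the sigmoid slope

print(grad_bce)  # magnitude near 1: strong learning signal
print(grad_mse)  # magnitude near 0: almost no learning signal
```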

$$ \frac{\partial J}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i) \cdot x_i $$
Gradient for Log-Loss

What it is: Direction to update weights to minimize loss.

Why it's useful:

  • Surprisingly simple: error \(\times\) input.
  • Same form as linear regression gradient!
  • No vanishing gradient problem (unlike MSE+Sigmoid).

Update rule: \(w := w - \alpha \cdot \frac{\partial J}{\partial w}\)
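The gradient and update rule together give a complete training loop. A sketch on a made-up 1-D dataset (hours studied vs. pass/fail); the learning rate and iteration count are arbitrary choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: hours studied -> fail (0) / pass (1); values are illustrative
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [0,   0,   0,   1,   1]

w, b, alpha = 0.0, 0.0, 0.1   # weight, bias, learning rate
n = len(X)

for _ in range(5000):
    # Gradients: average of (y_hat - y) * x for w, and (y_hat - y) for b
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(X, Y)) / n
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(X, Y)) / n
    w -= alpha * grad_w       # update rule: w := w - alpha * dJ/dw
    b -= alpha * grad_b

print(w, b)
print([round(sigmoid(w * x + b)) for x in X])  # thresholded predictions
```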

$$ \hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \geq 0.5 \\ 0 & \text{if } \sigma(z) < 0.5 \end{cases} $$
Classification Decision

What it is: Converting probability to class label using threshold.

Why it's useful:

  • Default threshold 0.5 = equal treatment of classes.
  • Threshold can be adjusted for imbalanced datasets.
  • Probability output allows confidence interpretation.

Note: Decision boundary is where \(z = 0\).
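The thresholding step as a one-liner, with the threshold exposed as a parameter so it can be raised or lowered for imbalanced data:

```python
def classify(p, threshold=0.5):
    """Convert a predicted probability into a class label."""
    return 1 if p >= threshold else 0

print(classify(0.73))                  # 1 under the default 0.5 threshold
print(classify(0.73, threshold=0.8))   # 0: stricter bar for the positive class
print(classify(0.5))                   # 1: the boundary case sigma(z) = 0.5 maps to 1
```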