%load_ext autoreload
%autoreload 2
from logistic import LogisticRegression, GradientDescentOptimizer
from sparse_kernel_logistic import KernelLogisticRegression
Abstract
In this blog post, I implemented a sparse kernel logistic regression model. This model uses a kernel function to allow logistic regression to learn nonlinear patterns, and it uses \(\ell_1\) regularization to make the model sparse by setting many weights to zero.
I ran several experiments to show how the model behaves: how a large regularization strength (λ) makes the model sparse, how changing the kernel parameter (γ) affects the decision boundary, and how the model can fit nonlinear data like the make_moons dataset. Finally, I showed an example of overfitting by using a high γ and comparing training and testing performance with ROC curves.
Sparse Kernel Logistic Regression Implementation
In sparse_kernel_logistic.py, I implemented a custom logistic regression model that works in kernel space instead of feature space. This differs from regular logistic regression: instead of learning a weight vector \(w\), we learn a vector of dual weights \(a\) and use a kernel function to compute scores.
My KernelLogisticRegression class has the following functions:
Score function: to predict outputs for new data, the model computes the score \(s = K(X, X_{\text{train}})\, a\), where \(K\) is the kernel matrix between the new data and the training data. This is what lets the model make nonlinear decisions.
Loss function: the loss has two parts: the logistic loss, and an \(\ell_1\) regularization penalty on the dual weights \(a\) that encourages sparsity by driving many entries of \(a\) to zero.
Gradient function: computes the gradient of the total loss, including the (sub)gradient of the \(\ell_1\) penalty; this is what gradient descent uses to update \(a\).
Fit function: trains the model by computing the kernel matrix, initializing \(a = 0\), and updating \(a\) with gradient descent for many epochs.
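To make this concrete, below is a minimal sketch of what such a class might look like. This is only an illustration of the ideas above under assumed names and defaults, not the actual contents of sparse_kernel_logistic.py.
import torch

# Sketch only: names, defaults, and details are assumptions, not the real implementation.
class KernelLogisticRegressionSketch:
    def __init__(self, kernel, lam=0.1, **kernel_kwargs):
        self.kernel = kernel                # kernel function, e.g. rbf_kernel
        self.lam = lam                      # strength of the l1 penalty
        self.kernel_kwargs = kernel_kwargs  # e.g. gamma for the RBF kernel
        self.a = None                       # dual weight vector
        self.X_train = None

    def score(self, X, recompute_kernel=False):
        # s = K(X, X_train) a
        K = self.kernel(X, self.X_train, **self.kernel_kwargs) if recompute_kernel else self.K_train
        return K @ self.a

    def loss(self, K, y):
        # logistic loss plus the l1 penalty on the dual weights
        s = K @ self.a
        return torch.nn.functional.binary_cross_entropy_with_logits(s, y) + self.lam * self.a.abs().sum()

    def grad(self, K, y):
        # gradient of the logistic loss plus the subgradient of the l1 term
        p = torch.sigmoid(K @ self.a)
        return K.T @ (p - y) / y.shape[0] + self.lam * torch.sign(self.a)

    def fit(self, X, y, m_epochs=1000, lr=0.001):
        self.X_train = X
        self.K_train = self.kernel(X, X, **self.kernel_kwargs)  # n x n training kernel matrix
        self.a = torch.zeros(X.shape[0])                        # initialize a = 0
        for _ in range(m_epochs):
            self.a = self.a - lr * self.grad(self.K_train, y)   # gradient descent step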
Now let's set up the kernel for our experiments.
def rbf_kernel(X1, X2, gamma):
return torch.exp(-gamma * torch.cdist(X1, X2)**2)
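This implements the radial basis function (RBF) kernel, which for two points \(x\) and \(x'\) computes \(k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\); torch.cdist gives all pairwise distances, so the function returns the full kernel matrix between the rows of X1 and X2.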
Testing our implementation of our kernelized logistic regression
In the code below we generate a simple, linearly separable dataset and use it to test our implementation.
import torch
from matplotlib import pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

def classification_data(n_points = 300, noise = 0.2, p_dims = 2):
    y = torch.arange(n_points) >= int(n_points/2)
    y = 1.0*y
    X = y[:, None] + torch.normal(0.0, noise, size = (n_points, p_dims))
    # X = torch.cat((X, torch.ones((X.shape[0], 1))), 1)
    X = X - X.mean(dim = 0, keepdim = True)
    return X, y

def plot_classification_data(X, y, ax):
    assert X.shape[1] == 2, "This function only works for data created with p_dims == 2"
    targets = [0, 1]
    markers = ["o", ","]
    for i in range(2):
        ix = y == targets[i]
        ax.scatter(X[ix, 0], X[ix, 1], s = 20, c = y[ix], facecolors = "none", edgecolors = "darkgrey", cmap = "BrBG", vmin = -1, vmax = 2, alpha = 0.8, marker = markers[i])
    ax.set(xlabel = r"$x_1$", ylabel = r"$x_2$")

fig, ax = plt.subplots(1, 1)
X, y = classification_data(n_points = 100, noise = 0.4)
plot_classification_data(X, y, ax)
Now that we have set up our dataset and our kernelized logistic regression, let's finally test it.
from sparse_kernel_logistic import KernelLogisticRegression
KR = KernelLogisticRegression(rbf_kernel, lam = 0.1, gamma = 1)
KR.fit(X, y, m_epochs = 500000, lr = 0.0001)
Let's now look at the vector KR.a and check what fraction of its entries are distinguishable from zero.
(1.0*(KR.a > 0.001)).mean()
tensor(0.1000)
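So only about 10% of the training points have weights distinguishable from zero; the rest contribute essentially nothing to the model's predictions.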
In the code below we plot the scores along with the training data, highlighting the training points whose weights are distinguishable from 0.
ix = torch.abs(KR.a) > 0.001

x1 = torch.linspace(X[:,0].min() - 0.2, X[:,0].max() + 0.2, 101)
x2 = torch.linspace(X[:,1].min() - 0.2, X[:,1].max() + 0.2, 101)

X1, X2 = torch.meshgrid(x1, x2, indexing='ij')

x1 = X1.ravel()
x2 = X2.ravel()

X_ = torch.stack((x1, x2), dim = 1)

preds = KR.score(X_, recompute_kernel = True)
preds = 1.0*torch.reshape(preds, X1.size())

fig, ax = plt.subplots(1, 1)
ax.contourf(X1, X2, preds, origin = "lower", cmap = "BrBG",
            vmin = 2*preds.min() - preds.max(), vmax = 2*preds.max() - preds.min()
)
plot_classification_data(X, y, ax)
plt.scatter(X[ix, 0], X[ix, 1], facecolors = "none", edgecolors = "black")
We can see in the plot above that our model separates the two classes with a smooth, curved boundary. Only some of the training points (the ones with a black outline) have a large influence on the decision, and most other points have little or no effect. The RBF kernel helps create a nonlinear boundary that fits our data well.
Part A
Experiment 1: Effect of Large λ (Sparsity)
In this experiment, I show how increasing the regularization strength λ causes many entries of the weight vector \(a\) to become exactly zero. This is the defining feature of sparse models: they rely on only a few “important” training points.
I trained my model with a very large value of λ (λ = 10). After training, I counted how many entries of \(a\) are distinguishable from 0 by checking whether they exceed 0.001 in absolute value.
This experiment showed that the model became extremely sparse: none of the training points had weights greater than 0.001 in absolute value. With such strong regularization (λ = 10), the model ignored all the training points and did not form a usable boundary. This is what happens when regularization is too strong: the model becomes so sparse that it underfits the data and is no longer useful.
KR = KernelLogisticRegression(rbf_kernel, lam=10.0, gamma=1)
KR.fit(X, y, m_epochs=500000, lr=0.0001)
(1.0 * (KR.a.abs() > 0.001)).mean()
tensor(0.)
Experiment 2: Effect of γ on the Decision Boundary
For the second experiment I experimented with different values of the kernel parameter γ, which controls the “wiggliness” of the decision boundary. A small γ means each training point influences a wide region of the plane, producing smoother boundaries; a large γ makes each point's influence very local, producing wiggly boundaries.
I trained two models with the same λ but different γ values (γ = 0.1 vs. γ = 10) and plotted their decision boundaries. The model with the higher γ produced a much more complex boundary. The plots below show both models and how changing γ changes the boundary's behavior.
# Low gamma = smooth boundary
KR_smooth = KernelLogisticRegression(rbf_kernel, lam=0.1, gamma=0.1)
KR_smooth.fit(X, y, m_epochs=500000, lr=0.0001)

# High gamma = wiggly boundary
KR_wiggly = KernelLogisticRegression(rbf_kernel, lam=0.1, gamma=10)
KR_wiggly.fit(X, y, m_epochs=500000, lr=0.0001)
# grid for plotting
x1 = torch.linspace(X[:, 0].min() - 0.2, X[:, 0].max() + 0.2, 101)
x2 = torch.linspace(X[:, 1].min() - 0.2, X[:, 1].max() + 0.2, 101)
X1, X2 = torch.meshgrid(x1, x2, indexing='ij')
grid = torch.stack([X1.ravel(), X2.ravel()], dim=1)
# Plot for KR_smooth
preds_smooth = KR_smooth.score(grid, recompute_kernel=True)
preds_smooth = preds_smooth.reshape(X1.shape)

fig, ax = plt.subplots()
ax.contourf(X1, X2, preds_smooth, cmap="BrBG", levels=20)
plot_classification_data(X, y, ax)
ax.set_title("Decision Boundary with gamma = 0.1")
plt.show()
# Plot for KR_wiggly
preds_wiggly = KR_wiggly.score(grid, recompute_kernel=True)
preds_wiggly = preds_wiggly.reshape(X1.shape)

fig, ax = plt.subplots()
ax.contourf(X1, X2, preds_wiggly, cmap="BrBG", levels=20)
plot_classification_data(X, y, ax)
ax.set_title("Decision Boundary with gamma = 10")
plt.show()
Experiment 3: Learning Nonlinear Patterns with make_moons
In this experiment, I tested how well the sparse kernel logistic regression model can handle nonlinear data. I used the make_moons
dataset, which contains two interlocking half circles, something a linear model cannot separate.
I trained my kernelized model with an RBF kernel (λ = 0.1, γ = 5). The resulting decision boundary was smooth and curved, cleanly separating the two classes.
This shows that applying a kernel function implicitly maps the data into a higher-dimensional space where it becomes close to linearly separable, and logistic regression can still be applied there.
from sklearn.datasets import make_moons
# Make training data
X_moons_np, y_moons_np = make_moons(n_samples=300, noise=0.2)
X_moons = torch.tensor(X_moons_np, dtype=torch.float32)
y_moons = torch.tensor(y_moons_np, dtype=torch.float32)

# Training our model
KR_moons = KernelLogisticRegression(rbf_kernel, lam=0.1, gamma=5)
KR_moons.fit(X_moons, y_moons, m_epochs=500000, lr=0.0001)
Plotting the decision boundary for this experiment, we can see that our model was able to separate the two half-moon shapes using a smooth, curved decision boundary.
This shows that the RBF kernel lets the model create flexible, nonlinear boundaries that fit the true shape of the data well, giving the model good performance.
x1 = torch.linspace(X_moons[:,0].min() - 0.2, X_moons[:,0].max() + 0.2, 101)
x2 = torch.linspace(X_moons[:,1].min() - 0.2, X_moons[:,1].max() + 0.2, 101)
X1, X2 = torch.meshgrid(x1, x2, indexing='ij')

X_grid = torch.stack((X1.ravel(), X2.ravel()), dim=1)
preds = KR_moons.score(X_grid, recompute_kernel=True)
preds = torch.sigmoid(preds).reshape(X1.shape)

fig, ax = plt.subplots()
ax.contourf(X1, X2, preds, cmap="BrBG", alpha=0.7)
plot_classification_data(X_moons, y_moons, ax)
plt.title("Decision Boundary on make_moons Dataset")
Text(0.5, 1.0, 'Decision Boundary on make_moons Dataset')
Part B
Demonstrating Overfitting
In the code below we generate two datasets, one for training and one for testing. We then train our model with a poorly chosen γ to cause overfitting: setting γ high (somewhere around 5-10) on noisy data forces the model to fit the noise.
# Training data
X_train_np, y_train_np = make_moons(n_samples=300, noise=0.4)
X_train = torch.tensor(X_train_np, dtype=torch.float32)
y_train = torch.tensor(y_train_np, dtype=torch.float32)

# Testing data
X_test_np, y_test_np = make_moons(n_samples=300, noise=0.4)
X_test = torch.tensor(X_test_np, dtype=torch.float32)
y_test = torch.tensor(y_test_np, dtype=torch.float32)

# Training our model
KR_overfit = KernelLogisticRegression(rbf_kernel, lam=0.1, gamma=5)
KR_overfit.fit(X_train, y_train, m_epochs=500000, lr=0.0001)
Now that we have trained our model, we will evaluate it and get predictions on the train and test sets.
Plotting our ROC curve
In our plot, the training ROC curve is very close to the top-left corner (meaning high performance), while the test ROC curve is much lower, indicating poor generalization.
This large gap between training and testing performance means our model is overfitting: it fits the noise in the training data instead of learning a decision boundary that generalizes to new examples (the test data).
from sklearn.metrics import roc_curve, auc

# Predict on the training data and the test data
# (recompute_kernel=True so the kernel is computed between these points and the training data)
train_scores = torch.sigmoid(KR_overfit.score(X_train, recompute_kernel=True)).detach().numpy()
test_scores = torch.sigmoid(KR_overfit.score(X_test, recompute_kernel=True)).detach().numpy()

# Calculate the ROC curves
fpr_train, tpr_train, _ = roc_curve(y_train, train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, test_scores)

# Plot ROC curves
plt.figure()
plt.plot(fpr_train, tpr_train, label=f"Train ROC (AUC = {auc(fpr_train, tpr_train):.2f})")
plt.plot(fpr_test, tpr_test, label=f"Test ROC (AUC = {auc(fpr_test, tpr_test):.2f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve: Overfitting Example (High Gamma)")
plt.legend()
plt.grid()
plt.show()
Conclusion
In this project, I built a sparse kernel machine, a version of kernelized logistic regression that encourages sparsity. I tested how different settings of λ and γ affect the model's behavior.
I learned that using a large λ makes the model sparse by reducing the number of important training points. I also learned that adjusting γ changes how flexible the decision boundary is: small γ values make smoother boundaries, while large γ values create more complex ones.
Lastly, I saw how using a high γ can cause the model to overfit, doing well on training data but poorly on new testing data. Overall, this project helped me better understand how regularization and kernel methods work together to affect model complexity and generalization.