%load_ext autoreload
%autoreload 2
from perceptron import Perceptron, PerceptronOptimizer
Source code for perceptron.py: https://github.com/Pbabu-Github/csci-0451/blob/main/posts/perceptron/perceptron.py
Abstract
In this blog post, I implement the perceptron algorithm using PyTorch. I experiment with linear boundaries that separate two classes in a dataset, as well as cases where a clear separation does not exist. My implementation includes a LinearModel class for basic operations (computing scores and predictions) and a Perceptron subclass that defines the loss (misclassification rate) and gradient update. The algorithm updates its weight vector whenever a data point is misclassified. Additionally, I use visualization functions to display the training data and, for 2D cases, both the decision boundary and the evolution of the loss over time.
Walkthrough of the grad() function
The grad() function in my implementation is responsible for computing the weight update for a single example. Here is what I did:
First, I made sure the data point X is in the right shape. If it came in as a row (1, p), I turned it into a 1D vector (p,).
Then I checked whether the label y is in {-1, 1}. If it wasn't, I converted it using y_mod = 2 * y - 1.
I calculated the score by taking the dot product of the weights and the input (s = w • x). This score tells us how far the point is from the decision boundary.
If s * y_mod is less than 0, that means the point is misclassified, so I returned the gradient as -y_mod * X. This will move the weights in the right direction.
If the point was classified correctly, I just returned a zero vector (so there would be no update needed).
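Putting these steps together, here is a minimal sketch of the single-point gradient logic (written as a hypothetical standalone helper with the weight vector passed in explicitly; it illustrates the steps above rather than reproducing the exact code in perceptron.py):
import torch

def perceptron_grad(w, X, y):
    # Reshape a row of shape (1, p) into a 1D vector of shape (p,).
    x = X.flatten()
    # Ensure the label is in {-1, 1}; convert a {0, 1} label if needed.
    y_mod = y if float(y) in (-1.0, 1.0) else 2 * y - 1
    # Score: how far (and on which side) the point is from the boundary.
    s = torch.dot(w, x)
    # Misclassified point: return a gradient that pushes w toward the correct side.
    if s * y_mod < 0:
        return -y_mod * x
    # Correctly classified: zero vector, so no update happens.
    return torch.zeros_like(w)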
This logic matches the perceptron update rule: if a point is wrong, we update the weights with (2y - 1) * x. Let's now perform some experiments to check our implementation. Before that, the code below does a simple check of the implementation by creating 2D data, performing one training step, and printing the loss after the update.
import torch
import matplotlib.pyplot as plt
from perceptron import Perceptron, PerceptronOptimizer, perceptron_data, plot_perceptron_data
# Generate the 2D data.
X, y = perceptron_data(n_points=300, noise=0.2)

# Plot the training data.
fig, ax = plt.subplots(figsize=(6, 6))
plot_perceptron_data(X, y, ax)
ax.set_title("Perceptron Training Data")
plt.show()

# Create a perceptron and an optimizer.
p = Perceptron()
opt = PerceptronOptimizer(p)
loss = opt.step(X[0:1], y[0])  # update using a single data point.
print("Loss after one step:", loss.item())
Loss after one step: 1.0
Experiment 1: Perceptron Training on 2D data
In the experiment below, I use the perceptron algorithm to classify 2D data that is linearly separable.
We first generate 2D data using perceptron_data(), which creates two clearly separated classes and adds a bias column. We next initialize a Perceptron model and an optimizer. For each iteration, we calculate the current loss (misclassification rate), randomly select a single data point, and, if the model misclassifies it, update the weights using the perceptron rule. We repeat this until the model achieves zero loss or we reach the maximum number of iterations.
# Experiment 1: 2D Data (Linearly Separable)
X2, y2 = perceptron_data(n_points=300, noise=0.2)  # generate 2D data
p2 = Perceptron()
opt2 = PerceptronOptimizer(p2)
loss_vec = []
n = X2.size(0)
max_iter = 1000

for _ in range(max_iter):
    loss = p2.loss(X2, y2)
    loss_vec.append(loss.item())
    if loss.item() == 0:
        break
    i = torch.randint(n, (1,))
    opt2.step(X2[[i], :], y2[i])
# Plot decision boundary on 2D data:
w = p2.w  # final weight vector: [w1, w2, bias]
x1_vals = torch.linspace(X2[:,0].min()-0.5, X2[:,0].max()+0.5, 100)
x2_vals = -(w[0]*x1_vals + w[2]) / w[1]  # boundary: w1*x1 + w2*x2 + bias = 0

plt.figure(figsize=(6,6))
plt.scatter(X2[:,0], X2[:,1], c=y2)
plt.plot(x1_vals, x2_vals, linewidth=2)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Decision Boundary on 2D Data")
plt.show()
Visualizing the loss evolution
# Plot loss evolution
plt.figure(figsize=(6,4))
plt.plot(loss_vec, marker='o', color='slategrey')
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Loss Evolution (2D Data)")
plt.show()
After training, we plot the loss vs. iteration to show how the loss decreases as the model learns, and we also plot the final decision boundary over the 2D data. The red and blue dots show the two classes, and the straight line separates them. This shows that the perceptron can successfully find a separating line when the data is linearly separable.
Runtime Complexity of a Single Perceptron Iteration
In each iteration of perceptron training, we:
- Compute a dot product between the weight vector w and a single input example x. If x has p features, this takes O(p) time.
- If the example is misclassified, we compute the gradient and update the weights, which also takes O(p) time.
So overall, a single iteration takes O(p) time. It doesn't depend on the total number of data points n, because we only update using one randomly selected point per iteration.
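As a rough sanity check of this claim (my own illustration, not part of the original experiments), we can time a score computation and update for increasing feature counts p and watch the cost grow roughly linearly:
import time
import torch

# Time 1000 single-point perceptron-style steps for different feature counts p.
# The per-step work is one dot product plus (sometimes) one vector update, both O(p).
for p_dim in [10, 1_000, 100_000]:
    w = torch.randn(p_dim)
    x = torch.randn(p_dim)
    y_mod = 1.0
    start = time.perf_counter()
    for _ in range(1000):
        s = torch.dot(w, x)       # O(p) multiply-adds
        if s * y_mod < 0:         # constant-time misclassification check
            w = w + y_mod * x     # O(p) update when misclassified
    print(f"p = {p_dim:>7}: {time.perf_counter() - start:.4f} s for 1000 steps")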
Looking at the Loss Evolution graph above, we can also see that the model usually improves quickly (loss drops) and then flattens out. Even if a spike happens, it corrects quickly with just a couple more iterations. This shows that each iteration is efficient and contributes quickly to learning — especially when the data is linearly separable like in our 2D example.
Experiment 2: Perceptron Training on High-Dimensional Data (5 Features)
In this experiment, I train the perceptron algorithm on a dataset with 5 features instead of just 2.
- We generate high-dimensional data using high_dim_data(), which creates 5-dimensional data points (4 random features plus a constant column for bias). The two classes are linearly separable, but harder to visualize.
- We initialize a Perceptron model and an optimizer. As before, in each iteration we:
  - calculate the current loss,
  - randomly select a data point, and
  - apply the perceptron update rule if that point is misclassified.
- We keep updating until the loss reaches zero or we hit the maximum number of iterations.
# Experiment 2: High-Dimensional Data (5 Features)
def high_dim_data(n_points=300, noise=0.2, p_dims=5):
    # Generate data with p_dims-1 features and append constant column.
    y = torch.arange(n_points) >= n_points//2
    X = y[:, None].float() + torch.normal(0.0, noise, size=(n_points, p_dims-1))
    X = torch.cat((X, torch.ones((n_points, 1))), 1)
    y = 2 * y.float() - 1  # convert targets to {-1, 1}
    return X, y

X5, y5 = high_dim_data(n_points=300, noise=0.2, p_dims=5)
p5 = Perceptron()
opt5 = PerceptronOptimizer(p5)
loss_vec_hd = []
n_hd = X5.size(0)

for _ in range(max_iter):
    loss = p5.loss(X5, y5)
    loss_vec_hd.append(loss.item())
    if loss.item() == 0:
        break
    i = torch.randint(n_hd, (1,))
    opt5.step(X5[[i], :], y5[i])
# Plot loss evolution for high-dimensional data:
plt.figure(figsize=(6,4))
plt.plot(loss_vec_hd, marker='o', color='teal')
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Loss Evolution (High-Dimensional Data)")
plt.show()
Since the data lives in more than 2 dimensions, we cannot directly plot the decision boundary, but we can still visualize the loss over time to see how the model is learning. The graph above shows that the perceptron is still able to learn from high-dimensional data. The loss decreases quickly, though it fluctuates a bit more than in the 2D case. This makes sense, because higher dimensions can make it harder to find a good separating boundary.
Runtime Complexity of a Single Perceptron Iteration (High-Dimensional Case)
Even though the data has more features now, the complexity of each iteration stays the same:
- We compute the dot product w · x, where x has p features; this takes O(p) time.
- If the point is misclassified, the gradient computation and weight update also take O(p) time.
So a single update step still runs in O(p) time. The number of data points n does not affect the time for each step because we're only using one data point per iteration.
From the Loss Evolution graph above, we can see that perceptron learns effectively. The drops in loss happen fast, confirming that even in high dimensions, each iteration helps the model improve quickly — as long as the data is linearly separable.
Experiment 3: Perceptron on 2D Data That Is Not Linearly Separable
# Generate nonseparable 2D data by increasing noise.
X_ns, y_ns = perceptron_data(n_points=300, noise=0.6)
p_ns = Perceptron()
opt_ns = PerceptronOptimizer(p_ns)
loss_vec = []
n = X_ns.size(0)
max_iter = 1000

# Training loop: update on a random point each iteration.
for _ in range(max_iter):
    loss = p_ns.loss(X_ns, y_ns).item()
    loss_vec.append(loss)
    # If by chance loss becomes 0, we could break, but for nonseparable data it won't.
    i = torch.randint(n, (1,))
    opt_ns.step(X_ns[[i], :], y_ns[i])
# Plot the data and the final decision boundary.
w = p_ns.w  # final weight vector [w1, w2, bias]
x_vals = torch.linspace(X_ns[:,0].min()-0.5, X_ns[:,0].max()+0.5, 100)
y_vals = -(w[0]*x_vals + w[2]) / w[1]  # decision boundary: w1*x + w2*y + bias = 0

plt.figure(figsize=(6,6))
ax = plt.gca()
plot_perceptron_data(X_ns, y_ns, ax)
ax.plot(x_vals, y_vals, linewidth=2)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_title("Final Decision Boundary (Nonseparable Data)")
plt.show()
# Plot evolution of loss.
plt.figure(figsize=(6,4))
plt.plot(loss_vec, marker='o', markersize=3, color='slategrey')
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Loss Evolution (Nonseparable Data)")
plt.show()
In the experiment above, I tested the perceptron algorithm on 2D data that is not linearly separable by increasing the noise when generating data with perceptron_data(). We use a noise level of 0.6, which causes the two classes to overlap significantly. This makes it impossible to draw a perfect straight line that separates all the points. We then train the perceptron for up to 1000 iterations. Like before, we calculate the loss at each step and perform updates on a randomly chosen data point.
After training, we plot the loss over time and the final decision boundary. Since the data is not linearly separable, the loss never reaches zero. The model keeps fluctuating as it tries to fit the noisy data.
The plot of the final decision boundary shows that the perceptron still finds a line that roughly splits the two classes, but some red and blue points remain on the wrong side. The loss curve also shows this: it keeps bouncing up and down and never settles at zero like it did in the previous experiments. This shows a key limitation of the basic perceptron: it only works well when the data is linearly separable.
Implementing Minibatch Perceptron
Walkthrough of the grad() function in the minibatch implementation
The grad() function in my minibatch implementation is responsible for computing the weight updates for multiple examples at once (a minibatch). Here is what I did:
First, I made sure the labels (y) were in the set {-1, 1}. If they weren't, I converted them using y_mod = torch.where((y == 1) | (y == -1), y, 2 * y - 1).
Then, I calculated the scores (s) for the entire minibatch simultaneously by multiplying the feature matrix (X) by the weight vector (w): s = X @ self.w.
After that, I checked which data points were misclassified by seeing where s * y_mod was less than 0. If there were any misclassified points, I computed the gradient by averaging the updates from all misclassified points: grad = -(y_mod[misclassified, None] * X[misclassified]).mean(dim=0).
If all points were classified correctly, I simply returned a zero vector (since no update would be needed).
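Putting these steps together, here is a minimal sketch of the minibatch gradient logic (written as a hypothetical standalone helper with the weight vector passed in explicitly; it sketches the steps above rather than reproducing the exact code in perceptron.py):
import torch

def minibatch_grad(w, X, y):
    # Ensure labels are in {-1, 1}; convert {0, 1} labels if needed.
    y_mod = torch.where((y == 1) | (y == -1), y, 2 * y - 1)
    # Scores for the whole minibatch at once, shape (k,).
    s = X @ w
    # Boolean mask of misclassified points: score disagrees with the label.
    misclassified = (s * y_mod) < 0
    if misclassified.any():
        # Average the single-point updates over all misclassified points.
        return -(y_mod[misclassified, None] * X[misclassified]).mean(dim=0)
    # Everything classified correctly: no update needed.
    return torch.zeros_like(w)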
This matches the minibatch perceptron update rule: if multiple points are wrong, we average their updates together to adjust our weights efficiently. Let’s now perform some experiments with different batch sizes to check how this implementation performs.
In the code below, I implemented the Minibatch Perceptron algorithm to examine how varying the batch size (k) impacts the training process and convergence behavior.
The function minibatch(batch_size) performs training on randomly selected subsets (minibatches) of data points. Each minibatch is used to update the perceptron weights according to the perceptron learning rule.
Within the training loop, the loss is calculated after every update and stored in loss_history. If the loss reaches zero (which would indicate perfect classification), the training stops early.
We then visualize the evolution of the loss, showing how quickly and effectively the perceptron learns for each specified minibatch size.
I will be calling this function with various batch sizes (k=1, k=10, k=n) to show how the perceptron's convergence differs across these scenarios:
import torch
from perceptron import Perceptron, PerceptronOptimizer, perceptron_data
import matplotlib.pyplot as plt
# Generate data
X, y = perceptron_data(n_points=300, noise=0.2)

# Create a perceptron and an optimizer
model = Perceptron()
optimizer = PerceptronOptimizer(model)
max_iter = 1000
alpha = 0.1  # Learning rate

def minibatch(batch_size):
    # Loss tracking
    loss_history = []

    for _ in range(max_iter):
        indices = torch.randperm(X.size(0))[:batch_size]  # random minibatch selection
        X_batch, y_batch = X[indices], y[indices]

        loss = optimizer.step(X_batch, y_batch, alpha)
        # adding loss to history
        loss_history.append(loss.item())

        # we stop if loss reaches zero (perfect separation)
        if loss.item() == 0:
            break

    # Plot loss evolution
    plt.figure(figsize=(6, 4))
    plt.plot(loss_history, marker='o')
    plt.xlabel("Iteration")
    plt.ylabel("Loss")
    plt.title(f"Loss Evolution (Minibatch size: {batch_size})")
    plt.show()
Experiment: Minibatch Size = 1
In this experiment, I set the minibatch size to 1, which is the same as the standard perceptron update rule (one point at a time). As shown in the plot, the model converges quickly: it only takes two iterations for the loss to drop from 1.0 to 0.0. This is because the data is linearly separable, so each misclassified point directly contributes to updating the weights. It confirms that this minibatch implementation works just like the regular perceptron when k = 1.
minibatch(1)  # when batch size is 1
Experiment: Minibatch Size = 10
In this experiment, I set the minibatch size to 10. The perceptron model updates its weights using 10 data points at a time instead of just one. As shown in the graph, the loss starts high and fluctuates for the first few steps, but overall it keeps going down. Around the 10th iteration, the model reaches zero loss, which means it has learned to classify the data correctly. This shows that using a small batch size still works well and the model is able to find a good separating line.
minibatch(10)  # batch size 10
Experiment: Minibatch Size = n
In this experiment, I used a minibatch size of 300 because we have 300 points, which means the perceptron updates its weights using the entire dataset at each step. As shown in the plot, the loss starts off high and stays flat for a few steps, but then it begins to decrease smoothly. By the 12th iteration, the loss reaches almost zero. This shows that even when using the whole dataset at once, the perceptron is still able to learn and improve — just a bit more gradually. The updates are more stable and less noisy compared to smaller batch sizes.
minibatch(300)  # batch size 300, which is the number of points
Runtime Complexity of Minibatch implementation
In my minibatch perceptron implementation, each iteration takes time proportional to how many points are in the batch and how many features each point has. If the batch has k points and each point has p features, then the runtime is O(kp). This means that the more features there are, or the bigger the batch is, the longer each update takes.
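To make the O(kp) cost concrete, here is a small illustration (my own sketch, using hypothetical shapes rather than values from the experiments above): the dominant work per step is one (k, p) matrix-vector product, plus at most another pass over the misclassified rows.
import torch

k, p = 10, 5                                        # hypothetical batch size and feature count
X_batch = torch.randn(k, p)                         # k examples, p features each
w = torch.randn(p)                                  # weight vector
y_mod = 2 * torch.randint(0, 2, (k,)).float() - 1   # placeholder labels in {-1, 1}

s = X_batch @ w                                     # k*p multiply-adds -> O(kp)
misclassified = (s * y_mod) < 0                     # k comparisons -> O(k)
if misclassified.any():
    # At most another k*p operations to form and average the updates -> O(kp).
    grad = -(y_mod[misclassified, None] * X_batch[misclassified]).mean(dim=0)
else:
    grad = torch.zeros(p)
print(grad.shape)  # torch.Size([5]): one entry per feature, regardless of k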
Conclusion
In this blog post, I learned to build and implement the Perceptron algorithm using PyTorch. I started with linearly separable data and showed that the perceptron can successfully learn a straight-line boundary that separates the two classes. Then, I tried the algorithm on higher-dimensional data and saw that it still learns well. Finally, I tested it on noisy, nonseparable data and noticed that the perceptron could not find a perfect boundary, and the loss never reached zero.
These experiments show that the perceptron works best when the data is linearly separable. It’s fast and simple, but has limitations when data is noisy or can’t be split with a straight line.
I also learned how the perceptron behaves under different minibatch sizes. With small batches (like 1 or 10), the model updates quickly but sometimes with more noise. Using the entire dataset as a batch (size 300) made the updates smoother and more stable, though slightly slower. Overall, these experiments helped me understand how minibatch size affects convergence and learning behavior in perceptron training.