Playing with PyTorch and Datasets

Federico Bianchi
Feb 24, 2020 · 6 min read


PyTorch is the cool guy/girl in town. In this blog post I want to give you a brief overview of what I think is really interesting about it. PyTorch is easy to use, and it lets you implement neural networks very quickly.

The main objective is to give a quick tour of PyTorch and the PyTorch Dataset class and to show how to use some of their cool features. Note that I am going to ignore overfitting and related problems here. This is something in between a tutorial and a simple blog post.

I assume you have some neural network knowledge and know how to do stuff with numpy :).

First, let’s make some data

We will create some data using the sklearn.datasets module. I have chosen the moons dataset because it is binary and nonlinear.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, shuffle=True, noise=None, random_state=None)

colors = {0: "red", 1: "blue"}

for color_index, (point_x, point_y) in zip(y, X):
    plt.plot(point_x, point_y, color=colors[color_index],
             linestyle='dashed', marker='o', markersize=5)
plt.show()
Moons data we are going to classify

And then there were Datasets

Dataset is one of the most lovely classes inside the PyTorch framework. It is so simple yet so helpful. You could definitely start training a network without using this class, but it allows us to abstract away some components and it will make it easier to understand the flow.

A Dataset lets you define a dataset (surprise) that contains your data. It comes with cool features related to memory usage and also the possibility of automatically defining the batches for your data.

This is the structure of a simple dataset object.

from torch.utils.data import Dataset

class DatasetSkeleton(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass

Simple skeleton of a PyTorch Dataset object.

It should be pretty simple to understand. The __init__ method is usually used to get the data into the object (e.g., numpy data). In general, you will pass both input and target (i.e., the points and the labels) and save them in two different variables. You also have to provide methods to define the length of your dataset and to extract one item given an index.

Building a specific class for our experiment is pretty easy: we just need to pass the data we have.

class MoonDataset(Dataset):
    def __init__(self, X, labels):
        self.X = X
        self.labels = labels

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.labels[idx]

Our MoonDataset. Note that we keep two different variables, since we have both inputs and labels, and that the __getitem__ method just uses index access to return a specific element.
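If you want to check that everything works, you can instantiate the class and index it directly. This is a quick sanity check of my own (not part of the original flow), using the X and y we created above:

moon_data = MoonDataset(X, y)
print(len(moon_data))  # 200, the number of samples
print(moon_data[0])    # (the first point as a 2-element array, its label)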

Ok! Now we are ready to split the data and create our datasets (one for training and one for testing). We are also going to make use of the DataLoader class, which lets us wrap a Dataset in an iterable that yields batches.

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

moon_data_training = MoonDataset(X_train, y_train)
moon_data_testing = MoonDataset(X_test, y_test)

test_loader = DataLoader(moon_data_testing, batch_size=25)
train_loader = DataLoader(moon_data_training, batch_size=25)
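To see what the loaders actually yield, you can peek at a single batch (again, a minimal check I am adding here). Note that the DataLoader automatically converts our numpy arrays into tensors:

X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)  # torch.Size([25, 2]) torch.Size([25])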

Simple Logistic Regression

Let’s now use the dataset we just created to train a simple logistic regression with PyTorch. A binary logistic regression can be implemented in PyTorch using a simple linear layer and a sigmoid function.

from torch import nn
import torch

class Logistic(nn.Module):
    def __init__(self, size):
        super(Logistic, self).__init__()
        self.linear = nn.Linear(size, 1)  # linear layer

    def forward(self, x):
        out = self.linear(x)       # apply the linear layer
        return torch.sigmoid(out)  # sigmoid activation function

You can see that it is pretty easy to create a logistic regressor in PyTorch. In the __init__ method you declare the neural network components you are going to use: since we are doing logistic regression, we just need a linear layer. The forward method, on the other hand, models the way the data flows from the input layer to the output layer: here the input just goes through the linear layer and then through a sigmoid function, which gives us a value in the range [0, 1].

Next, let’s declare our model, loss and the optimizer.

model = Logistic(2) # let's make a 2 dimensional logistic regressor

criterion = nn.BCELoss() # we are going to use binary cross entropy
model.double() # use double values for the model

learning_rate = 0.01 # learning rate of the optimizer
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
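In case you want to convince yourself of what BCELoss computes, here is a tiny hand check with toy numbers of my own: for a prediction p and a target t it computes -(t*log(p) + (1-t)*log(1-p)), averaged over the batch.

predictions = torch.tensor([0.9, 0.2], dtype=torch.double)
targets = torch.tensor([1.0, 0.0], dtype=torch.double)
print(criterion(predictions, targets))  # ~0.164, the mean of -log(0.9) and -log(0.8)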

Let’s Train and Test!

Since I’m using a linear model, I’m not expecting perfect accuracy on this task: as we said in the first part of the blog post, our data is nonlinear. Here’s the training loop in PyTorch. Remember to zero out the gradients and to run backpropagation.

for epoch in range(50):  # 50 epochs

    for X_batch, y_batch in train_loader:  # the loader yields inputs and labels, already grouped in batches
        y_batch = y_batch.double()
        optimizer.zero_grad()        # zero out the gradients
        outputs = model(X_batch)     # run the model
        outputs = outputs.squeeze()  # squeeze the output

        loss = criterion(outputs, y_batch)  # compute the loss

        loss.backward()   # backpropagate
        optimizer.step()  # update the parameters

Accuracy is easy to compute.

import numpy as np

correct = 0
total = 0
with torch.no_grad():  # we are in the test phase, so we do not need to compute gradients

    for X_batch, y_batch in test_loader:

        outputs = model(X_batch)
        outputs = outputs.squeeze()
        # round the values to either 0 or 1 to compute the accuracy score
        outputs = np.array(list(map(round, outputs.detach().numpy())))

        total += len(y_batch)
        correct += (outputs == y_batch.detach().numpy()).sum()

accuracy = 100 * correct / total
print(accuracy)

Compute accuracy for our network

You should get something like 83% accuracy with this method.

Plotting the decision boundary of a linear model is not that difficult, but that approach does not generalize to non-linear boundaries. Non-linear boundaries are difficult to plot because they are… non-linear. So I’m using a snippet of code that is really useful for this but that, unfortunately, would require more writing to explain. See [1] for more details on how it works.

for color_index, (point_x, point_y) in zip(y, X):
    plt.plot(point_x, point_y, color=colors[color_index],
             linestyle='dashed', marker='o', markersize=5)

h = .02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

with torch.no_grad():
    Z = model(torch.DoubleTensor(np.c_[xx.ravel(), yy.ravel()]))

Z = np.array(list(map(round, Z.squeeze().detach().numpy()))).reshape(xx.shape)
plt.contour(xx, yy, Z, cmap=plt.cm.binary)
plt.show()

Something with more layers

What would happen if we used something with more layers? Well, the idea is to capture the nonlinear aspect of the data we are using. I have implemented a slightly bigger network here.

from torch import nn
import torch

class BiggerNetwork(nn.Module):
    def __init__(self, size):
        super(BiggerNetwork, self).__init__()
        self.linear = nn.Linear(size, 10)
        self.linear2 = nn.Linear(10, 5)
        self.linear3 = nn.Linear(5, 1)

    def forward(self, x):
        out = self.linear(x)
        out = torch.tanh(out)
        out = self.linear2(out)
        out = torch.tanh(out)
        out = self.linear3(out)
        return torch.sigmoid(out)

You see I’ve added a few nonlinearities between the layers. When you train this model and test its accuracy, you will get 100% correct answers (beware: overfitting). The cool thing is that, using the same plotting code as before, we can now see the nonlinear decision boundary.
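If you want to reproduce this, here is a sketch of what I did: reuse exactly the same training and evaluation loops from above, just swapping in the new model (same hyperparameters as before).

model = BiggerNetwork(2)  # still a 2-dimensional input
model.double()

criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

# ...then run the same 50-epoch training loop and the same accuracy check as before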

References

[1] https://stackoverflow.com/questions/22294241/plotting-a-decision-boundary-separating-2-classes-using-matplotlibs-pyplot
