# Playing with PyTorch and Datasets

PyTorch is the cool guy/girl in town. In this blog post I want to give you a brief overview of what I think is really interesting about it. PyTorch is easy to use, and you can implement neural networks with it very quickly. See the original blog post here.

**The main objective:** quickly show PyTorch and the PyTorch Dataset class, and how to use some of their cool features. Note that I am going to ignore overfitting and related problems here. This is something in between a tutorial and a simple blog post.

I hope you have neural network knowledge and know how to do stuff with numpy :).

# First, let’s make some data

We will create some data using the **sklearn.datasets** module. I have chosen the moons dataset because it’s binary and nonlinear.

    from sklearn.datasets import make_moons
    import matplotlib.pyplot as plt

    # 200 points in two interleaving half circles; labels are 0 or 1
    X, y = make_moons(n_samples=200, shuffle=True, noise=None, random_state=None)

    colors = {0: "red", 1: "blue"}

    # plot each point in the color of its class
    for color_index, (point_x, point_y) in zip(y, X):
        plt.plot(point_x, point_y, color=colors[color_index], linestyle='dashed', marker='o', markersize=5)

    plt.show()

# And then there were Datasets

Dataset is one of the most lovely classes inside the PyTorch framework. It is so simple yet so helpful. You could definitely start training a network without using this class, but it allows us to abstract some components and makes the flow easier to understand.

Datasets allow you to define a *dataset* (surprise) that contains your data. They come with cool features related to memory usage (items are fetched one at a time through **__getitem__**, so you can load them lazily) and, paired with a DataLoader, they can automatically split your data into batches.

This is the structure of a simple dataset object.

    from torch.utils.data import Dataset

    class DatasetSkeleton(Dataset):
        def __init__(self):
            pass

        def __len__(self):
            pass

        def __getitem__(self, idx):
            pass

Simple skeleton of a PyTorch Dataset object.

It should be pretty simple to understand. The **__init__** method is usually used to get the data into the object (e.g., numpy data). In general, you will pass both input and target (i.e., the points and the labels) and save them in two different variables. You also have to provide methods to define the length of your dataset and to extract one item given an index.

Building a specific class for our experiment is pretty easy: we just need to pass in the data we have.

    class MoonDataset(Dataset):
        def __init__(self, X, labels):
            self.X = X
            self.labels = labels

        def __len__(self):
            return len(self.X)

        def __getitem__(self, idx):
            return self.X[idx], self.labels[idx]

Our **MoonDataset**. Note that we keep two separate variables, one for the inputs and one for the labels, and that the **__getitem__** method just uses index access to return a specific element.

Ok! Now we are ready to split the data and create our datasets (one for training and one for testing). We are also going to use the DataLoader class, which wraps our Dataset in an iterable that yields the data in batches.
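
To see what those two methods buy us, here is a minimal sketch that builds a tiny dataset by hand and queries it. The class is repeated so the snippet runs on its own, and the three toy points (`toy_X`, `toy_y`) are made up purely for illustration.

```python
import numpy as np
from torch.utils.data import Dataset


class MoonDataset(Dataset):
    """Same class as above, repeated so this snippet is self-contained."""

    def __init__(self, X, labels):
        self.X = X
        self.labels = labels

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.labels[idx]


# three made-up 2D points with their labels (illustrative values only)
toy_X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.5]])
toy_y = np.array([0, 1, 1])

toy_dataset = MoonDataset(toy_X, toy_y)

n_items = len(toy_dataset)     # calls __len__
point, label = toy_dataset[1]  # calls __getitem__(1)
print(n_items, point, label)
```

So `len()` and square-bracket indexing are all it takes to make an object behave like a dataset.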

    from sklearn.model_selection import train_test_split
    from torch.utils.data import DataLoader

    # hold out 33% of the points for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    moon_data_training = MoonDataset(X_train, y_train)
    moon_data_testing = MoonDataset(X_test, y_test)

    test_loader = DataLoader(moon_data_testing, batch_size=25)
    train_loader = DataLoader(moon_data_training, batch_size=25)
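
If you are curious about what a DataLoader actually hands back, the following sketch iterates over one and records the batch sizes. `ArrayDataset` is a hypothetical stand-in for our MoonDataset, and the 60 random points are made up, so take it as an illustration rather than part of the experiment.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Hypothetical stand-in for MoonDataset so the snippet runs on its own."""

    def __init__(self, X, labels):
        self.X = X
        self.labels = labels

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.labels[idx]


# 60 made-up points (values are random and only illustrative)
rng = np.random.default_rng(0)
X_demo = rng.random((60, 2))
y_demo = rng.integers(0, 2, size=60)

demo_loader = DataLoader(ArrayDataset(X_demo, y_demo), batch_size=25)

batch_sizes = []
for X_batch, y_batch in demo_loader:
    # the loader stacks the numpy items into tensors for us
    batch_sizes.append(len(y_batch))

print(batch_sizes)    # the last batch holds whatever is left over
print(X_batch.dtype)  # numpy float64 data arrives as double tensors
```

The double-precision batches are the reason the model is switched to double values later on.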

# Simple Logistic Regression

Let’s now use the dataset we just created to train a simple logistic regression with PyTorch. A binary logistic regression can be implemented in PyTorch using a simple linear layer and a sigmoid function.

    from torch import nn
    import torch

    class Logistic(nn.Module):
        def __init__(self, size):
            super(Logistic, self).__init__()
            self.linear = nn.Linear(size, 1)  # linear layer

        def forward(self, x):
            out = self.linear(x)  # apply linear layer
            return torch.sigmoid(out)  # sigmoid activation function (F.sigmoid is deprecated)

You can see that it is pretty easy to create a logistic regressor in PyTorch. In the **__init__** method you declare the neural network components you are going to use: since we are doing logistic regression, we just need a linear layer. The **forward** method, on the other hand, defines how the data passes from the input layer to the output layer. We just need it to go straight into the linear layer and then through a sigmoid function to get a value between 0 and 1.
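
To convince yourself that there is no magic in the forward pass, this small sketch compares the layer-based computation with the formula written out by hand. The seed and the two input points are arbitrary choices of mine.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

linear = nn.Linear(2, 1)        # same layer shape as Logistic(2)
x = torch.tensor([[0.5, -1.0],
                  [2.0, 0.3]])  # two made-up input points

with torch.no_grad():
    from_layer = torch.sigmoid(linear(x))
    # the same computation written out: sigmoid(x @ W^T + b)
    by_hand = 1 / (1 + torch.exp(-(x @ linear.weight.T + linear.bias)))

print(from_layer.squeeze())
print(by_hand.squeeze())
```

Both lines should print the same two probabilities, up to floating point noise.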

Next, let’s declare our model, loss and the optimizer.

    model = Logistic(2)  # let's make a 2 dimensional logistic regressor
    criterion = nn.BCELoss()  # we are going to use binary cross entropy
    model.double()  # use double values for the model
    learning_rate = 0.01  # learning rate of the optimizer
    optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
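
If binary cross entropy is a bit rusty, here is a quick sketch that checks `nn.BCELoss` against the textbook formula. The probabilities and labels are made-up numbers.

```python
import torch
from torch import nn

# made-up predicted probabilities and true labels
probs = torch.tensor([0.9, 0.2, 0.6], dtype=torch.float64)
labels = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float64)

loss_torch = nn.BCELoss()(probs, labels)

# binary cross entropy by hand: mean of -[y*log(p) + (1-y)*log(1-p)]
loss_manual = -(labels * torch.log(probs)
                + (1 - labels) * torch.log(1 - probs)).mean()

print(loss_torch.item(), loss_manual.item())
```

Note that the loss expects probabilities, which is why our model ends with a sigmoid.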

# Let’s Train and Test!

Since I’m using a linear model, I’m not expecting perfect accuracy on this task: as we said in the first part of the blog post, our data is nonlinear. Here’s the training loop in PyTorch. Remember to zero out the gradients before each step and to call backward to backpropagate.

    for epoch in range(50):  # 50 epochs
        # the loader yields the inputs and the labels, already grouped in batches
        for X_batch, y_batch in train_loader:
            y_batch = y_batch.double()
            optimizer.zero_grad()  # zero out the gradients

            outputs = model(X_batch)  # use the model
            outputs = outputs.squeeze()  # squeeze the output

            loss = criterion(outputs, y_batch)  # compute the loss
            loss.backward()  # backpropagate
            optimizer.step()  # update the values
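
Why bother zeroing the gradients? Because PyTorch adds each new gradient to the one already stored. The toy sketch below (a single made-up parameter `w`, no model involved) shows the accumulation.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# d/dw of (3 * w) is 3, so each backward() call contributes 3
(3 * w).sum().backward()
first = w.grad.clone()

(3 * w).sum().backward()  # no zeroing: the new gradient is added
accumulated = w.grad.clone()

w.grad.zero_()            # reset, the job optimizer.zero_grad() does for every parameter
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

Skip the reset in a training loop and every step would use the sum of all the gradients seen so far.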

Accuracy is easy to compute.

    import numpy as np

    correct = 0
    total = 0

    with torch.no_grad():  # since we are in test-phase we do not need to compute the gradients
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            outputs = outputs.squeeze()
            # we have to round the values to either 0 or 1 to compute the accuracy score
            outputs = np.round(outputs.numpy())
            total += len(y_batch)
            correct += (outputs == y_batch.numpy()).sum()

    accuracy = 100 * correct / total
    print(accuracy)

Compute accuracy for our network

You should get something like 83% accuracy with this method.
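
The same accuracy computation can also stay entirely in PyTorch. This is just an alternative sketch on made-up outputs and labels, not what the code above does.

```python
import torch

# made-up sigmoid outputs and true labels for one batch
outputs = torch.tensor([0.91, 0.12, 0.55, 0.40], dtype=torch.float64)
labels = torch.tensor([1, 0, 0, 0])

predictions = (outputs > 0.5).long()            # threshold at 0.5
correct = (predictions == labels).sum().item()  # count the matches
accuracy = 100 * correct / len(labels)

print(accuracy)
```

Here three predictions out of four match, so the accuracy is 75%.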

Plotting the **decision boundary** for a linear model is not that difficult, but the simple approach will not let us plot non-linear boundaries. Non-linear boundaries are difficult to plot because they are… non-linear. So I’m using a really useful snippet of code to do this that, unfortunately, would take more writing to explain. See [1] for more details on how this works.

    for color_index, (point_x, point_y) in zip(y, X):
        plt.plot(point_x, point_y, color=colors[color_index],
                 linestyle='dashed', marker='o', markersize=5)

    h = .02  # step size of the grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    with torch.no_grad():
        Z = model(torch.DoubleTensor(np.c_[xx.ravel(), yy.ravel()]))

    Z = np.array(list(map(round, Z.squeeze().detach().numpy()))).reshape(xx.shape)
    plt.contour(xx, yy, Z, cmap=plt.cm.binary)
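
The core trick in that snippet is to evaluate the model on every cell of a grid and then fold the predictions back into the grid shape. Here is a minimal sketch on a tiny 3 x 2 grid, with a made-up rule (`x >= 1`) standing in for the trained model.

```python
import numpy as np

# a tiny 3 x 2 grid instead of the fine one used for the plot
xx, yy = np.meshgrid(np.arange(0, 3, 1),  # x values: 0, 1, 2
                     np.arange(0, 2, 1))  # y values: 0, 1

# flatten both grids and pair them up: one (x, y) row per grid cell
grid_points = np.c_[xx.ravel(), yy.ravel()]

# stand-in for the model: predict class 1 when x >= 1 (made-up rule)
fake_predictions = (grid_points[:, 0] >= 1).astype(int)

# back to the grid shape so plt.contour can draw the boundary
Z = fake_predictions.reshape(xx.shape)

print(grid_points)
print(Z)
```

The contour plot then draws a line wherever `Z` switches from 0 to 1, whatever shape that switch takes.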

# Something with more layers

What would happen if we used something with **more layers**? Well, the idea is to capture the nonlinear aspect of the data we are using. I have implemented a slightly bigger network here.

    from torch import nn
    import torch

    class BiggerNetwork(nn.Module):
        def __init__(self, size):
            super(BiggerNetwork, self).__init__()
            self.linear = nn.Linear(size, 10)
            self.linear2 = nn.Linear(10, 5)
            self.linear3 = nn.Linear(5, 1)

        def forward(self, x):
            out = self.linear(x)
            out = torch.tanh(out)  # nonlinearity (F.tanh is deprecated)
            out = self.linear2(out)
            out = torch.tanh(out)
            out = self.linear3(out)
            return torch.sigmoid(out)

You can see I’ve added a few **nonlinearities** between the layers. When you train the model and test its accuracy, you should get 100% correct answers (beware, overfitting). However, the cool thing is that we are now able to see the nonlinear boundary in the next picture (using the same plotting code).
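
Why do the nonlinearities matter so much? Without them, stacked linear layers collapse into a single linear layer, so the boundary would stay a straight line. The sketch below checks this on two small layers; the layer sizes, seed and inputs are arbitrary choices of mine.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

first = nn.Linear(2, 10)
second = nn.Linear(10, 1)
x = torch.randn(4, 2)  # four made-up input points

with torch.no_grad():
    stacked = second(first(x))  # two linear layers, no activation

    # the equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
    W = second.weight @ first.weight
    b = second.weight @ first.bias + second.bias
    collapsed = x @ W.T + b

    with_tanh = second(torch.tanh(first(x)))  # tanh breaks the collapse

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(torch.allclose(stacked, with_tanh, atol=1e-5))
```

The first comparison should agree and the second should not, which is exactly what lets the bigger network bend its boundary around the moons.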