PyTorch binary classification example

Contents

The dataset
Before we start – where is the code?
The packages
Step one – Building our first PyTorch component – DataLoaders
Creating the data loader to pull in CSV files
Magic method (dunder) in our class
Building our custom method to split our data into train and test splits
Step two – defining our multi-layer perceptron ANN
A wee bit of theory
Implementing the layers in our network
Create the forward passing mechanism
Step three – the training loop
Step four – evaluating the performance of our network
Step five – creating the prediction routine
Step six – prepare the data to use with our model
Using our custom classes to train the model
Loading the dataset
Training the model
Evaluating how well the model performs with the test data
Make a prediction against the model
Running the model against balanced dataset
Prepare and train
Evaluate the model
Predict with the model
That is that! You have done excellent!

This article was first published on Python – Hutsons-hacks, and kindly contributed to python-bloggers.

This tutorial assumes you know how to programme in Python and know a little about n-dimensional arrays and how to work with them in numpy (don’t worry if you don’t, I’ve got you covered).

PyTorch is a Pythonic way of building deep learning neural networks from scratch. This is something I have been learning over the last 2 years, as historically my go-to deep learning framework was TensorFlow.

For beginners to PyTorch it can be daunting to first work with the framework, as it pushes you towards Python classes, inheritance, and tensor and array programming. However, once you start to work with it you start to appreciate the power of PyTorch and how much control it gives you over the creation process of deep neural networks. I have since used it for Natural Language Processing, Computer Vision, Transformers and audio classification. In this tutorial we will build an MLP from the ground up and I will teach you what each step of my network is doing.

If you are ready – then let’s dive in! Open your mind and prepare to explore the wonderful and strange world of PyTorch.

The dataset

With this tutorial we will use a dataset from the MLDataR project on classifying whether a person has thyroid disease or not. This uses a number of variables to indicate if thyroid disease is present.

Before we start – where is the code?

There are two flavours:

Python script: https://github.com/StatsGary/PyTorch_Tutorials/blob/main/01_MLP_Thyroid_Classifier/PyTorch_Binary_From_Scratch.py
Jupyter Notebook: https://github.com/StatsGary/PyTorch_Tutorials/blob/main/01_MLP_Thyroid_Classifier/PyTorch_Binary_From_Scratch.ipynb
To clone the repo: git clone https://github.com/StatsGary/PyTorch_Tutorials.git.

The packages

We are going to need a number of different functions and packages. The first stage is to get all these imported. If they are not installed, do not concern yourself, as I have a requirements.txt file on hand to help you in the supporting GitHub repository:

from numpy import vstack
from pandas import read_csv
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, average_precision_score
from sklearn.metrics import confusion_matrix, recall_score, f1_score
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
import torch
from torch import Tensor
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.optim import lr_scheduler
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_
import time
import copy

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Here we are using numpy (for array processing), pandas (for working with data frames and series), sklearn for encoding of labels and building up model metrics, torch utilities for working with the input data, torch tensors and other elements of our MLP stack and the time module for timing how long our training loops take.

The data

The data we are going to be using for this PyTorch tutorial has already been preprocessed, and I have stripped off the header row so the file contains only values.

This is linked to the thyroid data discussed and contains:

The outcome variable is thyroid_class, where 1=sick and 0=well
The rest of the columns relate to measurements associated with thyroid disorders and are explained in the documentation.

The link to the dataset is: https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv.

The first stage of the process is to take the data and create a PyTorch readable data object. These are called data loaders and tell PyTorch how to work with the data. This will be defined in the next steps, but to read more about data loaders, see the official tutorial site: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html.

Step one – Building our first PyTorch component – DataLoaders

This is where things get interesting, and we will go chunk by chunk through what is happening under the hood.

Creating the data loader to pull in CSV files

Firstly we need to create a dataset class that inherits from Dataset – this is the PyTorch base class for working with various types of data. Because we have tabular data, we will need to declare a reader to read in the file from the link above (the raw data stored on GitHub) and then we will do some conversions:

class ThyroidCSVDataset(Dataset):
    #Constructor for initially loading
    def __init__(self,path):
        df = read_csv(path, header=None)
        # Store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1] # Assuming your outcome variable is in the last column
        self.X = self.X.astype('float32')
        # Label encode the target as values 1 and 0 or sick and not sick
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

So there are a number of things to explain here:

We have declared a class which will serve as our blueprint for any other tabular data we may want to feed in. You could plug and play this class into your own project and it would work. The class inherits from Dataset, which, as I have already explained, is PyTorch’s base class for representing data.
Next we declare a function using def __init__(). This is the constructor: it runs when the class is initialised, and its arguments are what the class needs at that point. Within the parentheses we declare two parameters – self and path (the location of the file).
What is this self parameter? Self represents the instance of the class. By using “self” we can access the attributes and methods of the class. Essentially, without this you would not be able to access the variables that belong to a class.
Within the init function we do the following:
- Use pandas to read the csv at the location passed in as path
- Set our independent variables to the variable X, but unlike in normal functions, you need to make sure you use self.X. This assumes that the predictors sit in every column but the last one. The slicing notation df.values[:, :-1] says take all my rows (:) and every column up to, but not including, the last one (:-1).
- The same is done for the outcome variable, but this time we select just the last column with [:, -1]
- Next we convert the X values to 32-bit floating point values
- We then turn our outcome labels into numerical representations using the LabelEncoder() class from sklearn. LabelEncoder numbers the classes in sorted order, so with the labels negative and sick, negative becomes 0 and sick becomes 1. This is handy for quickly converting your target variable, as the network needs numbers rather than category names. Finally, we reshape the labels into a single column so they line up with the model output once they are converted to a tensor.
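
To see the slicing and the label encoding in isolation, here is a minimal sketch on a made-up three-row table. The values and the sick/negative labels are hypothetical and are not read from the thyroid file:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy table: two feature columns, outcome label in the last column
values = np.array([
    [0.5, 1.2, "sick"],
    [0.1, 0.7, "negative"],
    [0.9, 1.5, "sick"],
], dtype=object)

X = values[:, :-1].astype("float32")   # all rows, every column except the last
y = values[:, -1]                      # all rows, last column only

encoder = LabelEncoder()
y = encoder.fit_transform(y)           # classes are sorted: negative -> 0, sick -> 1
y = y.astype("float32").reshape((len(y), 1))

print(X.shape, y.shape)                # (3, 2) (3, 1)
print(list(encoder.classes_))          # ['negative', 'sick']
print(y.ravel().tolist())              # [1.0, 0.0, 1.0]
```

The reshape at the end is what turns the flat list of labels into one column per row, the same shape the model will later produce.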

Magic method (dunder) in our class

Still within the class, we build two more class functions. These are what are known as dunder (magic) methods, which Python calls behind the scenes for built-in operations such as len() and indexing:

    # (__init__ as defined above)
    # Get the number of rows in the dataset
    def __len__(self):
        return len(self.X)
    # Get a row at an index
    def __getitem__(self,idx):
        return [self.X[idx], self.y[idx]]

Here we have overridden the length method to provide the number of rows in our training set and we have used the __getitem__ method to pull the relevant index position (idx) of the relevant item. This will allow us to slice and retrieve elements.
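
A quick way to convince yourself what these two methods do is to try them on a tiny class. The sketch below uses a hypothetical TinyDataset with three hard-coded rows, standing in for the real dataset class:

```python
class TinyDataset:
    """Hypothetical stand-in for ThyroidCSVDataset that holds three rows in memory."""

    def __init__(self):
        self.X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
        self.y = [[0.0], [1.0], [0.0]]

    def __len__(self):
        # Called by len(dataset)
        return len(self.X)

    def __getitem__(self, idx):
        # Called by dataset[idx]
        return [self.X[idx], self.y[idx]]


dataset = TinyDataset()
print(len(dataset))   # 3
print(dataset[1])     # [[0.3, 0.4], [1.0]]
```

PyTorch’s data loaders rely on exactly these two calls to know how many rows there are and to fetch each one.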

Building our custom method to split our data into train and test splits

From looking at this code, did you notice that when we declare self in the function parameter block we always declare it first? Also, a tip: it does not have to be called self; you can call it myself, gary or colin, so long as it is consistent. However, it is standard practice to use the self naming convention.

    def split_data(self, split_ratio=0.2):
        test_size = round(split_ratio * len(self.X))
        train_size = len(self.X) - test_size
        return random_split(self, [train_size, test_size])

People familiar with scikit-learn will be aware of the train_test_split function. Here I have replicated similar functionality. This is what the function does:

We have one parameter in this function, split_ratio, which indicates, as a proportion of the input data, how much data gets reserved for the testing sample. This is the sample we will perform model evaluation on.
The remainder becomes the training set. Say we chose to reserve 20% for testing, the other 80% of the observations will be used for model training – the bit where the model picks out the patterns in the data based on the labelling (this is called supervised learning)
It returns a random split of the data, so each time you run this script you will have varying results, due to the random (let’s use the fancy word stochastic) nature of the splits.
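
As a rough illustration of the size arithmetic, the sketch below splits a hypothetical list of 10 rows. The seeded generator is my own optional addition, not part of the class above; it is one way to make the split repeatable between runs:

```python
import torch
from torch.utils.data import random_split

rows = list(range(10))                       # hypothetical dataset of 10 rows
split_ratio = 0.2
test_size = round(split_ratio * len(rows))   # 2
train_size = len(rows) - test_size           # 8

# Optional extra (an assumption, not in the tutorial code): a seeded generator
generator = torch.Generator().manual_seed(42)
train, test = random_split(rows, [train_size, test_size], generator=generator)

print(len(train), len(test))                 # 8 2
```

Every row ends up in exactly one of the two subsets, so nothing is lost or duplicated by the split.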

If you made it this far then well done – we have covered quite a bit of ground already. In saying that, we have only really told PyTorch how to read the data. What we do next is tell the network how we want to model the data.

The full implementation of the class we just defined is below:

# Create a custom CSVDataset loader
class ThyroidCSVDataset(Dataset):
    #Constructor for initially loading
    def __init__(self,path):
        df = read_csv(path, header=None)
        # Store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1] # Assuming your outcome variable is in the last column
        self.X = self.X.astype('float32')
        # Label encode the target as values 1 and 0 or sick and not sick
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

    # Get the number of rows in the dataset
    def __len__(self):
        return len(self.X)
    # Get a row at an index
    def __getitem__(self,idx):
        return [self.X[idx], self.y[idx]]

    # Create custom class method - instead of dunder methods
    def split_data(self, split_ratio=0.2):
        test_size = round(split_ratio * len(self.X))
        train_size = len(self.X) - test_size
        return random_split(self, [train_size, test_size])

Step two – defining our multi-layer perceptron ANN

Still with me? If you are then thanks, if not come back after a hot coffee, or tea (so British of me!).

A wee bit of theory

We are going to define a multilayer perceptron class, and we will implement a forward pass through its layers. We call this a feed forward structure.

The idea is that we feed in our inputs – in this case our independent variables – and pass them through what are known as hidden layers. These layers pass values between them and can start to work out patterns in the data that, as mere mortals, we would struggle to analyse. The more layers we add, the longer the network takes to train due to the increased complexity and number of connections. Each connection has a weight, and the weights are updated through something called back propagation; one full pass over the training data is called an epoch. How big an update is made to the weights at each step is controlled by the learning rate, and the loss is minimised a little more with each pass through the network. How the nodes respond is controlled by an activation function, which decides how strongly each node fires given its inputs. This kind of network is loosely inspired by how neurons in the brain learn.
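
To make the theory a little more concrete, here is a minimal sketch of one forward pass and one weight update for a single neuron in plain Python. The inputs, weights and learning rate are made-up numbers, and a real network repeats this across many neurons and rows:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical single neuron: two inputs, two weights, one bias
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
y_true = 1.0
lr = 0.1  # learning rate: how big a step each weight update takes

# Forward pass: weighted sum followed by the activation function
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = sigmoid(z)

# Binary cross entropy loss for this single example
loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Backward pass: for sigmoid + binary cross entropy, dLoss/dz = p - y_true
grad_z = p - y_true
w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]
b = b - lr * grad_z

# The same input now gets a prediction closer to the true label
p_after = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
print(round(p, 4), round(loss, 4), round(p_after, 4))
```

The prediction moves towards the true label after one update; PyTorch automates exactly this bookkeeping for every weight in the network.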

I have reached the end of the theory part of this blog. Let’s get on with implementing it.

Implementing the layers in our network

Firstly we will declare our class and specify the number of inputs to the network:

# Create model
class ThyroidMLP(Module):
    def __init__(self, n_inputs):
        super(ThyroidMLP, self).__init__()
        # First hidden layer
        self.hidden1 = Linear(n_inputs, 20)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # Second hidden layer
        self.hidden2 = Linear(20, 10)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # Output layer
        self.hidden3 = Linear(10,1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

Lots to explain here:

The class inherits from Module – this is the PyTorch base class that all neural network models are built on.
Our parameters in the __init__() constructor are of course self and n_inputs (this is the number of independent or X variables we have in our model).
Okay, so what the hell is super().__init__()? What this is doing is calling the constructor of the parent class, which in this case is Module, before we add anything of our own. Module needs this so it can set up the machinery that keeps track of the layers and weights we attach to self. This makes it easier to keep building on top of the original blueprint: we inherit everything Module already does and only write what is specific to our network. Think of a vehicle class: there are many types of vehicle, and most have engines, but your child classes might be a car, motorboat, plane or train – each of which behaves very differently while sharing what the parent class provides.
The next step is to implement the hidden layers in our network, that we briefly touched on in the theory section. Here we are defining a variable called hidden1 that will act as our first hidden layer, with 20 nodes in it. The layer is a linear layer.
We then use Kaiming uniform initialisation to set the starting values of that layer’s weights. The nonlinearity='relu' argument tells the initialiser that the layer is followed by the popular relu activation.
We then set our activation function to ReLU (rectified linear unit), which outputs zero for negative inputs and passes positive inputs through unchanged.
This process is repeated for the second hidden layer. The important point to note is that the number of outputs of one layer has to match the number of inputs of the next: hidden1 = Linear(n_inputs, 20) produces 20 outputs, so hidden2 = Linear(20, 10) takes 20 inputs.
The last layer in the network has a single output followed by a Sigmoid – which squeezes the output to fall between 0 and 1, meaning the higher the value, the more likely the outcome is 1 = sick rather than not sick (negative in the labels). Its weights use Xavier uniform initialisation. Because we are dealing with a binary prediction, a yes or no, we use Sigmoid; for multiple class labels we would use a function called Softmax.

Create the forward passing mechanism

You will see this in a number of PyTorch scripts – the inputs are passed forward through the network, and then the updates to the weights, i.e. the optimisation, are done by back propagation. Let’s put the final piece of our network together.

    def forward(self, X):
        #Input to the first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # Second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # Output layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X

We use X to keep passing each layer to the next until we have all of our hidden layers and activation functions defined. We have now created our model class, that we will work with in the training loop.

The full implementation of this part of the class is below:

class ThyroidMLP(Module):
    def __init__(self, n_inputs):
        super(ThyroidMLP, self).__init__()
        # First hidden layer
        self.hidden1 = Linear(n_inputs, 20)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # Second hidden layer
        self.hidden2 = Linear(20, 10)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # Output layer
        self.hidden3 = Linear(10,1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

    def forward(self, X):
        #Input to the first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # Second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # Output layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X
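
If you want to sanity check the layer sizes before training, a sketch like the one below can help. It uses a compact Sequential stand-in with the same layer sizes rather than the class itself, and the batch of random numbers is purely hypothetical:

```python
import torch
from torch.nn import Linear, ReLU, Sequential, Sigmoid

n_inputs = 26  # number of predictor columns used later in this tutorial

# Compact stand-in with the same layer sizes as ThyroidMLP (weight initialisation omitted)
stand_in = Sequential(
    Linear(n_inputs, 20), ReLU(),
    Linear(20, 10), ReLU(),
    Linear(10, 1), Sigmoid(),
)

batch = torch.rand(4, n_inputs)   # 4 hypothetical rows of random inputs
probs = stand_in(batch)

print(probs.shape)                # torch.Size([4, 1])
```

Each row in gives one value out, and thanks to the Sigmoid every value sits between 0 and 1.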

The fun part is coming up: the definition of our training loop.

Step three – the training loop

This is how the model will train and update.

These can be as complex or simple as you want to make them. I have tried to aim for the middle, as overly simplified would not be useful, and too complicated might fry your brain at this stage, especially if you are a beginner.

# Create training loop based off our custom class
def train_model(train_dl, model, epochs=100, lr=0.01, momentum=0.9, save_path='thyroid_best_model.pth'):
    # Define the loss function and the optimiser that reduces the loss as
    # gradients are propagated back through the network
    start = time.time()
    criterion = BCELoss()
    optimizer = SGD(model.parameters(), lr=lr, momentum=momentum)
    loss = 0.0

    for epoch in range(epochs):
        print('Epoch {}/{}'.format(epoch+1, epochs))
        print('-' * 10)
        model.train()
        # Iterate through training data loader
        for i, (inputs, targets) in enumerate(train_dl):
            optimizer.zero_grad()
            outputs = model(inputs)
            _, preds = torch.max(outputs.data,1) # Unused: with a single output column this index is always 0
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        torch.save(model, save_path)
    time_delta = time.time() - start
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_delta // 60, time_delta % 60
    ))
    
    return model

Here we go:

We have 6 parameters to the function:
- train_dl – this is the training dataloader (we build it in step six)
- model – the model to be trained
- Optional parameters:
  - epochs – how many times we want to pass the full training set through the network and back propagate the errors (this is an optional parameter as the default is set to 100)
  - lr (learning rate) – how big a step each weight update takes. Too high a value and it can overshoot good solutions, too low and training can take forever. This is where the art comes into Machine Learning.
  - momentum – carries part of the previous update into the current one, which is used to speed up training
  - save_path – where the serialised PyTorch (.pth) file is written. By convention PyTorch models are saved with a .pt or .pth extension.
The start variable starts a timer, as we want to time how long our loop takes
The criterion is how the loss will be calculated. Here we will use binary cross entropy (BCE Loss)
The optimiser – how the loss is minimised via a process called gradient descent. We will use stochastic gradient descent (SGD) for this
We default the loss to zero at the start to initialise the variable

The next step is to create the loop to loop through each epoch and start to train the model:

We use a print statement to print the current epoch
We set the model to training mode with model.train(). This does not start training by itself; it switches layers such as dropout and batch normalisation to their training behaviour (our network has neither, but it is a good habit)
We then iterate through our dataloader:
- the optimiser’s gradients are set to zero. This is important to do, as PyTorch accumulates gradients across backward passes, so without it each update would also include the gradients from earlier batches
- we pass the inputs into our model for every batch
- We use: _, preds = torch.max(outputs.data,1), which is the usual way to get class labels when a model has one output per class. Our model has a single sigmoid output, so the index it returns is always 0 and preds is not used afterwards; the outputs variable – which could be renamed to yhat – holds the predicted probabilities
- The outputs and targets are passed to our chosen criterion to compute the loss – in this instance the binary cross entropy loss we discussed above
- We then back propagate the loss through the network with loss.backward(), which computes the gradient for every weight
- Then we call optimizer.step(), which uses those gradients to update the weights
- At the end of each epoch we save the model to the save path using torch.save – here you would want to really implement an if statement to check if this is the best model to save
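
One way to sketch that if statement is shown below. The function name train_keep_best, the synthetic data and the choice to judge "best" by the average training loss are all my own assumptions for illustration; in practice you would more likely compare a validation loss:

```python
import copy
import os
import tempfile

import torch
from torch.nn import BCELoss, Linear, Sequential, Sigmoid
from torch.optim import SGD
from torch.utils.data import DataLoader, TensorDataset


def train_keep_best(train_dl, model, epochs=5, lr=0.01, momentum=0.9,
                    save_path="best_model.pth"):
    """Hypothetical variant of train_model that only saves when the loss improves."""
    criterion = BCELoss()
    optimizer = SGD(model.parameters(), lr=lr, momentum=momentum)
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(epochs):
        model.train()
        running_loss, n_rows = 0.0, 0
        for inputs, targets in train_dl:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            # Weight each batch loss by its size to get a per-row average
            running_loss += loss.item() * inputs.size(0)
            n_rows += inputs.size(0)

        epoch_loss = running_loss / n_rows
        # Only keep and save the weights when this epoch beats the best so far
        if epoch_loss < best_loss:
            best_loss = epoch_loss
            best_state = copy.deepcopy(model.state_dict())
            torch.save(best_state, save_path)

    # Restore the best weights before returning
    model.load_state_dict(best_state)
    return model, best_loss


# Usage on made-up data: 64 random rows with 26 features
torch.manual_seed(0)
X = torch.rand(64, 26)
y = (X[:, :1] > 0.5).float()
toy_dl = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
toy_model = Sequential(Linear(26, 1), Sigmoid())
toy_path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
toy_model, best_loss = train_keep_best(toy_dl, toy_model, save_path=toy_path)
print(os.path.exists(toy_path), best_loss > 0)
```

This sketch saves the state_dict (the weights) rather than the whole model object, which is a design choice on my part; the tutorial’s train_model saves the full model.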

Note: to use the GPU you would need to move the model with model.to(device) or model.to("cuda"), and move each batch of inputs and targets to the same device.
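
A minimal sketch of what that looks like, assuming a small hypothetical model and a random batch. It falls back to the CPU when no GPU is available, so it runs anywhere:

```python
import torch
from torch.nn import Linear, Sequential, Sigmoid

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

toy_model = Sequential(Linear(26, 1), Sigmoid()).to(device)  # hypothetical small model
inputs = torch.rand(8, 26).to(device)                        # each batch must be moved too
outputs = toy_model(inputs)

# Move results back to the CPU before converting to numpy
preds = outputs.detach().cpu().numpy()
print(outputs.device.type, preds.shape)
```

The training and evaluation functions in this tutorial do not do this, so as written they run on the CPU.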

There we go we have created the training loop. Next we need to decide how we are going to evaluate the model.

Step four – evaluating the performance of our network

The next function will be used to evaluate our PyTorch model to see if it is any good, or if we have been wasting our time for the last 20 minutes.

The function for this is contained hereunder, and as always, I will add my view of what is happening in each step:

import math
def evaluate_model(test_dl, model, beta=1.0):
    preds = []
    actuals = []

    for (i, (inputs, targets)) in enumerate(test_dl):
        #Evaluate the model on the test set
        yhat = model(inputs)
        # Detach the predictions from the graph and convert them to a numpy ndarray
        yhat = yhat.detach().numpy()
        # Convert the targets to a numpy ndarray as well
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # Round to get the class value i.e. sick vs not sick
        yhat = yhat.round()
        # Store the predictions in the empty lists initialised at the start of the function
        preds.append(yhat)
        actuals.append(actual)
    
    # Stack the predictions and actual arrays vertically
    preds, actuals = vstack(preds), vstack(actuals)
    #Calculate metrics
    cm = confusion_matrix(actuals, preds)
    # Get descriptions of tp, tn, fp, fn
    tn, fp, fn, tp = cm.ravel()
    total = sum(cm.ravel())
    
    metrics = {
        'accuracy': accuracy_score(actuals, preds),
        'AU_ROC': roc_auc_score(actuals, preds),
        'f1_score': f1_score(actuals, preds),
        'average_precision_score': average_precision_score(actuals, preds),
        'f_beta': ((1+beta**2) * precision_score(actuals, preds) * recall_score(actuals, preds)) / (beta**2 * precision_score(actuals, preds) + recall_score(actuals, preds)),
        'matthews_correlation_coefficient': (tp*tn - fp*fn) / math.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn)),
        'precision': precision_score(actuals, preds),
        'recall': recall_score(actuals, preds),
        'true_positive_rate_TPR':recall_score(actuals, preds),
        'false_positive_rate_FPR':fp / (fp + tn) ,
        'false_discovery_rate': fp / (fp +tp),
        'false_negative_rate': fn / (fn + tp) ,
        'negative_predictive_value': tn / (tn+fn),
        'misclassification_error_rate': (fp+fn)/total ,
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        #'confusion_matrix': confusion_matrix(actuals, preds), 
        'TP': tp,
        'FP': fp, 
        'FN': fn, 
        'TN': tn
    }
    return metrics, preds, actuals

There is a lot to go over here:

Input parameters to the function are the test_dl dataloader, model the PyTorch model aka ThyroidMLP and a beta value for the metrics section
We start the function by initialising empty lists for our preds and our ground truth actual labels
We then loop over the data loader and undertake the following:
- compute the yhat otherwise known as the prediction
- Use the detach().numpy() function to turn the predictions tensor into a numpy nd-array
- Set the actual label to a numpy() array
- Reshape the actual to match what we did at the last part of the data loader class
- We then round the predictions
- Finally we append each prediction in the dataloader to the empty list, we also do the same for the actuals
We use multiple assignment to vstack the predictions and the actuals. This essentially just stacks the per-batch arrays on top of each other, i.e. row-wise, into one array each.
We then create a confusion matrix (cm) with sklearn from the actuals and the predictions – this is the basis for working out how accurate our model is, etc.
To get each of the four counts out of the 2x2 matrix we ravel() it, which flattens it into a one-dimensional array in the order True Negative (TN), False Positive (FP), False Negative (FN) and True Positive (TP).
From this we build a metrics dictionary with key value pairs, this contains:
- Accuracy – the overall model accuracy
- Area under the ROC curve (AU_ROC) – a commonly used evaluation metric for imbalanced problems like the thyroid dataset. Note that here it is computed from the rounded predictions rather than the raw probabilities
- F1 score – this is the harmonic mean of the precision and recall
- F-Beta – a variant of the F1 score that allows weighting by a beta value. This is useful if you want to weight your model in one direction over another. Say you are more interested in recall than precision, then you would set your beta value greater than 1; if you care more about precision, set beta below 1. With beta equal to 1 it is the same as the F1 score. There are approaches to selecting the optimal threshold that I won’t cover in this tutorial.
- Matthews correlation coefficient – represents the correlation between the predicted values and the actual values. This is sometimes a more robust metric for working with confusion matrices.
- Precision – the quality of the positive prediction made by the model
- Recall – what proportion of positives were identified correctly
- Included are a number of other metrics computed, such as false discovery rate, false positive rate, misclassification error rate and the confusion matrix values themselves e.g. True Positives, False Positives, False Negatives and True Negatives.
The final return statement returns the metrics, preds (predictions) and the actuals (actual values).
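
To see how beta shifts the score, here is a small worked sketch in plain Python using the same formulas as the function above. The confusion matrix counts are made up for illustration and are not results from the thyroid model:

```python
import math

# Hypothetical confusion matrix laid out as sklearn does: [[tn, fp], [fn, tp]]
cm = [[86, 2],
      [4, 8]]
tn, fp, fn, tp = [count for row in cm for count in row]  # same order as cm.ravel()

precision = tp / (tp + fp)   # 8 / 10
recall = tp / (tp + fn)      # 8 / 12


def f_beta(beta):
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)


mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Precision is higher than recall here, so leaning towards precision (beta < 1)
# raises the score and leaning towards recall (beta > 1) lowers it
print(round(f_beta(0.5), 3), round(f_beta(1.0), 3), round(f_beta(2.0), 3))
print(round(mcc, 3))
```

Because precision beats recall in this toy example, the beta = 0.5 score comes out highest and the beta = 2 score lowest.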

Step five – creating the prediction routine

This routine is relatively simple compared to the functions we have covered above. It takes in a row (a new list of data) as well as the relevant model and returns the model’s prediction yhat as a detached numpy array:

def predict(row, model):
    row = Tensor([row])
    yhat = model(row)
    # Get numpy array
    yhat = yhat.detach().numpy()
    return yhat

This will give us a prediction for each input we pass into the model. The next step is to prepare our data ready for working with the model.
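
The value that comes back is a probability, so you still need to turn it into a class. A minimal sketch, assuming a hypothetical untrained three-input model and a made-up row, with 0.5 as the cut-off via rounding:

```python
import torch
from torch import Tensor
from torch.nn import Linear, Sequential, Sigmoid


def predict(row, model):
    row = Tensor([row])
    yhat = model(row)
    # Get numpy array
    return yhat.detach().numpy()


toy_model = Sequential(Linear(3, 1), Sigmoid())   # hypothetical model with 3 inputs
row = [0.2, 1.4, 0.7]                             # hypothetical new observation

prob = predict(row, toy_model)                    # array of shape (1, 1)
label = int(prob.round()[0, 0])                   # 1 = sick, 0 = not sick
print(prob.shape, label)
```

With the real model the row would need as many values as the model has inputs.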

Step six – prepare the data to use with our model

We are actually at the point where we will be using our custom model structure to run our model, but first we need an additional helper function to allow us to prepare our dataset:

def prepare_thyroid_dataset(path):
    dataset = ThyroidCSVDataset(path)
    train, test = dataset.split_data(split_ratio=0.1)
    # Prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

This function takes the path of where the csv file is stored. In our case this is on GitHub: https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv. Then the following happens:

We use our custom dataset class entitled ThyroidCSVDataset which has all our processing functions built in
We will then use the split_data method we defined in that class to perform a split on our data
We then create a train and test data loader object – where we are going to shuffle the items in the training set – therefore the results will change on every run.
Finally, we return the data loaders ready to work with in the modelling task.
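
To get a feel for what batch_size does, the sketch below counts the rows in each batch. It uses a hypothetical random dataset of 100 rows wrapped in TensorDataset rather than the thyroid file:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 rows, 26 features, one binary target column
X = torch.rand(100, 26)
y = torch.randint(0, 2, (100, 1)).float()
dataset = TensorDataset(X, y)

train_dl = DataLoader(dataset, batch_size=32, shuffle=True)
batch_sizes = [inputs.shape[0] for inputs, targets in train_dl]

print(batch_sizes)     # [32, 32, 32, 4]
print(len(train_dl))   # 4
```

The last batch is simply whatever is left over, and shuffle=True changes which rows land in each batch on every pass.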

Using our custom classes to train the model

We have prepared all the groundwork needed to build our supervised machine learning classifier. So let’s step through and call the relevant functions.

Loading the dataset

We will fetch the thyroid dataset from the blob storage. This dataset is highly imbalanced and is a Kaggle classification project, so I would expect the model to do well in predicting the negative examples and not so well in picking up whether a patient is sick. You would need to have some imbalanced label strategies in your back pocket – such as SMOTE and ROSE, but these are beyond the scope of this tutorial.

Let’s load the data:

train_dl, test_dl = prepare_thyroid_dataset('https://raw.githubusercontent.com/StatsGary/Data/main/thyroid_raw.csv')

Training the model

To create the model we pass one input – the number of independent variables, or predictor variables, to use with the model. I know that the thyroid dataset has 26, so this is the value we choose. If you use a different dataset, this is the only value you would need to change:

# Specify the number of input dimensions
model = ThyroidMLP(26)

To configure the training run we will use the train_model function we created:

train_model(train_dl, 
            model, 
            save_path='data/thyroid_model.pth', 
            epochs=150, 
            lr=0.01)

Here I pass:

train_dl, the training data loader used to fit the model
model, the ThyroidMLP instance we initialised above
save_path, the path to save the serialised model file to
epochs, the number of full passes over the training data
lr, the learning rate, i.e. the step size used each time the weights are updated
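
To build some intuition for how epochs and the learning rate interact, here is a toy sketch of plain gradient descent on a one-parameter squared error. It is not the train_model function from this tutorial; the function name gradient_descent, the starting weight and the target value are assumptions chosen for illustration.

```python
def gradient_descent(lr, epochs, w=0.0, target=3.0):
    """Minimise (w - target)**2 with plain gradient descent."""
    for _ in range(epochs):
        grad = 2.0 * (w - target)  # derivative of the squared error
        w -= lr * grad             # the learning rate scales each step
    return w


# Same number of epochs as the training run above, three learning rates.
for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr, epochs=150))
```

On this toy problem, 0.01 is still slightly short of the target after 150 epochs, 0.1 converges, and 1.1 overshoots further on every step and diverges. Real networks behave less tidily, but the same trade-off applies.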

You will see the epochs running and then you will get a print out of the DAG (Directed acyclic graph).

Evaluating how well the model performs with the test data

Previously we built the evaluate_model function, which returns a long dictionary of model metrics. We are going to pass our test data loader to it to get the results:

results = evaluate_model(test_dl, model, beta=1)
model_metrics = results[0]
metrics_df = pd.DataFrame.from_dict(model_metrics, orient='index', columns=['metric'])
metrics_df.index.name = 'metric_type'
metrics_df.reset_index(inplace=True)
metrics_df.to_csv('confusion_matrix_thyroid.csv', index=False)

To decode this:

We call the evaluate_model function with our test data loader, our model and a beta value. Beta is the weight given to recall relative to precision in the F-beta score, not a threshold; I want the two weighted equally, so I set beta to 1, which makes f_beta identical to the F1 score
We store the model metrics from the results. As evaluate_model returns multiple values, we take the zeroth element of the return
That element is a dictionary, so we use the from_dict function to convert it to a data frame and name the value column metric. This name could be anything
After this conversion the metric names are stored in the row index, so we name the index metric_type and reset it, which moves the names into a regular column
Finally, I output the results to a CSV file using the handy to_csv function.
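
As a small aside on the beta argument, the sketch below shows the standard F-beta formula in plain Python. The precision and recall values are made up for illustration, and the function is a hand-rolled stand-in rather than whatever evaluate_model calls internally.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier with better precision than recall.
precision, recall = 0.5, 0.25
print(round(f_beta(precision, recall, beta=0.5), 3))  # 0.417
print(round(f_beta(precision, recall, beta=1.0), 3))  # 0.333
print(round(f_beta(precision, recall, beta=2.0), 3))  # 0.278
```

With beta set to 1 this is the F1 score. A beta above 1 pulls the score towards recall, and a beta below 1 pulls it towards precision.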

The results are below:

#                          metric_type      metric
# 0                           accuracy    0.727273
# 1                             AU_ROC    0.497436
# 2                           f1_score    0.096386
# 3            average_precision_score    0.235493
# 4                             f_beta    0.096386
# 5   matthews_correlation_coefficient   -0.008809
# 6                          precision    0.222222
# 7                             recall    0.061538
# 8             true_positive_rate_TPR    0.061538
# 9            false_positive_rate_FPR    0.066667
# 10              false_discovery_rate    0.777778
# 11               false_negative_rate    0.938462
# 12         negative_predictive_value    0.762646
# 13      misclassification_error_rate    0.272727
# 14                       sensitivity    0.061538
# 15                       specificity    0.933333
# 16                                TP   12.000000
# 17                                FP   42.000000
# 18                                FN  183.000000
# 19                                TN  588.000000

As suspected, the dataset is massively imbalanced, so the MLP struggles to identify the true positives: it finds only 12 of the 195 sick patients in the test set (recall 0.06), because positive labels are scarce.
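
Most of the rates in the table can be re-derived from the four confusion counts alone, which is a handy sanity check. The sketch below does this in plain Python; the function name confusion_metrics and the dictionary keys are my own choices, and it does not guard against empty classes (a zero denominator).

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Derive common binary-classification rates from confusion counts."""
    total = tp + fp + fn + tn
    # Denominator of the Matthews correlation coefficient.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # sensitivity / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "f1_score": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "positive_share": (tp + fn) / total,
    }


# Counts taken from the thyroid results table above.
thyroid = confusion_metrics(tp=12, fp=42, fn=183, tn=588)
for name, value in thyroid.items():
    print(f"{name:>15}: {value:.6f}")
```

Only about 24% of the test rows are positive, so a model that always predicted the negative class would score roughly 0.76 accuracy, which is higher than the 0.73 reported here. That is why accuracy on its own is a poor guide for this dataset.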

Let’s poke our model and see what prediction we get, but looking at these results it is most likely to output that the classification of thyroid disease is negative.

Make a prediction against the model

This is the point where, if you were happy with the model, you would push it into production and score new, unseen observations against it. I will poke it with one observation to see how it performs.

row = [0.8408678952719717,0.7480132415430958,-0.3366221139379705,-0.0938130059640389,-0.1101874782051067,-0.2098160394213988,-0.1260114177378201,-0.1118651062104989,-0.1274917875477927,-0.240146053214037,-0.2574472174396955,-0.0715198539852151,-0.0855764265990022,-0.1493202733578882,-0.0190692517849118,-0.2590488060984638,0.0,-0.1753175780014474,0.0,-0.9782211033008232,0.0,-1.3237957945784953,0.0,-0.6384998731458282,0.0,-1.209042232192488]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

I have used a row from the data frame where I know the patient is sick, to check the label the model assigns. It confirms my suspicions about the effect of the imbalance on the model:

Predicted: 0.499 (class=0)

I would never deploy this model. My next step would be to try some class rebalancing techniques.
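
One simple rebalancing option, sketched here as a starting point rather than as something this tutorial implements, is to weight each class inversely to its frequency. The helper name inverse_frequency_weights is hypothetical, and the label list only mimics the class mix of the thyroid test set (195 sick, 630 not sick).

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label column: 195 sick (1) and 630 not sick (0).
labels = [1] * 195 + [0] * 630
weights = inverse_frequency_weights(labels)
print(weights)
```

Here the positive class ends up with a weight a little over three times that of the negative class. Weights like these could feed a weighted loss or a weighted sampler, though whether that helps on this dataset would need to be tested.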

Running the model against balanced dataset

For the sake of time, I will use a dataset called Ionosphere that I know is much better balanced and will show our model in a better light than this example. We will do the data prep and training in one cell and then poke the model. The Ionosphere data is stored here: https://raw.githubusercontent.com/StatsGary/Data/main/ion.csv.

Prepare and train

I will load the Ionosphere data and train the model in one code cell. Evaluation and prediction follow in the next steps:

# Get the Ionosphere data
train_dl, test_dl = prepare_thyroid_dataset('https://raw.githubusercontent.com/StatsGary/Data/main/ion.csv')
# Specify the number of input dimensions (Ionosphere has 34 predictors)
model = ThyroidMLP(34)
# Train the model
train_model(train_dl, model, 
            save_path='data/ionsphere_model.pth', 
            epochs=150, 
            lr=0.01)

The only differences here are that I loaded a different dataset and specified a different number of input dimensions (34 instead of 26).

Evaluate the model

I have written a little eval_model wrapper function here, as there are a couple of steps to converting the stored dictionary into a data frame:

# Evaluate the model
def eval_model(test_dl, model, cm_out_name='confusion_mat.csv',
               beta=1, export_index=False):
    results = evaluate_model(test_dl, model, beta)
    model_metrics = results[0]
    metrics_df = pd.DataFrame.from_dict(model_metrics, orient='index', columns=['metric'])
    metrics_df.index.name = 'metric_type'
    metrics_df.reset_index(inplace=True)
    metrics_df.to_csv(cm_out_name, index=export_index)
    print(metrics_df)
    return metrics_df, model_metrics, results

# eval_model already prints the metrics data frame, so no extra print is needed
metrics_df, model_metrics, results = eval_model(test_dl, model)

This returns the metrics data frame, the model metrics as their raw dictionary and the results of the evaluate_model function. Let’s see how our model performs:

#                          metric_type     metric
# 0                           accuracy   0.923810
# 1                             AU_ROC   0.882353
# 2                           f1_score   0.946667
# 3            average_precision_score   0.898734
# 4                             f_beta   0.946667
# 5   matthews_correlation_coefficient   0.829016
# 6                          precision   0.898734
# 7                             recall   1.000000
# 8             true_positive_rate_TPR   1.000000
# 9            false_positive_rate_FPR   0.235294
# 10              false_discovery_rate   0.101266
# 11               false_negative_rate   0.000000
# 12         negative_predictive_value   1.000000
# 13      misclassification_error_rate   0.076190
# 14                       sensitivity   1.000000
# 15                       specificity   0.764706
# 16                                TP  71.000000
# 17                                FP   8.000000
# 18                                FN   0.000000
# 19                                TN  26.000000

This dataset is much more balanced, and much less realistic, than the first scenario presented. However, it makes for good practice in implementing the model with different datasets.

Predict with the model

As the final step, we will make a prediction with the model. I will choose a row that I know should be in the positive class and see how our model performs:

# Make prediction against model
row = [1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

Predicted: 0.992 (class=1)

This model does so much better at predicting the right class label.
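
Both prediction calls turn the predicted probability into a class label by rounding, which amounts to a cut-off at 0.5. The sketch below makes that cut-off explicit; the function name to_class and the threshold parameter are assumptions for illustration and are not part of the predict function used above.

```python
def to_class(probability, threshold=0.5):
    """Map a predicted probability to a 0/1 label."""
    return int(probability >= threshold)


# The two probabilities reported in this tutorial.
for p in (0.499, 0.992):
    print(f"Predicted: {p:.3f} (class={to_class(p)})")

# A lower cut-off flips the borderline thyroid prediction to positive.
print(to_class(0.499, threshold=0.4))  # 1
```

One small difference from rounding: a probability of exactly 0.5 rounds down to 0 in Python and NumPy, whereas this comparison assigns it to class 1. Lowering the threshold is another lever for imbalanced data, as it trades precision for recall.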

That is that! You have done excellent!

That is it! You have reached the end of this tutorial. I hope by working through this you feel confident to implement your own PyTorch module, or just use this code for your projects.

Please reach out if you need any help adapting this code. I have really enjoyed putting this together and I continue to develop in this toolset, as I love the flexibility.


To leave a comment for the author, please follow the link and comment on their blog: Python – Hutsons-hacks .