Cervical Cancer Prediction using Machine Learning


With this Machine Learning Project, we will be building a cervical cancer prediction system. We used K Nearest Neighbor, SVC, and Logistic Regression for this project.

So, let’s build this system.

Cervical Cancer Prediction System

Over the past few decades, cervical cancer has been one of the leading causes of cancer death among women worldwide. After lung and breast cancer, it is the third most common type of cancer in women. It can be treated effectively only when it is detected at an early stage, so timely detection is crucial. The stage describes how far the cancer has developed and spread through a woman’s body. Following diagnostic tests, doctors can determine the cancer’s stage, and knowing the correct stage of cervical cancer allows physicians to administer the proper course of treatment.

FIGO (the International Federation of Gynecology and Obstetrics) defines four stages of cervical cancer. Doctors determine the stage after they have assessed the tumor and any spread to the lymph nodes or other parts of the body.

According to studies, there are six primary signs and symptoms of cervical cancer: abnormal bleeding during or after menopause, abnormal vaginal discharge, exhaustion, weight loss, pelvic and abdominal pain, and irritation while urinating. Long-term infection, in particular infection with the human papillomavirus (HPV), is the primary cause of cervical cancer. For this project, we are going to use K Nearest Neighbor, Support Vector Classifier, and Logistic Regression.

K Nearest Neighbor

In K Nearest Neighbor Classification, examples are classified according to the class of their closest neighbours. Since it is frequently beneficial to consider more than one neighbour, the method is widely known as k-Nearest Neighbour (k-NN) Classification, where the k nearest neighbours are used to determine the class. It is also known as Memory-based Classification, since the training examples must be available in memory at runtime. It is considered a Lazy Learning approach because induction is postponed until runtime. Because classification relies directly on the training examples, it is also referred to as example-based or case-based classification.
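
As a minimal sketch of the idea (the points below are purely hypothetical), scikit-learn's KNeighborsClassifier simply stores the training examples and labels a query point by a majority vote of its k nearest neighbours:

from sklearn.neighbors import KNeighborsClassifier #the k-NN classifier
import numpy as np

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]]) #hypothetical training points
y_train = np.array([0, 0, 1, 1]) #their class labels

knn = KNeighborsClassifier(n_neighbors=3) #use the 3 closest neighbours
knn.fit(X_train, y_train) #"training" only stores the examples in memory
print(knn.predict([[5, 6]])) #majority vote of the 3 nearest neighbours gives class 1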

Support Vector Classifier

Support Vector Machines (SVMs), which can be used for pattern recognition and regression, are supervised machine learning models based on statistical learning theory. However, real-world applications typically require more complex models and algorithms (such as neural networks), which makes them much harder to analyse theoretically. Statistical learning theory can identify the factors that must be considered for certain simple algorithms to learn successfully. SVMs can be thought of as sitting at the nexus of learning theory and practice: they create models that are both sufficiently complex (incorporating, for example, a broad class of neural networks) and sufficiently straightforward for mathematical analysis. This is because an SVM may be considered a linear method in a large feature space.

Support Vector Machines are simply linear learning machines expressed in dual form that use kernels to map their input vectors into a feature space and compute the optimal hyperplane there. Applying this mapping, the decision function of the optimal hyperplane classifier in dual form becomes

f(x) = sgn(Σ_i α_i y_i k(x, x_i) + b), where α_i are the dual coefficients, y_i the training labels, k the kernel function, and b the bias term.

The SVM algorithm has the advantage that it reduces to a single optimization problem. So far we have only considered the classification case (pattern recognition). The approach generalizes to regression, where y ∈ R. Here the goal is to construct a linear function in the feature space such that the training points lie within a distance ε > 0 of it. Just like the pattern-recognition case, this can be expressed as a quadratic programming problem using kernels.
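
As a small illustration of both cases (the numbers below are hypothetical), scikit-learn's SVC handles the pattern-recognition setting with a kernel, while SVR fits a function that keeps the training targets within an ε-tube:

from sklearn.svm import SVC, SVR #kernel-based classifier and regressor
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]) #hypothetical 1-D inputs
y_class = np.array([0, 0, 0, 1, 1, 1]) #labels for the classification case
y_reg = np.array([0.1, 0.9, 2.1, 2.9, 4.2, 5.0]) #targets for the regression case

clf = SVC(kernel="rbf").fit(X, y_class) #the kernel maps inputs to feature space implicitly
reg = SVR(epsilon=0.5).fit(X, y_reg) #fit a function within an epsilon-tube of the targets

print(clf.predict([[4.5]])) #predicted class of a new point
print(reg.predict([[4.5]])) #regression estimate at the same point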

Logistic Regression

The logit, or the natural logarithm of an odds ratio, is the fundamental mathematical idea underlying logistic regression. A 2×2 contingency table serves as the foundation for the simplest logit example. Consider a situation where a child from an inner-city school may be recommended for remedial reading classes, and the distribution of this binary outcome variable is examined against a dichotomous predictor variable such as gender. A chi-square test for independence could be applied; for such a table, the calculated value might be χ2(1) = 3.43.

In general, logistic regression is a good tool for expressing and testing hypotheses about relationships between one or more categorical or continuous predictor variables and a categorical outcome variable. In the simplest case of one dichotomous predictor X and one dichotomous outcome variable Y, a plot of such data produces two parallel lines, each corresponding to a value of the dichotomous outcome.

Due to the dichotomy of the outcome, it is difficult to describe the two parallel lines with an ordinary least squares regression equation. As an alternative, one may group the predictor into categories and compute the mean of the outcome variable for each category. The resulting scatter plot of category means will appear linear in the middle, much like an ordinary scatter plot, but curved at the ends.
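
To make the logit concrete (the counts below are hypothetical and not taken from the study mentioned above), the odds ratio of a 2×2 table and its natural logarithm can be computed directly:

import numpy as np

table = np.array([[30, 70], #hypothetical counts: boys recommended / not recommended
                  [15, 85]]) #hypothetical counts: girls recommended / not recommended

odds_boys = table[0, 0] / table[0, 1] #odds of recommendation for boys
odds_girls = table[1, 0] / table[1, 1] #odds of recommendation for girls
odds_ratio = odds_boys / odds_girls #the odds ratio of the table
logit = np.log(odds_ratio) #its natural logarithm, i.e. the logit

print(round(odds_ratio, 3), round(logit, 3)) #printing the odds ratio and the logit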

Project Prerequisites

The only requirement for this project is Python 3.6 (or above) installed on your computer. I have used a Jupyter notebook for this project, but you can use whatever editor you prefer.
The required modules for this project are –

  1. Numpy – pip install numpy
  2. Pandas – pip install pandas
  3. Scikit-learn – pip install scikit-learn
  4. Scipy – pip install scipy
  5. Seaborn – pip install seaborn
  6. Matplotlib – pip install matplotlib
  7. Plotly – pip install plotly
  8. Statsmodels – pip install statsmodels

That’s all we need for our project.

Cervical Cancer Prediction Project

We provide the dataset as well as the source code for the cervical cancer prediction project, which will be required later for execution. Please download them from the following link: Cervical Cancer Prediction Project

Steps to Implement

1. Import the modules and libraries. For this project, we are importing numpy, pandas, scipy, scikit-learn, seaborn, plotly, matplotlib, and statsmodels. We then load the dataset, replace the "?" placeholders with NaN, and convert all columns to numeric values.

from scipy import stats #importing the scipy library
import pandas as pd #importing the pandas module
import numpy as np #importing the numpy library
from sklearn.svm import LinearSVC #importing the Linear SVC
from sklearn.linear_model import LogisticRegression #importing the logistic regression
from sklearn.neighbors import KNeighborsClassifier #importing the k nearest neighbor classifier
from sklearn.tree import DecisionTreeClassifier #importing the Decision Tree classifier
from sklearn.feature_selection import SelectFromModel #importing SelectFromModel

from sklearn.tree import DecisionTreeRegressor #importing the Decision Tree Regressor
from sklearn.model_selection import train_test_split #importing the train test split function
from sklearn.feature_selection import RFE #importing RFE
from sklearn.preprocessing import MinMaxScaler #importing MinMaxScaler
from sklearn.feature_selection import SelectKBest #importing SelectKBest
from sklearn.feature_selection import f_classif #importing f_classif

import seaborn as sns #importing the seaborn library
import plotly #importing plotly
import matplotlib.pyplot as plt #importing matplotlib

from sklearn import metrics #importing the metrics
from sklearn.metrics import f1_score #importing the f1 score
from sklearn.model_selection import GridSearchCV #importing GridSearchCV

import statsmodels.api as sm #importing statsmodels
 
dataframe = pd.read_csv("dataset.csv") #loading the dataset
 
dataframe.replace("?", np.nan, inplace=True) #replacing the ? placeholders with NaN
dataframe.isna().sum() #checking the number of null values
dataframe = dataframe.apply(pd.to_numeric) #converting the dataframe to numeric values

2. Here we define a helper that drops columns with too many unknown values and fills the remaining missing entries with the column means. We then separate the feature columns from the four target variables and define the classifiers we will compare.

def rem_unknown(df): #defining a function to remove columns with too many unknown values
    i = 0
    while i < df.shape[1]:
        if df.count().iloc[i] < df.shape[0]/2: #checking if more than half the values in the column are missing
            df.drop(axis=1, labels=[df.columns[i]], inplace=True) #dropping that column
            i -= 1
        i += 1
    mean = round(df.mean()) #computing the rounded column means
    df.fillna(mean, inplace = True) #filling the remaining missing values with the column means
    return df
 
dataframe= rem_unknown(dataframe)
dataframe.describe(include = "all")
 
features = (dataframe.iloc[0:858, 0:30])
 
#the four diagnoses are the target variables
hnman= dataframe["Hinselmann"]
schl= dataframe["Schiller"]
ctlogy= dataframe["Citology"]
bpsy= dataframe["Biopsy"]
 
algorithms = {
    'logistic regression': LogisticRegression(),
    'decision tree': DecisionTreeClassifier(),
    'k-nearest neighbor': KNeighborsClassifier(),
    'support vector machine': LinearSVC(max_iter=1000000),
}
 
 

3. Here we split the features and each of the four target variables into training and testing datasets. We also define a helper function that scales the data with MinMaxScaler.

def split_data(features, target):#defining a function for splitting the data
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 3000)#splitting the data using the train test split function
    return X_train, X_test, y_train, y_test #returning the training and the testing values
 
X_train_hnman, X_test_hnman, y_train_hnman, y_test_hnman = split_data(features, hnman) 
X_train_schl, X_test_schl, y_train_schl, y_test_schl = split_data(features, schl)
X_train_ctlogy, X_test_ctlogy, y_train_ctlogy, y_test_ctlogy = split_data(features, ctlogy)
X_train_bpsy, X_test_bpsy, y_train_bpsy, y_test_bpsy = split_data(features, bpsy)
 
def process(train, test): #defining the function for scaling the data
    scaler = MinMaxScaler().fit(train) #creating an instance of MinMaxScaler fitted on the training data
   
    X_train = scaler.transform(train) #transforming the training dataset
    X_test = scaler.transform(test) #transforming the testing dataset
 
    return X_train, X_test

4. Here we scale each of the training and testing splits, and then compare a Logistic Regression model trained on all features against models trained on features chosen by three selection techniques: univariate selection (SelectKBest), model-based selection (SelectFromModel), and recursive feature elimination (RFE).

#preprocess the data
X_train_hnman_s, X_test_hnman_s = process(X_train_hnman, X_test_hnman)
X_train_schl_s, X_test_schl_s = process(X_train_schl, X_test_schl)
X_train_ctlogy_s, X_test_ctlogy_s = process(X_train_ctlogy, X_test_ctlogy)
X_train_bpsy_s, X_test_bpsy_s = process(X_train_bpsy, X_test_bpsy)
 
feat_select_dict = {"UNI" : SelectKBest(score_func=f_classif, k = 3),
                    "MB" : SelectFromModel(DecisionTreeRegressor(random_state = 3000)),
                    "RFE" : RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 3)}
 
trainresults = pd.DataFrame() #dataframe for the training accuracies
results = pd.DataFrame() #dataframe for the testing accuracies
 
def select_fet(feat_select_dict, x_train_data, x_test_data, y_train_data, y_test_data, target):
    model_all = LogisticRegression().fit(X=x_train_data, y=y_train_data) #baseline model trained on all features
    trainresults.loc["Acc_All", target] = model_all.score(x_train_data, y_train_data) #training accuracy on all features
    results.loc["Acc_All", target] = model_all.score(x_test_data, y_test_data) #testing accuracy on all features
    models = [] #creating an empty list for the fitted selectors
    
    for name, method in feat_select_dict.items(): #running a loop over the selectors
        model = method
        model.fit(x_train_data, y_train_data) #fitting the selector on the training data
        xtrain = model.transform(x_train_data) #keeping only the selected features of the training data
        xtest = model.transform(x_test_data) #keeping only the selected features of the testing data
        models.append(model) #appending the fitted selector to models
        clf = LogisticRegression().fit(X=xtrain, y=y_train_data) #model trained on the selected features
        trainresults.loc["Acc_" + name, target] = clf.score(xtrain, y_train_data) #training accuracy on the selected features
        results.loc["Acc_" + name, target] = clf.score(xtest, y_test_data) #testing accuracy on the selected features
    return trainresults, results, models
 
trainresults, results, models_hnman = select_fet(feat_select_dict, X_train_hnman_s, X_test_hnman_s, y_train_hnman, y_test_hnman, "Hinselmann")
trainresults, results, models_schl = select_fet(feat_select_dict, X_train_schl_s, X_test_schl_s, y_train_schl, y_test_schl, "Schiller")
trainresults, results, models_ctlogy = select_fet(feat_select_dict, X_train_ctlogy_s, X_test_ctlogy_s, y_train_ctlogy, y_test_ctlogy, "Citology")
trainresults, results, models_bpsy =  select_fet(feat_select_dict, X_train_bpsy_s, X_test_bpsy_s, y_train_bpsy, y_test_bpsy, "Biopsy")

5. Here we extract the names of the selected features, define value ranges for the hypothesis variables, and transform the smoking and hormonal-contraceptive columns into binary indicators.

features_list = [] #creating a feature list
def extract_features(model): #defining a function for feature extraction
    selected = [i for i, val in enumerate(model.get_support()) if val] #indices of the selected columns
    index = 0 #starting at index 0
    while index < 3: #running a loop of 3 iterations
        print(features.columns[selected][index]) #printing the selected column name
        features_list.append(features.columns[selected][index]) #appending the column name to the feature list
        index += 1 #incrementing the index
 
    return features_list #returning the feature list
 
features_list = list(set(features_list)) #removing duplicate column names from the feature list
 
 
def select_features_in_data(x, features, columns_list):#defining a function for selecting features in data
    x_selected = pd.DataFrame(x, columns = features)#selecting the feature
    x_selected = x_selected.drop(columns = [col for col in x_selected if col not in columns_list]) #dropping the columns which are not selected from the dataframe
    return x_selected#returning the selected features
 
age_range_0 = np.arange(12, 40) #using the arange function of numpy for the younger age group
age_range_1 = np.arange(39, 85) #using the arange function of numpy for the older age group
 
sex_part_0 = np.arange(0, 5) #range for fewer sexual partners
sex_part_1 = np.arange(5, 29) #range for more sexual partners
 
first_sex_0 = np.arange(16) #range for earlier first sexual intercourse
first_sex_1 = np.arange(15, 33) #range for later first sexual intercourse
 
num_preg_0 = np.arange(4) #range for fewer pregnancies
num_preg_1 = np.arange(3, 12) #range for more pregnancies
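 
#NOTE: selected_data and hypothesis_data are prepared in the downloaded source
#(likely via extract_features and select_features_in_data above). A minimal,
#assumed reconstruction is sketched here; the column names below follow the UCI
#risk-factor dataset and are an assumption.
hypothesis_columns = ["Age", "Number of sexual partners", "First sexual intercourse",
                      "Num of pregnancies", "Smokes (years)",
                      "Hormonal Contraceptives (years)"] #assumed hypothesis columns
selected_data = features[[col for col in hypothesis_columns if col in features.columns]].copy() #keeping only those columns
hypothesis_data = selected_data #the frame transformed to binary below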
 
#CLEANING: clean up strings so they can be used as numbers
for columns in selected_data:
    selected_data[columns] = pd.to_numeric(selected_data[columns])
    
def trans_to_binary(data):#defining a function for transformation
    smoke_range = data["Smokes (years)"]
 
    for years in smoke_range: #checking the years smoked
        if years < 1: #if years is less than 1
            data["Smokes (years)"].replace(years, 0, inplace=True) #replacing the years value with 0
        else:
            data["Smokes (years)"].replace(years, 1, inplace=True) #otherwise replacing the years value with 1
        
    hor_range = data['Hormonal Contraceptives (years)']#storing all the columns in separate list
 
    for years in hor_range:#checking the years in hormonal contraceptive
        if years < 1:#if year less than 1
            data['Hormonal Contraceptives (years)'].replace(years, 0, inplace=True) #replacing the years with 0
        else:
            data['Hormonal Contraceptives (years)'].replace(years, 1, inplace=True)#replacing the years with 1
    return data#returning the data
 
result = trans_to_binary(hypothesis_data) #calling the function for the transformation of the data
result #printing the result
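
The algorithms dictionary defined in step 2 is not exercised in the snippets above. As a minimal sketch of how each of the listed classifiers could be trained and scored (choosing the Biopsy split for this is an assumption), one might write:

for name, clf in algorithms.items(): #looping over the classifiers defined in step 2
    clf.fit(X_train_bpsy_s, y_train_bpsy) #training on the scaled Biopsy training split
    accuracy = clf.score(X_test_bpsy_s, y_test_bpsy) #accuracy on the held-out Biopsy data
    print(name, round(accuracy, 3)) #printing the score for each model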

cervical cancer prediction output

Summary

In this Machine Learning project, we built a cervical cancer prediction system. For this project, we used K Nearest Neighbor, SVC, and Logistic Regression. We hope you have learned something from this project.
