Cervical Cancer Prediction using Machine Learning
In this Machine Learning project, we will build a cervical cancer prediction system. We use K Nearest Neighbor, SVC, and Logistic Regression for this project.
So, let’s build this system.
Cervical Cancer Prediction System
Over the past few decades, cervical cancer has been one of the leading causes of cancer death among women worldwide. After lung and breast cancer, it is the third most common type of cancer in women. It is most treatable when it is detected at an early stage, so early detection matters. The stage describes how far the cancer has developed and spread through the body. Following diagnostic tests, doctors can determine the cancer’s stage, and knowing the correct stage lets physicians choose the proper course of treatment.
FIGO (the International Federation of Gynecology and Obstetrics) recommends a staging system with four stages of cervical cancer. Doctors assign the stage after they have assessed the tumor and any spread to the lymph nodes or other parts of the body.
According to studies, there are six primary signs and symptoms of cervical cancer: abnormal bleeding (including during or after menopause), abnormal vaginal discharge, exhaustion, weight loss, pelvic and abdominal pain, and irritation while urinating. Long-term infection with the human papillomavirus (HPV) is the primary cause of cervical cancer. For this project, we are going to use K Nearest Neighbor, Support Vector Classifier, and Logistic Regression.
K Nearest Neighbor
In K Nearest Neighbor classification, an example is classified according to the class of its closest neighbours. Since it is frequently beneficial to consider more than one neighbour, the method is more widely known as k-Nearest Neighbour (k-NN) classification, where the k nearest neighbours are used to determine the class. It is also known as Memory-based Classification, since the training examples must be available in memory at runtime. It is considered a Lazy Learning approach because induction is postponed until runtime. Because it relies directly on the training examples, it is also referred to as example-based or case-based classification.
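The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the classifier used later in the project (which is scikit-learn's KNeighborsClassifier); the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training example
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest examples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per example, labels 0 and 1
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # 0
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # 1
```

Note that there is no training step: all the work happens when a query arrives, which is why the method is called lazy.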
Support Vector Classifier
Support Vector Machines (SVMs), which can be used for pattern recognition and regression, are supervised learning methods grounded in statistical learning theory. Statistical learning theory can identify the factors that matter when learning with certain simple algorithms. However, real-world applications typically require more complex models and algorithms (such as neural networks), which are much harder to analyse theoretically. SVMs can be thought of as sitting at the intersection of learning theory and practice: they create models that are rich enough (for example, encompassing a broad class of neural networks) and yet simple enough for mathematical analysis. This is possible because an SVM can be viewed as a linear method in a high-dimensional feature space.
Support Vector Machines are essentially linear learning machines expressed in dual form: a kernel implicitly maps the input vectors to a feature space, and the best separating hyperplane is computed there. Applying this mapping to the decision function of the optimal hyperplane classifier in dual form gives
f(x) = sgn( Σ_i α_i y_i K(x_i, x) + b ),
where the x_i are the training examples with labels y_i ∈ {−1, +1}, the α_i are the learned dual coefficients, K is the kernel, and b is the bias.
One advantage of the SVM algorithm is that training reduces to a single optimization problem. So far we have considered only the classification case (pattern recognition). The approach generalizes to regression, where y ∈ R. Here the algorithm attempts to build a linear function in the feature space such that the training points lie within a distance ε > 0 of it. Like the pattern-recognition case, this can be expressed as a quadratic programming problem in terms of kernels.
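To make the dual form concrete, the sketch below evaluates a decision function of the form sgn(Σ α_i y_i K(x_i, x) + b) with a Gaussian (RBF) kernel. The support vectors, coefficients, and bias are hypothetical values chosen by hand; a real SVM obtains them by solving the quadratic programming problem, which is not shown here.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def decision_value(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Dual-form decision value: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def classify(x, support_vectors, labels, alphas, bias):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, labels, alphas, bias) >= 0 else -1


# Hypothetical values for illustration only (not learned from data)
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

print(classify((0.2, 0.1), support_vectors, labels, alphas, bias))  # -1
print(classify((1.9, 2.1), support_vectors, labels, alphas, bias))  # 1
```

Only kernel evaluations between the query and the support vectors are needed, so the feature-space mapping itself never has to be computed explicitly.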
Logistic Regression
The logit, or the natural logarithm of the odds, is the fundamental mathematical idea underlying logistic regression. A 2×2 contingency table serves as the foundation for the simplest logit example. Consider a situation in which a binary outcome variable (whether a child from an inner-city school is recommended for remedial reading classes) is paired with a dichotomous predictor variable (gender). The chi-square test of independence could be used to check for an association between the two. In this example, the calculated value is χ²(1) = 3.43.
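As a small worked example, the sketch below computes odds, the odds ratio, and the logit from a 2×2 table. The counts are hypothetical and are not the data behind the chi-square value quoted above; they only show how the quantities relate.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))


# Hypothetical 2x2 table: rows are the predictor, columns are the outcome
table = {
    "boys": {"recommended": 30, "not_recommended": 70},
    "girls": {"recommended": 20, "not_recommended": 80},
}

# Proportion recommended in each group
p_boys = table["boys"]["recommended"] / sum(table["boys"].values())
p_girls = table["girls"]["recommended"] / sum(table["girls"].values())

odds_ratio = odds(p_boys) / odds(p_girls)
log_odds_ratio = logit(p_boys) - logit(p_girls)

print(round(odds_ratio, 3))
print(round(log_odds_ratio, 3))
```

The difference between the two logits equals the logarithm of the odds ratio, which is the quantity a logistic regression coefficient estimates for a dichotomous predictor.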
In general, logistic regression is a good tool for expressing and testing hypotheses about relationships between a categorical outcome variable and one or more categorical or continuous predictor variables. In the simplest case of linear regression, with one continuous predictor X and one dichotomous outcome variable Y, the plot of such data produces two parallel lines, each of which corresponds to one value of the dichotomous outcome.
Because the outcome is dichotomous, it is difficult to describe the two parallel lines with an ordinary least squares regression equation. As an alternative, one may define categories for the predictor and calculate the mean of the outcome variable for each category. The resulting plot of category means appears linear in the middle, much like a typical scatter plot, but is curved at the ends, which gives it an S shape.
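A curve that is linear in the middle and curved at the ends is what the logistic (sigmoid) function, the inverse of the logit, produces. The sketch below evaluates it for a single predictor; the intercept `b0` and slope `b1` are hypothetical values chosen only to show the shape.

```python
import math


def sigmoid(z):
    """Inverse of the logit: maps a log-odds value z to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical intercept and slope for one predictor x
b0, b1 = -5.0, 1.0

# Predicted probability at evenly spaced values of x
probabilities = [sigmoid(b0 + b1 * x) for x in range(0, 11)]

for x, p in enumerate(probabilities):
    print(x, round(p, 3))
```

The step from one x to the next is largest around the middle of the range and shrinks toward both ends, where the probabilities flatten out near 0 and 1.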
Project Prerequisites
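The requirement for this project is Python 3 installed on your computer (Python 3.8 or later if you use the Numpy version listed below, since Numpy 1.22 does not support older versions). I have used a Jupyter notebook for this project. You can use whatever you want.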
The required modules for this project are –
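- Numpy (1.22.4) – pip install numpy
- Pandas – pip install pandas
- SciPy – pip install scipy
- Scikit-learn – pip install scikit-learn
- Statsmodels – pip install statsmodels
- Matplotlib – pip install matplotlib
- Seaborn – pip install seaborn
- Plotly – pip install plotly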
That’s all we need for our project.
Cervical Cancer Prediction Project
We provide the dataset as well as the source code for the cervical cancer prediction project, which will be required later for execution. Please download the cervical cancer prediction project from the following link: Cervical Cancer Prediction Project
Steps to Implement
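1. Import the modules and the libraries. For this project, we import numpy, pandas, and sklearn, among others. We then load the dataset, replace the "?" placeholders with missing values, and convert every column to numeric values.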
from scipy import stats  # statistical functions
import pandas as pd  # data frames
import numpy as np  # numerical arrays
from sklearn.svm import LinearSVC  # linear support vector classifier
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.neighbors import KNeighborsClassifier  # k nearest neighbor
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.feature_selection import SelectFromModel  # model-based feature selection
from sklearn.tree import DecisionTreeRegressor  # decision tree regressor
from sklearn.model_selection import train_test_split  # train/test split function
from sklearn.feature_selection import RFE  # recursive feature elimination
from sklearn.preprocessing import MinMaxScaler  # min-max scaler
from sklearn.feature_selection import SelectKBest  # univariate feature selection
from sklearn.feature_selection import f_classif  # ANOVA F-score
import seaborn as sns  # plotting
import plotly  # plotting
import matplotlib.pyplot as plt  # plotting
from sklearn import metrics  # evaluation metrics
from sklearn.metrics import f1_score  # F1 score
from sklearn.model_selection import GridSearchCV  # grid search with cross-validation
import statsmodels.api as sm  # statistical models

dataframe = pd.read_csv("dataset.csv")
dataframe.replace("?", np.nan, inplace=True)  # replace the "?" placeholders with NaN
dataframe.isna().sum()  # count the missing values in each column
dataframe = dataframe.apply(pd.to_numeric)  # convert every column to numeric values
2. Here we drop the columns in which more than half of the values are missing and fill the remaining gaps with the rounded column mean. Then we separate the features from the four target columns and define the classifiers we want to compare.
def rem_unknown(df):
    # keep only the columns in which at least half of the values are present
    df = df.loc[:, df.count() >= df.shape[0] / 2].copy()
    # fill the remaining missing values with the rounded column mean
    return df.fillna(df.mean().round())

dataframe = rem_unknown(dataframe)
dataframe.describe(include="all")
features = dataframe.iloc[0:858, 0:30]
# the four diagnoses are the target variables
hnman = dataframe["Hinselmann"]
schl = dataframe["Schiller"]
ctlogy = dataframe["Citology"]
bpsy = dataframe["Biopsy"]
algorithms = {
    'logistic regression': LogisticRegression(),
    'decision tree': DecisionTreeClassifier(),
    'k-nearest neighbor': KNeighborsClassifier(),
    'support vector machine': LinearSVC(max_iter=1000000),
}
3. Here we split the features and each of the four targets into training and testing datasets. We also define a helper that scales the features to the range 0 to 1, using a scaler fitted on the training dataset only.
def split_data(features, target):
    # split the data with train_test_split (by default 75% training, 25% testing)
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    return X_train, X_test, y_train, y_test

X_train_hnman, X_test_hnman, y_train_hnman, y_test_hnman = split_data(features, hnman)
X_train_schl, X_test_schl, y_train_schl, y_test_schl = split_data(features, schl)
X_train_ctlogy, X_test_ctlogy, y_train_ctlogy, y_test_ctlogy = split_data(features, ctlogy)
X_train_bpsy, X_test_bpsy, y_train_bpsy, y_test_bpsy = split_data(features, bpsy)

def process(train, test):
    scaler = MinMaxScaler().fit(train)  # fit the min-max scaler on the training dataset only
    X_train = scaler.transform(train)  # transform the training dataset
    X_test = scaler.transform(test)  # transform the testing dataset with the same scaler
    return X_train, X_test
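The scaling step can be illustrated without any library. The sketch below is a simplified stand-in for MinMaxScaler, written only to show why the minimum and maximum are learned from the training rows and then reused for the testing rows; the function names and the toy numbers are hypothetical.

```python
def fit_min_max(train_rows):
    """Learn the per-column minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def transform_min_max(rows, mins, maxs):
    """Rescale each value with (x - min) / (max - min); constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical toy data with two columns
train = [[18, 0], [30, 2], [42, 4]]
test = [[24, 1], [48, 5]]

mins, maxs = fit_min_max(train)
train_scaled = transform_min_max(train, mins, maxs)
test_scaled = transform_min_max(test, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Testing values that lie outside the training range end up outside 0 to 1 (here 48 maps to 1.25). That is expected: the testing data must not influence the scaling parameters.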
4. Here we scale the datasets and compare three feature selection methods: univariate selection, model-based selection, and recursive feature elimination. For each target, we record the accuracy of a logistic regression trained on all the features and on each selected subset.
from sklearn.base import clone  # used to get a fresh selector for every target

# preprocess the data
X_train_hnman_s, X_test_hnman_s = process(X_train_hnman, X_test_hnman)
X_train_schl_s, X_test_schl_s = process(X_train_schl, X_test_schl)
X_train_ctlogy_s, X_test_ctlogy_s = process(X_train_ctlogy, X_test_ctlogy)
X_train_bpsy_s, X_test_bpsy_s = process(X_train_bpsy, X_test_bpsy)

feat_select_dict = {
    "UNI": SelectKBest(score_func=f_classif, k=3),
    "MB": SelectFromModel(DecisionTreeRegressor(random_state=3000)),
    "RFE": RFE(DecisionTreeRegressor(random_state=3000), n_features_to_select=3),
}

# result tables: one row per feature set, one column per target
targets = ["Hinselmann", "Schiller", "Citology", "Biopsy"]
rows = ["Acc_All"] + ["Acc_" + name for name in feat_select_dict]
trainresults = pd.DataFrame(index=rows, columns=targets, dtype=float)
results = pd.DataFrame(index=rows, columns=targets, dtype=float)

def select_fet(feat_select_dict, x_train_data, x_test_data, y_train_data, y_test_data, target):
    # baseline: logistic regression trained on all the features
    model_all = LogisticRegression().fit(X=x_train_data, y=y_train_data)
    trainresults.loc["Acc_All", target] = model_all.score(x_train_data, y_train_data)
    results.loc["Acc_All", target] = model_all.score(x_test_data, y_test_data)
    models = []  # fitted selectors for this target
    for name, method in feat_select_dict.items():
        model = clone(method)  # fresh copy, so that earlier targets are not overwritten
        model.fit(x_train_data, y_train_data)  # fit the selector on the training dataset
        xtrain = model.transform(x_train_data)
        xtest = model.transform(x_test_data)
        models.append(model)
        # train on the selected training features, then score on both datasets
        clf = LogisticRegression().fit(X=xtrain, y=y_train_data)
        trainresults.loc["Acc_" + name, target] = clf.score(xtrain, y_train_data)
        results.loc["Acc_" + name, target] = clf.score(xtest, y_test_data)
    return trainresults, results, models

trainresults, results, models_hnman = select_fet(feat_select_dict, X_train_hnman_s, X_test_hnman_s, y_train_hnman, y_test_hnman, "Hinselmann")
trainresults, results, models_schl = select_fet(feat_select_dict, X_train_schl_s, X_test_schl_s, y_train_schl, y_test_schl, "Schiller")
trainresults, results, models_ctlogy = select_fet(feat_select_dict, X_train_ctlogy_s, X_test_ctlogy_s, y_train_ctlogy, y_test_ctlogy, "Citology")
trainresults, results, models_bpsy = select_fet(feat_select_dict, X_train_bpsy_s, X_test_bpsy_s, y_train_bpsy, y_test_bpsy, "Biopsy")
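The univariate method above ranks features with f_classif, which scores each feature by a one-way ANOVA F-statistic against the class labels. The sketch below computes that statistic by hand and keeps the top k features. It is a simplified illustration with hypothetical feature names and values, not a replacement for SelectKBest.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature against the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(group) / len(group) for label, group in groups.items()}
    # Variation of the class means around the overall mean
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    # Variation of the values around their own class mean
    ss_within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest F-statistic."""
    scores = {name: anova_f(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three features, six examples, two classes
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feature_a": [1, 2, 3, 7, 8, 9],  # class means far apart
    "feature_b": [5, 1, 4, 2, 5, 3],  # class means identical
    "feature_c": [1, 3, 2, 4, 3, 5],  # class means moderately apart
}

print(select_k_best(columns, labels, k=2))  # ['feature_a', 'feature_c']
```

A feature whose class means are far apart relative to the spread within each class gets a high score, so it is kept. The sketch assumes every feature varies within at least one class; otherwise the denominator would be zero.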
5. Here we collect the names of the features chosen by the selectors and define a helper that keeps only those columns. We then convert the "Smokes (years)" and "Hormonal Contraceptives (years)" columns to binary values.
features_list = []  # names of the selected features

def extract_features(model):
    # indices of the features kept by the fitted selector
    selected = [i for i, val in enumerate(model.get_support()) if val]
    for name in features.columns[selected][:3]:  # at most the first three features
        print(name)
        features_list.append(name)
    return features_list

# collect the selected features from every fitted selector
for model in models_hnman + models_schl + models_ctlogy + models_bpsy:
    extract_features(model)
features_list = list(set(features_list))  # remove duplicates

def select_features_in_data(x, features, columns_list):
    x_selected = pd.DataFrame(x, columns=features)
    # drop the columns that were not selected
    x_selected = x_selected.drop(columns=[col for col in x_selected if col not in columns_list])
    return x_selected

# value ranges built with np.arange (the end value is excluded)
age_range_0 = np.arange(12, 40)
age_range_1 = np.arange(39, 85)
sex_part_0 = np.arange(0, 5)
sex_part_1 = np.arange(5, 29)
first_sex_0 = np.arange(16)
first_sex_1 = np.arange(15, 33)
num_preg_0 = np.arange(4)
num_preg_1 = np.arange(3, 12)

# assumption: the source does not show how selected_data is built;
# here it holds the selected feature columns
selected_data = select_features_in_data(features, features.columns, features_list)
# CLEANING: clean up strings so they can be used as numbers
for columns in selected_data:
    selected_data[columns] = pd.to_numeric(selected_data[columns])

def trans_to_binary(data):
    # 0 if the value is less than one year, 1 otherwise
    for column in ["Smokes (years)", "Hormonal Contraceptives (years)"]:
        data[column] = (data[column] >= 1).astype(int)
    return data

# assumption: the source does not show how hypothesis_data is built;
# here it holds the two columns that are converted
hypothesis_data = dataframe[["Smokes (years)", "Hormonal Contraceptives (years)"]].copy()
result = trans_to_binary(hypothesis_data)  # transform the data
result  # display the result
Summary
In this Machine Learning project, we built a cervical cancer prediction system. For this project, we used K Nearest Neighbor, SVC, and Logistic Regression. We hope you have learned something from this project.