Preprocessing of Skin Images and Feature Selection for Early Stage of Melanoma Detection using Color Feature Extraction

Article history: Received 29 April 2020; Revised 05 June 2020; Accepted 10 Sep 2020.
Preprocessing is essential for good segmentation because it affects the feature extraction process. Melanomas have various shapes, and the features extracted from their images are used for early-stage detection. Because melanoma is a dangerous disease, early detection is required to prevent the cancer from progressing to later phases. In this paper, we propose a new method to detect cancer in skin images using color feature extraction and feature selection. The default color space of the skin images is RGB; brightness is then added to distinguish normal areas from darkened areas of the skin. After that, an average filter and histogram equalization are applied to obtain color intensities that separate normal skin from suspicious skin. Otsu thresholding is then used for melanoma segmentation. There are 147 features extracted from the segmented images. These features are reduced using three feature selection algorithms: Linear Discriminant Analysis (LDA), Correlation based Feature Selection (CFS), and Relief. The selected features are classified using k-Nearest Neighbor (k-NN). Relief proves to be the best of the three feature selection methods, and the optimal k value is 7 under 10-fold cross validation, with accuracies of 0.835 and 0.845 without and with feature selection, respectively. The results indicate that the framework is applicable to early skin cancer detection.


Introduction
Melanoma is a high-risk type of skin cancer that is responsible for most skin cancer deaths. Early detection is one way to tackle this issue and reduce the number of sufferers. According to HealthGrove statistics for Indonesia, the annual mortality rate of malignant skin cancer per 100,000 people has increased by 49.1% since 1990, an average of 2.1% per annum [1]. Image processing can support the early detection of melanoma by analyzing image features, which are then classified as melanoma or non-melanoma. Images captured by a digital dermatoscope are easily analyzed by dermatologists [2].
Image processing is applied in this paper. The detection of skin cancer is generally divided into several stages, namely preprocessing, segmentation, feature extraction, and classification [3]. Preprocessing improves the quality of the image before the segmentation stage. Image quality must be handled carefully in order to produce good classification results. Feature extraction, meanwhile, derives new representations of the input images that are used for classification. In addition, selecting the right features is likely to increase the accuracy of the results. Feature selection is therefore an essential part of building a melanoma detection system.
Several studies on melanoma detection have been carried out previously. Bhati and Singhal [4] classified skin lesions as malignant (cancerous) or benign (non-cancerous) based on Otsu segmentation. Otsu segmentation is a thresholding method that automatically separates the main object from the background without requiring any input parameter. Their research also combined the ABCD rule (Asymmetry, Border, Color, Diameter), a technique for detecting skin cancer. In [5], the ABCD rule provided the features and a Support Vector Machine (SVM) served as the classification method. Ram et al. [6] applied a Gaussian filter to remove noise from skin images and combined it with a classification algorithm for melanoma. These previous studies used features such as the ABCD rule, textures, and color histograms in the classification process. In this paper, we propose a technique to identify suspicious skin through color feature extraction based on color moments. Accurate color information is still under investigation, and color calibration alone has been able to determine color characteristics [7]. Because the images come from different dataset sources and devices, suspicious skin images also differ in illumination [8]. Color feature extraction is therefore the main focus of this paper.
Because a large number of color features are extracted, feature selection or feature reduction is needed. Feature selection chooses the attributes that most strongly affect the classification result. Features with a strong relationship to the categories are selected, while the remaining ones are removed [9]. Proper feature selection also decreases the processor load because unimportant features are not computed [10]. Several studies have examined feature reduction methods. One study used Linear Discriminant Analysis (LDA), one of the feature selection methods, to handle the uneven distribution of features, together with a Fuzzy kNN method refined with the LpNorm method [11]. Its results showed that LDA was able to increase the accuracy of melanoma detection by 63.6% compared with detection without LDA. Besides LDA, there is also the Principal Component Analysis (PCA) method [12]. In this paper, we apply LDA with improved image enhancement and compare it with other feature selection methods, namely Correlation based Feature Selection (CFS) [13] and Relief [14]. CFS uses a heuristic search strategy and chooses features that are highly correlated with the class [15]. Relief is a feature selection method that takes a statistical approach and avoids heuristic search. It is recognized as one of the most successful methods for assessing feature quality because of its simplicity and effectiveness. In a study conducted by Ambarwari [16], Relief was combined with fuzzy kNN to identify plant species, and detection succeeded with an accuracy above 70%.
In this study, a preprocessing method for skin images is designed to enhance image quality. LDA, CFS, and Relief are implemented as feature selection methods to select the distinctive features of suspicious skin images. Given the advantages of feature selection, early skin cancer detection can be examined more effectively than without feature selection. k-Nearest Neighbor (k-NN) is used to classify the two categories of suspicious skin, with Euclidean distance as the distance metric.

A. Dataset
The dataset consists of 200 skin images in two categories. In this paper, we use a secondary dataset from previous research [11,17]. The pixel dimension of each skin lesion image is 15 × 15 × 3 in the RGB (Red, Green, Blue) color space. There are 100 images in each of the melanoma and non-melanoma classes. The default color space is RGB, with a wide variety of intensities and skin colors. Fig. 1 shows examples of melanoma and non-melanoma images on the left and right sides, respectively. The overall proposed method is shown in Fig. 2. The first stage is image preprocessing, which distinguishes normal skin from suspicious skin. The second stage is feature extraction, which focuses on color features only. Extraction from the normalized RGB color space, RGBr [7], produces 147 features. Such a large number of features is not suitable for classification because it introduces noise. Feature selection is therefore required to identify the relationships among features and to remove those that are only weakly related to the others. Fewer features also reduce computation because the classifier does not need to handle high-dimensional data. The three algorithms tested in this paper are LDA (the method used in previous research), CFS, and Relief. Finally, classification assigns an image to one of the two available classes.

B. Preprocessing
Preprocessing is a technique to differentiate normal skin from suspicious skin. As described in the introduction, preprocessing is a decisive step because of its effect on the feature extraction process. Color transformation is required as the first stage: the RGB color space is normalized to RGBr [7]. The color transformation is given in Equation 1, where Rr, Gr, and Br are the normalized color channels transformed from the RGB color space, and x and y denote the position of each pixel within the image dimensions. The three color channels Rr, Gr, and Br are then used for the main preprocessing. Fig. 3 shows the general preprocessing stages used in this paper. The original skin lesion images are resized by a factor of 0.9 relative to the primary image. Brightness is then added because several skin lesion images have darker colors; the brightness of every image in the dataset is increased by an intensity value of 70. The large amount of noise in the original images is filtered in order to suppress the area outside the suspicious skin. In this paper, we use average filtering with a kernel size of 13 × 13, in which every pixel is replaced by the average intensity within the kernel. In the blurred image, the skin lesion appears more prominent relative to the surrounding area. This technique also aims to enhance the quality of the images [18].
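The brightness addition and average filtering steps can be sketched as follows. This is a minimal illustration on a single channel stored as a list of rows, not the authors' implementation: the function names `add_brightness` and `average_filter` are hypothetical, clipping at 255 is an assumption, and the border rule (windows clipped at the image edge) is also an assumption because the paper does not state one.

```python
def add_brightness(channel, offset=70, max_value=255):
    """Add a constant offset to every pixel, clipping at max_value (assumed 8-bit)."""
    return [[min(max_value, p + offset) for p in row] for row in channel]


def average_filter(channel, size=13):
    """Replace each pixel by the mean of the size x size window around it.

    Assumption: windows are clipped at the image border, so edge pixels
    average over fewer neighbours. The paper does not state its border rule.
    """
    height, width = len(channel), len(channel[0])
    half = size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            window = [channel[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out
```

With the values reported in the paper, a channel would be processed as `average_filter(add_brightness(channel, 70), 13)`.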
The second phase of image enhancement is histogram equalization, which is intended to sharpen the object in the image, namely the suspicious skin. Fig. 3 shows the part of the preprocessing phase with image enhancement, in which the object appears clearly in the middle of the skin image with a green line boundary. The result of this phase eases the subsequent segmentation process. Segmentation is the last part of preprocessing. Otsu thresholding is one of the established algorithms for separating the background (normal skin) from the object (suspicious skin). The segmented image is displayed in Fig. 3, where white is the suspicious skin and black is the normal skin [19]. In this paper, the preprocessing phase is applied to both categories of the dataset, melanoma and non-melanoma. After segmentation, the next phase is to obtain the dominant features from the dataset. Feature extraction is explained in the following subsection.
Figure legend: the periphery of the melanoma/non-melanoma area (P), drawn as a green line; the melanoma/non-melanoma area (T); the normal skin area.
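Histogram equalization followed by Otsu thresholding can be sketched as below. This is an illustrative version under stated assumptions: intensities are integers in 0..255 (filtered values would need rounding first), the function names are hypothetical, and the `lesion_is_dark` flag is an assumption about which side of the threshold holds the suspicious skin.

```python
def equalize_histogram(channel, levels=256):
    """Spread intensities using the cumulative histogram (classic equalization)."""
    pixels = [p for row in channel for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # flat image: nothing to spread
        return [row[:] for row in channel]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in channel]


def otsu_threshold(channel, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with intensity <= t form one class, the rest form the other.
    """
    pixels = [p for row in channel for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(levels):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def segment(channel, lesion_is_dark=True):
    """Binary mask: 1 marks the presumed lesion, 0 the normal skin."""
    t = otsu_threshold(channel)
    if lesion_is_dark:
        return [[1 if p <= t else 0 for p in row] for row in channel]
    return [[1 if p > t else 0 for p in row] for row in channel]
```

The threshold needs no user parameter, which matches the description of Otsu segmentation as fully automatic.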

C. Color Feature Extraction
Segmentation retrieves the main object, which plays an important role in feature extraction. Fig. 4 shows that, after segmentation, the main area is retrieved in its true color space, RGBr. The main suspicious skin is enclosed by a green line boundary, which is called the peripheral area (P). The area outside the green line is normal skin (N), and the part inside the green line (T) is the region whose characteristics we focus on. The feature extraction method focuses on obtaining information from the color space; in this case, we apply RGBr, the color space normalized from RGB. Color moments are extracted in this phase, as in previous research [7,11]. All types of extracted features are explained in Table 1. Based on Table 1, feature types 1 to 9 are each extracted from the color channels Rr, Gr, and Br using color moments, namely the average, minimum, maximum, standard deviation, and skewness of the intensities. With 3 color channels and 5 color moments, each of these feature types therefore contributes 15 features. The total number of extracted features is 147. Because so many features are retrieved, there is a need to find the distinctive attributes that represent melanoma or non-melanoma images.
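The five color moments per channel can be sketched as follows. The names `color_moments` and `region_features` are hypothetical, and the skewness definition (standardized third central moment with population variance) is an assumption, since the paper does not spell out its formula. The sketch also assumes the mask selects at least one pixel.

```python
import math


def color_moments(values):
    """Return mean, minimum, maximum, standard deviation and skewness."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    std = math.sqrt(variance)
    # Assumed definition: standardized third central moment; 0 for a flat region.
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return {"mean": mean, "min": min(values), "max": max(values),
            "std": std, "skewness": skew}


def region_features(rr, gr, br, mask):
    """15 features for one region: 5 moments for each of the Rr, Gr, Br channels."""
    features = []
    for channel in (rr, gr, br):
        # Keep only the pixels where the region mask is set.
        values = [p for row, mrow in zip(channel, mask)
                  for p, m in zip(row, mrow) if m]
        features.extend(color_moments(values).values())
    return features
```

Calling `region_features` once per region type gives the 3 × 5 = 15 features that each of feature types 1 to 9 contributes.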

D. Feature Selection
A very large number of extracted features is likely to produce overfitting, in which the number of samples in the dataset is not balanced between the training and testing sets [11]. In addition, a large number of features can bias the results because the patterns in the training data are not uniquely identified in the testing data. Feature selection is used to determine the relationships among features. In this paper, we experiment with three feature selection methods, LDA, CFS, and Relief, and compare their classification results. Table 2 shows the features selected by LDA, CFS, and Relief; the number of features is reduced significantly, to 2, 22, and 4, respectively. These feature selections were conducted with the Weka tools.
The Linear Discriminant Analysis (LDA) algorithm transforms the features by minimizing the within-class variance while maximizing the between-class differences [20]. The formula for LDA is described in Equation 2. The mean value is calculated for each class, and Si is the within-class covariance of class i. The covariance values of all classes are summed to form the within-class scatter matrix. The difference between the melanoma and non-melanoma classes is calculated as the between-class scatter matrix. The eigenvalue λ and the eigenvector w are then computed, and w is used to calculate the projection y. The value of y is obtained by multiplying the transpose of the eigenvector by the training or testing data.
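For two classes, the LDA projection direction can be sketched without a general eigen-solver, because the leading eigenvector is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The sketch below uses that two-class shortcut, which is an assumption of this example and not necessarily how Weka computes it. All function names are hypothetical.

```python
def mean_vector(rows):
    """Per-feature mean of a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, mu):
    """Scatter matrix of one class: sum of (x - mu)(x - mu)^T."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [a - b for a, b in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fisher_direction(class_a, class_b):
    """Direction w proportional to S_W^{-1} (mu_a - mu_b) for two classes."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    s_a, s_b = scatter(class_a, mu_a), scatter(class_b, mu_b)
    s_w = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(s_a, s_b)]
    return solve(s_w, [a - b for a, b in zip(mu_a, mu_b)])


def project(w, sample):
    """Projection y = w^T x of one sample onto the direction w."""
    return sum(wi * xi for wi, xi in zip(w, sample))
```

The same `project` call is applied to training and testing samples, which mirrors the description of y as the transpose of the eigenvector multiplied by the data.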

Correlation based Feature Selection (CFS)
Correlation based Feature Selection (CFS) is a feature selection algorithm that combines an appropriate correlation measure with a heuristic search strategy. CFS is well suited to supervised classification problems [13]. CFS is based on Pearson's correlation coefficient, as given in Equation 3:
$$R_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}} \tag{3}$$
where $R_{zc}$ is the correlation between the summed features and the class variable, $k$ is the number of features, $\overline{r_{zi}}$ is the average correlation between the features and the class variable, and $\overline{r_{ii}}$ is the average inter-correlation between features. CFS selects the feature subset that gives the best value of $R_{zc}$.
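The CFS score of a candidate feature subset can be sketched as below, assuming the standard merit form in which the average feature-class correlation is rewarded and the average feature-feature correlation is penalized. Using absolute Pearson correlation against numeric class labels is an assumption of this sketch, and `pearson` and `cfs_merit` are hypothetical names; the heuristic search over subsets is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels):
    """Merit of a feature subset; columns is a list of feature value lists."""
    k = len(columns)
    # Average correlation between each feature and the class.
    r_zi = sum(abs(pearson(c, labels)) for c in columns) / k
    # Average inter-correlation between distinct feature pairs.
    if k == 1:
        r_ii = 0.0
    else:
        pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
        r_ii = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)
```

Under this form, adding an exact duplicate of a feature leaves the merit unchanged, while adding a feature that correlates with the class but less with the existing features raises it.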

Relief
Relief feature selection can be used on data with highly variable attributes, in problems such as regression or classification [21]. Relief favors features on which the nearest samples of the same class are close together, while the nearest samples of the opposite class are far apart. Selecting features with good Relief scores eliminates the features that are less relevant [22].
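The nearest-hit and nearest-miss idea behind Relief can be sketched as follows. This is a basic two-class Relief that visits every sample once and scales feature differences by each feature's range; both choices are assumptions of this sketch, and the Weka implementation used in the paper may differ (for example by using several neighbours). The name `relief_weights` is hypothetical, and each class needs at least two samples.

```python
def relief_weights(samples, labels):
    """Return one relevance weight per feature (higher means more relevant)."""
    n, d = len(samples), len(samples[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for f in range(d):
        column = [s[f] for s in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i, x in enumerate(samples):
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        near_hit = min(hits, key=lambda s: distance(x, s))
        near_miss = min(misses, key=lambda s: distance(x, s))
        for f in range(d):
            # Reward separation from the nearest miss, penalize distance to the nearest hit.
            weights[f] += (diff(f, x, near_miss) - diff(f, x, near_hit)) / n
    return weights
```

Features would then be ranked by weight and the top few retained, which corresponds to keeping only a handful of the extracted features.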

E. k-Nearest Neighbour Classification
Classification is a supervised learning technique that determines which category a skin image belongs to. In this paper, k-NN is applied, so a distance metric is required. Euclidean distance is used, and Equation 4 shows the formula for this metric. After the numerical feature matrix is obtained from the training dataset, each sample in the testing set is compared with the training set using the distance metric. The parameter k is set by the user, and repeated experiments are run to find the best k for each feature selection method.
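A k-NN classifier with Euclidean distance can be sketched in a few lines. The function names are hypothetical, and the tie-breaking rule (a tied vote goes to the class encountered first among the nearest neighbours) is an assumption, as the paper does not describe how ties are resolved.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, the label seen first (closest neighbour) wins.
    return votes.most_common(1)[0][0]
```

For example, `knn_predict(train_x, train_y, sample, k=7)` would classify one test sample with the k value that the abstract reports as optimal.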

F. Evaluation using k-fold cross validation
As one way to validate classification quality, k-fold cross validation is a technique for analyzing either small or large datasets robustly [23]. It uses a chosen value of k to partition the data into training data and test data. The purpose of k-fold cross validation is to balance the test data and the training data so that overfitting does not occur. K-fold cross validation ensures that the classification algorithm is applied to the entire dataset and that every sample is used as both training data and test data.
In theory, k-fold cross validation works as follows. Suppose there are 200 samples in a dataset. If k = 10, there are 10 partitions, each containing 20 samples. With a 10:90 ratio of test data to training data, the first partition serves as the test data in the first test, while the other partitions serve as the training data. The accuracy is computed from this first test. Equation 5 defines the accuracy:
$$\text{Accuracy} = \frac{T}{N} \tag{5}$$
where T is the number of test samples classified correctly against the training set, and N is the total number of samples in the dataset. In the second test, the second partition is the test data, while the first partition and the third to tenth partitions are the training data, from which the accuracy is again computed. The test is carried out 10 times, once for each partition used as test data. After all partitions have been tested, the average accuracy over all tests is taken. The average accuracy ranges from 0 to 1, or from 0% to 100%. If the average accuracy is close to 1 (100%), the results of testing the system with k-fold cross validation are considered to meet the standard.
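The partition-and-average procedure can be sketched as below. The name `k_fold_accuracy` and the `predict(train_x, train_y, query)` callback signature are hypothetical. The sketch uses contiguous partitions, so it assumes the samples have already been shuffled so that each partition mixes both classes; whether the paper's tooling shuffles or stratifies is not stated.

```python
def k_fold_accuracy(samples, labels, predict, folds=10):
    """Average accuracy over `folds` contiguous partitions.

    Each partition is used once as test data while the rest is training data.
    Per-fold accuracy is the fraction of correctly classified test samples.
    """
    n = len(samples)
    fold_size = n // folds
    accuracies = []
    for f in range(folds):
        start = f * fold_size
        # The last fold absorbs any remainder when n is not divisible by folds.
        stop = n if f == folds - 1 else start + fold_size
        train_x = samples[:start] + samples[stop:]
        train_y = labels[:start] + labels[stop:]
        correct = sum(
            predict(train_x, train_y, samples[i]) == labels[i]
            for i in range(start, stop)
        )
        accuracies.append(correct / (stop - start))
    return sum(accuracies) / folds
```

Sweeping the number of neighbours from 1 to 100 would then amount to calling this function once per neighbour count with a k-NN `predict` callback.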

III. Experimental Results and Discussion
Two types of data are observed in this paper, melanoma and non-melanoma. All extracted and selected features are tested on both types with k-NN classification using k = 1 to k = 100. One hundred nearest-neighbor values are tested because the dataset has two classes with 100 samples each. There are 147 features extracted from the melanoma skin image analysis. After selection, there are 2, 22, and 4 features using LDA, CFS, and Relief, respectively, as listed in Table 2. LDA focuses on the melanoma area in the Rr color channel: its two features are the mean and minimum values of the Rr color channel in the melanoma area. This differs significantly from CFS, which selects 22 features covering the melanoma area, the perimeter of the melanoma area, the melanoma area up to its perimeter, the perimeter area up to normal skin, normal skin, the melanoma area with 8 and 16 quantization, and the perimeter area with 8 and 16 quantization. Relief, meanwhile, selects features from 3 main parts: the melanoma area, the perimeter, and the melanoma area up to its perimeter. All three feature selection methods select features from the melanoma area in the Rr color channel. This indicates that the Rr color channel is suitable for melanoma images with various shapes and lighting conditions.
The experimental results are evaluated using k-fold cross validation for the various numbers of nearest neighbors in k-NN classification. Each k is evaluated with and without feature selection. The value of k, the number of nearest neighbors, is tested incrementally from 1 to 100. Euclidean distance is used as the distance measure between samples in the dataset. Fig. 5 shows the accuracy under 10-fold cross validation with and without LDA feature selection. LDA feature selection was used in the previous research, and two features are selected: the mean and minimum of T in the Rr color channel. With Euclidean distance, LDA feature selection gives slightly better accuracy than no feature selection at k = 54, 55, and 56. However, classification without feature selection performs well for almost every k, with the highest accuracy reaching 0.85 at k = 6. Here, LDA considers only the melanoma area in one color channel, so other information, such as the condition of the peripheral and normal areas, cannot be analyzed. The best accuracy with LDA reaches only 0.695. The trend of the graph decreases as k increases, which suggests that a larger k is more affected by noise.
To complement the features that LDA leaves out, CFS feature selection is also tested. Using Euclidean distance, CFS produces 22 features, a more complex set than that of LDA. In Fig. 6, the values from k = 29 to k = 90 yield, on average, better accuracy than classification without feature selection. Moreover, k = 1, 2, and 4 reach an accuracy above 0.8 with CFS feature selection. By its principle, CFS selects features based on their correlation. CFS can therefore find more mutually related features than LDA. CFS applied in k-NN with k greater than 70 has an accuracy of 0.7 or lower. A greater k introduces more noise, as shown by the decreasing accuracy in the graph in Fig. 6.
Among all the feature selection methods in this paper, Relief achieves the best results with only a few features, as depicted in Fig. 7. Overall, Relief produces the best accuracy for most values of k, regardless of whether k is small or large. The highest accuracy using Relief is 0.85 at k = 19 and k = 21, the same as the accuracy without feature selection at k = 6. No accuracy falls below 0.7 with Relief feature selection for k = 1 to 100. This shows that Relief is stable and remains near-optimal across the various numbers of nearest neighbors. Relief feature selection decreases the complexity because only 4 features are involved, yet it is still able to recognize suspicious skin and the patterns of melanoma and non-melanoma images. Relief has the best trend across the various values of k. CFS performs well too, but its accuracy decreases significantly as k grows larger. LDA, meanwhile, does not give promising results because only a few values of k achieve better accuracy than classification without feature selection. This observation leads to the conclusion that CFS and Relief have better and more stable accuracy across k than LDA. With only 4 selected features, Relief yields good accuracy compared with CFS, which needs 22 features. Of the 3 feature selection methods, Relief is satisfactory across the whole experiment on k nearest neighbors, using few features while achieving the best accuracy in classifying skin images as melanoma or not.
On the other hand, classification accuracy depends not only on feature selection but also on finding appropriate extracted features. In this paper, features are extracted from images with the same dimensions but different lighting conditions. Well-segmented images are shown in Fig. 9. Besides the different lighting conditions of the images, uneven darkness within an image affects the segmentation; an example is shown in the upper right of Fig. 9. When such unevenly dark regions are extracted, the features also cover a large part of the presumed area. The problem is compounded because the system needs to crop the main area more accurately.
Preprocessing of the skin images contributes to this problem because the features are extracted from images segmented after preprocessing, and in some cases the intensities differ among images. Fig. 10 and Fig. 11 show failed segmentation of the main area in presumed melanoma and non-melanoma images. In 28 melanoma images and 26 non-melanoma images, the segmentation does not fit the skin lesion correctly. Table 3 compares the numbers of suitably segmented skin lesion images. Across the whole dataset, fractions of 0.72 of the melanoma images and 0.74 of the non-melanoma images are segmented successfully, in good condition and over the proper area. This means that almost 30% of the images have a wrongly segmented area, which can decrease the correlation values among features when the most appropriate features are selected. Including normal skin images in the dataset for comparison is needed in order to avoid bias in the extracted features when classifying melanoma and non-melanoma skin images. In addition, skin with thick hair requires a different method to segment it properly, because in some cases the thick hair in the main area to be extracted acts as noise. Hair removal is needed to ease the segmentation process.

Conclusion
Preprocessing of skin images for early-stage skin cancer detection and classification is applied in this paper. Three feature selection methods are examined: LDA, CFS, and Relief. Of these 3 methods, Relief is the most appropriate for classifying melanoma images because it finds the most correlated features. There are 147 features extracted from each image, and Relief successfully identifies 4 main features. k-NN classification based on those 4 features yields an accuracy of 85% using 10-fold cross validation with k from 19 to 21.
Future work should extend the segmentation phase by considering skin color and skin hair, because both influence the result of feature extraction. Various feature extraction methods should be examined, since their results are influenced by the segmentation or by the feature extraction method itself. In addition, because intensities differ among skin images, color transformation during segmentation should also be explored. For example, the HSV color space is commonly used for recognizing skin images, and the CIELAB color space can be made independent of lighting conditions by removing L as the lighting feature.