Bio-inspired Expert System based on Genetic Algorithm for Printer Identification in Forensic Science

Secure printing rests on the observation that a printer's output, the printed document, carries features that are specific to the printer that produced it. These printer-specific features can be exploited for document security. For example, in a case of suspected forgery, an examiner should ideally be able to identify the type of printer that was used to produce the document [1]. The importance of document security has increased because advances in computer and printing technologies have made it easy to counterfeit or forge documents such as banknotes, official documents, bank checks, visas, driver's licenses, and passports. The ability to embed information in, and extract information from, printed documents is therefore desirable for many security applications [2]. Printed documents carry features of the printing device that depend on the particular technique employed by the manufacturer for placing the tagging element on the document [3]. Printer identification is closely related to various pattern identification and recognition techniques [4].

measured over small areas such as inside a text character, and has also been shown to work across varying font type, font size, and printer consumable age with high discrimination accuracy.
In contrast, the active technique embeds an extrinsic signature in the printed page. This signature is generated by modulating process parameters in the printer mechanism to encode identifying information such as the printer's serial number and the date and time of printing [5] [11] [12]. The active technique thus embeds traceable information into the document, be it imperfections in text or images, or microscopic tracking dots that encode the printer's serial number. These tracking dots are printed in yellow on a white background, so the encoded information can hardly be seen by the naked eye [13]. However, this technique is reportedly used only with color laser printers, which limits its application dramatically: a large number of documents do not require color and may be printed in grayscale. Most printer identification systems in forensic science employ the passive technique (intrinsic signature), because active techniques modify the printing process parameters, which can degrade print quality in unexpected ways. The intrinsic signature is tied directly to the electromechanical properties of the printer, so it is hard to forge or remove [8] [12].
The printer identification task for Arabic manuscripts poses several challenges [1] [11] [12] [14]: (1) An Arabic character can take more than one shape depending on its position in a word: isolated, initial, medial, or final. (2) Several variables affect performance, including paper type, font type, font size, and printer consumable age; handling multiple font sizes and different characters increases the complexity further. (3) Advances in technology have made a variety of image processing tools available that forge documents easily and efficiently, so authenticating printed data is a major challenge. (4) Using many features of the printed document can increase processing time and reduce the classification accuracy of the recognition system, since some features may be redundant or non-informative. An efficient way to address these difficulties is to use a genetic algorithm (GA) for feature selection. Feature selection is a process that chooses a minimal subset of the original features so that the feature space is optimally reduced according to certain evaluation criteria [14]. GAs are now widely applied in science and engineering as adaptive algorithms that optimize practical problems based on the principles of natural selection. Guided by the best fitness value found by the GA, an optimal feature subset can be obtained [15]. This paper focuses on printer identification for the Arabic alphabet, which remains a challenging research topic that has not been extensively explored. The work presented here extracts Gray Level Co-occurrence Matrix (GLCM) features from the printed letter "و" ("Woo"), as it is one of the most frequently used letters in Arabic and is written in full in any position of the word.
The system also searches for the optimum feature subset using a bio-inspired feature selection technique. For classification, it employs the KNN classifier, one of the best-known neighborhood classifiers in pattern recognition. In this setting, an easy and effective way to estimate the classification error rate is the "leave-one-out" procedure. The classification accuracy of KNN serves as the fitness function for the GA.
The remainder of this paper is organized as follows. Section 2 describes some related work in printer identification. Section 3 describes the proposed system. Section 4 summarizes the experimental results, and Section 5 concludes the paper.

II. Method
In this section, we explain the framework of the printer identification system, which is based on image texture analysis. The system uses the GLCM method to obtain the features of a particular printer; a GA is then adapted to select the optimal feature set used for classification. Fig. 1 shows the block diagram of our printer identification scheme. Each step is explained in detail below. We collected our data from 10 printers of different brands, models, and serial numbers. After data collection, the documents were scanned at 1200 dpi with 8 bits/pixel (grayscale), because a high-resolution image is crisper and renders the print texture more clearly. All features were then extracted from the isolated character "و".

Character Extraction
This step extracts the "و" character from the document image, because it is one of the most frequently used letters in the Arabic language. Fig. 2 shows that different printers print this letter differently. The character was also chosen on the basis of prior experiments that tested identification accuracy with different characters.

Image Pre-processing stage
The preprocessing stage is applied in both the training and testing phases. Its purpose is to obtain the coordinates of the center of every character in an ideal form and to prepare the document image for a simple feature extraction step [5]. The preprocessing stage consists of six steps [18] [19]:
a) Conversion to grayscale: Only the gray levels of the image are of interest, so color information is irrelevant and is discarded.
b) Binarization: The grayscale character is processed by histogram-based binarization to produce a binary image that contains only 0's and 1's.
c) Noise reduction: Once the image is binarized, noise introduced into the character image during scanning is removed by median filtering.
d) Image cropping: The binary image is segmented from the background to remove the white space surrounding the character, using segmentation by vertical and horizontal projections.
e) Rotation and width normalization: The cropped image is scaled to a constant width using bi-cubic interpolation, keeping the aspect ratio fixed. The positional information of the character is normalized by calculating an angle θ about the centroid (x, y) such that rotating the character by θ brings it back to a uniform baseline. Size normalization is important because it establishes a common ground for image comparison. Herein, Taylor's maximization is used for normalization.
f) Thinning: The goal of thinning is to produce a simplified but topologically equivalent image that assists feature extraction and classification.
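As an illustration of steps b) and d), the following minimal Python sketch binarizes a small grayscale image and crops it to the character using row and column projections. It is not the MATLAB implementation used in this work: the fixed threshold of 128 stands in for the histogram-based threshold, the convention that ink pixels map to 1 is an assumption, and the function names `binarize` and `crop_by_projection` are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (list of rows, values 0-255) to 0/1.

    Assumption: dark (ink) pixels become 1 and background becomes 0.
    A fixed threshold is used here in place of a histogram-based one.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_by_projection(binary):
    """Crop a binary image to the bounding box of its ink pixels.

    The horizontal projection (row sums) gives the top and bottom limits;
    the vertical projection (column sums) gives the left and right limits.
    """
    row_sums = [sum(row) for row in binary]
    col_sums = [sum(col) for col in zip(*binary)]
    rows = [i for i, s in enumerate(row_sums) if s > 0]
    cols = [j for j, s in enumerate(col_sums) if s > 0]
    if not rows:  # no ink at all
        return []
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in binary[top:bottom + 1]]


# Toy 4 x 5 "scan": three dark pixels surrounded by white background.
gray = [
    [255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255],
    [255, 255,  10, 255, 255],
    [255, 255, 255, 255, 255],
]
cropped = crop_by_projection(binarize(gray))
```

In this toy case the white border is removed and only the 2 x 2 region containing the ink pixels remains.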

Feature extraction based on GLCM
We want to determine a set of features that describes the output documents of a printer. The proposed system treats the scanned document as an "image" and uses image analysis tools to determine the features that characterize the printer. Each printer has a different set of banding features that depends on its brand and model. Banding features, which are caused by electromechanical fluctuations and imperfections, are relatively easy to estimate from documents with large mid-tone regions but difficult to estimate from text. The GLCM is a widely used texture analysis method, especially for stochastic textures, and it yields features that can be measured over small regions of the document such as individual text characters. GLCM features have also been shown to work across varying font type, font size, printer consumables, and printer age [8] [12] [21].
The GLCM is a tabulation of how often different combinations of pixel brightness values (gray levels) occur in an image. The advantage of co-occurrence matrix calculations is that the co-occurring pairs of pixels can be spatially related in various orientations, defined by distance and angle, while considering the relationship between two pixels at a time [21]. To generate a GLCM, we first define the number of pixels in the region of interest (ROI), which is the set of all pixels within the printed area of the character. A total of 22 features can be computed from the GLCM. In their definitions, G is the normalized GLCM, n is the number of GLCM elements, and p(i,j) represents the number of occurrences of gray levels i and j within the window. GLCM extraction has several difficulties [5] [12]: (1) The dimension of the GLCM directly determines the computational cost of calculating the features. (2) There is no pre-defined method for selecting the displacement vector, and calculating co-occurrence matrices for many displacement values is computationally costly. For a given image, a large number of features can be computed from the GLCM, so a feature selection method must be used to select the most relevant ones. Extracted features are usually relevant, redundant, or irrelevant. An irrelevant feature does not contribute to the learning process, and a redundant feature adds no new information; redundant features unnecessarily increase the dimensionality of the feature space and are not expected to improve classification quality. Relevant features, in contrast, lead to the best performance. Feature selection is therefore an important step: it yields a reduced feature set that results in high classification accuracy and also improves the efficacy of the training dataset [14] [23].
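To make the co-occurrence tabulation concrete, the sketch below builds a normalized GLCM for a tiny 4-level image and derives three commonly used texture features from it. It is a pure-Python illustration rather than the implementation used in this work; the horizontal displacement (dx=1, dy=0), the symmetric counting, and the choice of contrast, energy, and homogeneity as example features are assumptions made for the example.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Return the normalized gray level co-occurrence matrix of `image`.

    `image` is a list of rows of integer gray levels in [0, levels).
    Each pixel is paired with the pixel displaced by (dx, dy).
    """
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                if symmetric:  # also count the pair in the reverse direction
                    counts[j][i] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def contrast(G):
    """Weight each co-occurrence probability by the squared level difference."""
    return sum(G[i][j] * (i - j) ** 2
               for i in range(len(G)) for j in range(len(G)))


def energy(G):
    """Sum of squared probabilities (angular second moment)."""
    return sum(p * p for row in G for p in row)


def homogeneity(G):
    """Probabilities weighted inversely by the squared level difference."""
    return sum(G[i][j] / (1 + (i - j) ** 2)
               for i in range(len(G)) for j in range(len(G)))


# Toy image with 4 gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
G = glcm(image, levels=4)
features = [contrast(G), energy(G), homogeneity(G)]
```

Repeating the computation for other displacement vectors gives additional matrices, which is where the computational cost noted above comes from.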
Feature selection is inherently a multi-objective problem with two main objectives: minimizing the number of features and minimizing the classification error. In our work, we extracted 22 features from the printed documents of the printer dataset. The dataset comprises 1000 images from 10 printers, so its dimension is 1000 x 22. A high-dimensional feature set can seriously degrade pattern or image recognition systems. A GA-based feature selection is therefore used to reduce the number of features needed by the KNN classifier. Feature subset selection is a map from an m-dimensional feature space (input) to an n-dimensional feature space (output) [24]. The GA is an optimization and search technique based on the principles of genetics and natural selection [14]. The five important issues in a GA are chromosome encoding, fitness evaluation, selection mechanisms, genetic operators, and stopping criteria [25]. An initial population is created randomly and evaluated using a fitness function. For the binary chromosome employed in this work, a gene value of '1' indicates that the feature indexed by that gene's position is selected; a value of '0' indicates that the feature is not selected when the chromosome is evaluated. Herein, tournament selection is used because it is simple, fast, and efficient, enforces higher selection pressure on the GA (resulting in a higher rate of convergence), and ensures that the worst individual does not enter the next generation [24]. In tournament selection of size 2, two chromosomes are drawn from the population and the better of the two according to fitness ranking is selected. Tournament selection is repeated until the new population is filled. Crossover and mutation then form the new population (new generation).
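The binary encoding and the size-2 tournament can be sketched in a few lines of Python. This is an illustrative sketch, not the configuration used in this work: the function names are hypothetical, the fitness values are invented toy numbers, and it assumes that a lower fitness value is better (an error-like measure) and that the two contenders of a tournament are distinct individuals.

```python
import random


def random_chromosome(n_features, rng):
    """One gene per feature: 1 means the feature is selected, 0 means it is not."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def selected_indices(chromosome):
    """Positions of the genes set to 1, i.e. the indices of the selected features."""
    return [i for i, gene in enumerate(chromosome) if gene == 1]


def tournament_select(population, fitness, rng, size=2):
    """Draw `size` distinct individuals and return the one with the lowest fitness."""
    contenders = rng.sample(range(len(population)), size)
    winner = min(contenders, key=lambda idx: fitness[idx])
    return population[winner]


def mating_pool(population, fitness, rng):
    """Repeat the tournament until the pool is as large as the population."""
    return [tournament_select(population, fitness, rng)
            for _ in range(len(population))]


rng = random.Random(0)
population = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
fitness = [0.3, 0.1, 0.9, 0.2]  # toy values; the third individual is the worst
pool = mating_pool(population, fitness, rng)
```

Because the two contenders are distinct, the individual with the worst fitness can never win a tournament, which matches the property stated above.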
The idea behind crossover is that a new chromosome may be better than both of its parents if it inherits the best characteristics of each. After crossover is performed, mutation takes place. Its purpose is to prevent all solutions in the population from falling into a local optimum of the problem being solved. Mutation randomly changes the new offspring: for binary encoding, a few randomly chosen bits are flipped from 1 to 0 or from 0 to 1 [26]. The fitness of the chromosomes is evaluated using a function commonly referred to as the objective function or fitness function [25]. Unlike traditional gradient-based methods, GAs can evolve systems with any kind of fitness measure, including functions that are non-differentiable or discontinuous. Finding a good fitness measure makes it easier for the GA to evolve a useful system [27]. For a GA to select a subset of features, a fitness function must be defined that evaluates the discriminative capability of each subset. In our case, the fitness of each chromosome in the population is evaluated using the KNN-based classification error and the cardinality of the selected feature subset, calculated as in [5] [22],
in which α represents the KNN-based classification error and N_f denotes the cardinality of the selected feature subset. The main objective is to balance minimizing the classification error against using a minimal set of features. As the GA iterates, the individuals (combinations of features) in the current population are evaluated and ranked by fitness. Because the fitness is minimized, individuals with lower fitness have a better chance of surviving into the next generation or mating pool. The error rate is reported for every chromosome evaluated, and the GA finally picks the individual with the lowest (best) fitness value, so iterating the GA drives the error rate down [24]. The adopted GA configuration parameters are shown in Table 1.
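The genetic operators and the trade-off between error and subset size can be illustrated with the following Python sketch. It is only an illustration under stated assumptions: single-point crossover and per-bit mutation are common choices that are assumed here rather than taken from Table 1, and the weighted sum in `fitness`, including its `weight` parameter, is a hypothetical stand-in and not the fitness formula of this work.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point (assumed operator)."""
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit (1 to 0 or 0 to 1) independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def fitness(error_rate, n_selected, n_total, weight=0.9):
    """Hypothetical fitness to be minimized.

    Combines the classification error (alpha) with the fraction of features
    kept (N_f / n_total). An empty subset cannot be classified, so it
    receives the worst possible value.
    """
    if n_selected == 0:
        return float("inf")
    return weight * error_rate + (1 - weight) * n_selected / n_total


rng = random.Random(0)
child_a, child_b = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], rng)
mutated = bit_flip_mutation(child_a, rate=0.05, rng=rng)

# With equal error, the smaller subset gets the lower (better) fitness.
small_subset = fitness(error_rate=0.1, n_selected=5, n_total=22)
full_set = fitness(error_rate=0.1, n_selected=22, n_total=22)
```

Under this weighting, a subset of 5 features is preferred over the full 22 whenever both reach the same error, which is the balance described above.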
In this system, feature selection is the process of choosing the optimal features by removing redundant or unnecessary features from the subset, guided by the objective function. After the 22 features are obtained, the system applies the GA-based feature selector with a fitness function that integrates accuracy (minimizing the classification error) and feature reduction (minimizing the cardinality of the selected features) to assemble the feature subsets. With these optimal features, the testing time is reduced and the learned classifier is simplified. The final feature subset contains the features selected from the original 22 that give the highest identification rate.
Nearest neighbor search is one of the most popular supervised learning and classification techniques and has proved to be a simple yet powerful recognition algorithm [28][29][30]. KNN is an instance-based classifier that works on the assumption that an unknown instance can be classified by relating it to known instances according to some distance or similarity measure [26]. Given a set of optimal features for each letter "و" in the document, the suggested method employs a 3-Nearest-Neighbor (3NN) classifier. The 3NN classifier is trained with 1000 known feature vectors: 100 feature vectors from each of the 10 printers listed in Table 2. These feature vectors are independent of one another. To classify an unknown feature vector X, the Euclidean distances between X and all the known feature vectors are computed. A majority vote among the labels of the 3 nearest vectors provides the classification result.
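The 3NN decision rule and the leave-one-out error estimate can be sketched as follows. This is a minimal pure-Python illustration with made-up two-dimensional feature vectors and the hypothetical labels "P1" and "P2"; it is not the MATLAB implementation used in this work, and ties in the vote are simply resolved in favor of the nearest neighbor.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify `x` by majority vote among its k nearest training vectors."""
    # Euclidean distance from x to every known feature vector.
    neighbors = sorted(
        ((math.dist(t, x), label) for t, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_error(X, y, k=3):
    """Hold out each sample in turn, classify it with the rest, count mistakes."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


# Toy data: two well-separated clusters standing in for two printers.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"]

label = knn_predict(X, y, (0.5, 0.5))
loo_error = leave_one_out_error(X, y)
```

In a GA run, `leave_one_out_error` would be evaluated on only the feature columns that a chromosome selects, and its value would play the role of the classification error in the fitness.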

III. Result
To test the efficiency and validity of the proposed system, a prototype was implemented in a modular fashion in MATLAB (release R2015b) and was run and tested on a TOSHIBA PC with the following specifications: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz, 4.00 GB of RAM, and 64-bit Windows 7 Ultimate. We used 10 printers of diverse brands, models, and serial numbers that are widely used in the digital evidence laboratory; they are listed in Table 2. The first step is to scan the document at 1200 dpi with 8 bits/pixel. Next, the Arabic character "و" in the document (12-point Times New Roman font) is extracted into a separate image. The training set consists of 1000 different "و" images from the different printers, whereas 100 supplementary images, randomly taken from the same document data set, are used for testing in the printer source identification mode. Accuracy was chosen as the metric for evaluating classification results [31]. In a typical forensic printer identification scenario, the accuracy rate is the critical factor in deciding the effectiveness of an approach. The first experiment therefore tests the classification accuracy for each printer using the optimal feature set (approximately 5 to 7 features, depending on the number of samples, instead of the initial 22). The adaptive feature selection algorithm is implemented in this study to find the most important features, which reduces the total evaluation time without loss of accuracy [18]. The detailed confusion matrix is shown in Table 3: diagonal elements give the correct classifications and the remaining elements give the incorrect ones. Table 4 shows the detailed confusion matrix for identification with the full feature set.
Table 3. A confusion matrix of identification results using optimal feature selection.
It can be observed from these tables that the suggested system achieves better recall and precision for many printers, especially for printers P3 to P8, several of which differ only in model number and serial number. This shows that the proposed system is effective at extracting a set of optimal features that can accurately distinguish the texture features of each printer's documents. After many experiments, the set of optimal features that together achieve the highest accuracy was found to be contrast, similarity, mean, diagonal moment, and sum of variance.
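For reference, accuracy, precision, and recall can be read off a confusion matrix as in the Python sketch below. The 3 x 3 matrix is invented toy data, not the content of Table 3 or Table 4, and the sketch assumes that rows hold the true printer and columns hold the predicted printer.

```python
def accuracy(cm):
    """Share of all samples that lie on the diagonal (correct classifications)."""
    total = sum(map(sum, cm))
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def precision_recall(cm, k):
    """Precision and recall of class k (rows: true class, columns: predicted)."""
    true_positive = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    precision = true_positive / predicted_as_k if predicted_as_k else 0.0
    recall = true_positive / actually_k if actually_k else 0.0
    return precision, recall


# Toy confusion matrix for three printers with 10 test samples each.
cm = [
    [9, 1, 0],
    [2, 8, 0],
    [0, 0, 10],
]
overall = accuracy(cm)
precision_0, recall_0 = precision_recall(cm, 0)
```

Here the first printer is recognized in 9 of its 10 samples (recall), while 9 of the 11 samples assigned to it are correct (precision).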
The second set of experiments compares the identification accuracy of the proposed system, which employs a GA to determine the optimal features, with a re-implementation of the printer identification system introduced in [12], using the same data sets. The results (see Table 5) show that the 5 optimal features [f2, f5, f7, f11, f15] with the 3NN classifier improve the identification rate by 4 percentage points over the same method without the feature selection phase (22 features with the same classifier), and by about 14 percentage points over the method that relies on GLCM features and a 5NN classifier (91% versus 77.1%). The improvement comes from the GA extracting optimal (discriminative) features with the help of the multi-objective fitness function, which combines the recognition error and the cardinality of the selected features.
Table 5. The identification accuracy rates among different algorithms.

Method                                        Accuracy rate (%)
Suggested method with optimal features        91
Suggested method with full features (3NN)     87
Traditional method using GLCM and 5NN [12]    77.1

The third set of experiments shows how the identification rate of the proposed system depends on the number of samples per printer: the more enrolled samples a printer has, the higher the chance of a correct hit. The maximum number of sample documents is 100 per printer, and these samples cover different variations of the image, such as font type, font size, rotation, resolution change, and resizing. Above 100 samples per printer, however, each extra sample brings diminishing returns in performance because of the increase in intra-class variability. As expected, Table 6 shows that the identification rate increases as the number of samples grows, as a result of the increase in inter-class variability. Beyond 400 samples, the accuracy rate grows by approximately 2-5% for each additional 100 samples in the dataset. The fourth set of experiments was conducted to confirm that the character "و" ("Woo") is the most appropriate Arabic letter for the printer identification task; the results are shown in Table 7. In general, the "و" character contains a set of bends and circles from which unique features that characterize each printer can be extracted. Furthermore, Table 8 shows the extent to which image resolution affects the accuracy of printer identification. As expected, the higher the resolution, the better the accuracy: a sharper letter image improves feature extraction, which in turn enhances the accuracy rate. Finally, increasing the number of neighbors in the KNN classifier may decrease the identification accuracy, as illustrated in Table 5, in addition to increasing the computational cost.
In some cases, increasing K in the KNN classifier over-smooths the decision boundary (underfitting), because neighbors from other printers take part in the vote.
The complexity of the system depends mainly on the number of samples (n) and the number of generations within the GA. Because the system is built in MATLAB and calls many nested functions, its complexity is difficult to compute exactly, so computational time is used as a measure instead. For the online stage, the system requires approximately 50 ms to identify the suspected printer using 5 optimal features, depending on the configuration of the machine. For the offline phase, the system needs more time for feature extraction and feature selection: between 400 and 900 seconds, depending on the number of samples. Overall, the complexity is roughly O(n^2), which opens the opportunity to integrate the system with other tools into an online printer identification mechanism embedded in an automated real-time digital forensic system, especially for fraud and forgery research on printers.

IV. Conclusion
This paper presented an intelligent system for identifying the source printer of documents in the cursive Arabic script, suitable for printer fraud and forgery research in forensic science. The suggested system significantly reduces the number of GLCM features used by the KNN classifier by using a GA to select the optimal feature set. This optimal set achieves two objectives: minimizing the classification error and reducing the number of features employed. The classification accuracy of KNN is one of the variables in the GA fitness function. The KNN classifier is one of the best-known neighborhood classifiers in pattern recognition. Integrating the naïve yet sensitive KNN classifier with a GA, used as a simple configurable meta-heuristic search engine for optimal features, yields a simple yet accurate identification system. The experiments showed that the identification accuracy reaches 91% with the optimal feature set, against 86% for the traditional features using the same classifier. Future work will focus on improving the robustness of the method by making it work with all brands of devices and on improving its accuracy at lower resolutions.