Article (open access, peer reviewed)

A Fine‐Grained Flash‐Memory Fragment Recognition Approach for Low‐Level Data Recovery

2022; Institution of Engineering and Technology; Volume: 31; Issue: 4; Language: English

10.1049/cje.2020.00.206

ISSN

2075-5597

Authors

Li ZHANG, Shengang HAO, Quanxin ZHANG

Topic(s)

Advanced Data Storage Technologies

Chinese Journal of Electronics, Volume 31, Issue 4, pp. 732-740. Information Security and Cryptology. Open Access. First published: 01 July 2022. This work was supported by the National Natural Science Foundation of China (61802210).

Abstract

Data recovery from the flash memory in a mobile device can effectively reduce the loss caused by data corruption. Type recognition of data fragments is an essential prerequisite for low-level data recovery. Previous works in this field classify a data fragment based on its file type, but the classification efficiency is low, especially when the data fragment is part of a composite file. We propose a fine-grained approach to classifying data fragments from low-level flash memory that improves classification accuracy and efficiency. The proposed method redefines the flash-memory-page data recognition problem based on the encoding format of the data segment, and applies a hybrid machine learning algorithm to detect the data type of a flash page. The hybrid algorithm significantly decomposes the given data space and reduces the cost of training. Experimental results show that our method achieves better classification accuracy and higher time performance than existing methods.

I. Introduction

With the development of mobile communication technology, various kinds of mobile devices (e.g., smartphones, smart vehicles, and wearable watches) play essential roles in our daily lives. Users' activities (e.g., private photos, household expenditure statistics, activity tracks, and health records) produce a large amount of sensitive data, and these data are stored in the internal flash memory of mobile devices. This has brought about rapid growth in the demand for data security in the field of mobile devices[1,2].
These personal data are of significant value to users, who will suffer massive losses from data damage, data tampering, and data loss caused by hardware and software problems[3,4]. In such cases, data recovery is the only way to minimize the losses. Although data recovery based on the file system (e.g., EXT3/4, HFS+) has been well studied for nearly 20 years, recovering data directly from the flash memory chip still faces many challenges. Due to the particular storage characteristics of flash memory, user data are not always stored sequentially, and data fragmentation worsens the longer the flash memory is in use. As a result, recovering data from low-level flash memory fragments has become an urgent issue.

The physical image of a flash memory fragment consists of unintelligible binary data blocks[5], so the unknown data formats of these blocks must first be recognized. Here, the data format refers to the file type of the original file, because every data block comes from some file. Since no valid file system metadata is available in a flash memory fragment, quickly identifying the file type of a data block is a critical issue in data recovery.

At present, existing file type recognition techniques fall into three categories: methods based on the file extension, methods based on the magic number, and methods based on file content features. Since the file extension and the magic number are easily tampered with, file type recognition based on file content features has been extensively developed[6]. This approach computes statistics over the binary content of a file to obtain a feature vector, and then identifies the file type using similarity measures or machine learning methods[7,8]. However, classifying data fragments by file type is a coarse-grained approach; in particular, when a fragment comes from a common composite file, the recognition performance is not ideal[9].

In this paper, we propose a fine-grained flash fragment recognition method based on the data type. The method first defines the notion of data type, then uses the flash-memory page as the basic unit of data type recognition, and applies a mixed machine learning algorithm to classify the data type of each flash memory page, achieving good classification accuracy. One contribution of this paper is to define data types and file types, analyze their differences for flash-memory-page data type recognition, and point out that flash data fragment classification based on file types is a coarse-grained classification method. The other contribution is a fine-grained, machine-learning-based method for classifying flash-memory-page data types; that is, machine learning is used to identify the data type of the original file content located in a flash memory page.

The remainder of this paper is organized as follows. Section II presents related work. Section III describes our modeling approach. Experimental results are reported in Section IV. Finally, Section V concludes the paper.

II. Related Work

Identifying file fragment types based on file content was first proposed by McDaniel[10], who used the byte frequency distribution (BFD) and byte frequency correlation (BFC) features to identify the file type. The average classification accuracies of the two methods are 27% and 46%, respectively.
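To make the BFD feature concrete, the following minimal sketch (our own illustration, not code from the paper or from Ref.[10]) computes the normalized 256-bin byte histogram of a fragment:

```python
# Byte frequency distribution (BFD): the normalized 256-bin histogram of
# byte values in a fragment. Illustrative sketch only.
import numpy as np

def byte_frequency_distribution(fragment: bytes) -> np.ndarray:
    counts = np.bincount(np.frombuffer(fragment, dtype=np.uint8), minlength=256)
    return counts / max(len(fragment), 1)   # frequencies sum to 1

print(byte_frequency_distribution(b"aababc")[ord("a")])  # 0.5
```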
When a file fragment contains the metadata of the file header or tail, the accuracy can reach about 90%[11]. However, it is difficult to obtain complete, well-preserved original documents in digital forensics, and detecting the fragment type of an incomplete file is becoming increasingly important[12]. Therefore, various methods of file fragment recognition have been discussed in the last decade or so.

Veenman[13] selects 28 types of file fragments in separate data sets and uses linear discriminant analysis (LDA) on features such as 1-gram frequencies, Shannon entropy, and Kolmogorov complexity. To detect the file fragment type, he uses the Fisher criterion to select the extreme vector as the optimal projective direction, so that different file types have large projection distances. The classification accuracy of this method is about 45%. Calhoun[14] uses a feature extraction method similar to Veenman's[13]. They used small data sets covering four file types (JPG, BMP, GIF, and PDF) and two algorithms for detecting the fragment type (Fisher linear discriminant and longest common subsequence), obtaining high accuracy (about 88.3%). However, their accuracy is not comparable to other work, because their experiment includes only four file types and lacks the file types that are easily confused.

Axelsson[15,16] proposes a file fragment type recognition method that uses the normalized compression distance (NCD) as a feature. The underlying principle is that fragments of the same file type have a relatively small compression distance, while fragments of different file types have a relatively large one. In the experiment of Ref.[15], the author selects 28 file types from an open data set and uses the k-nearest neighbors method; for different k values, the accuracy varies from 32% to 36%. Beebe et al.[17] conducted a detailed study of file fragment classification. They selected 38 file types and analyzed 1-gram, 2-gram, Shannon entropy, Hamming weight, byte frequency distribution, and other features. They chose the support vector machine and compared the classification performance of the linear kernel and the radial basis kernel. The final experimental results show that classification is best with the combination of 1-gram and 2-gram features and a linear kernel, with an accuracy of 73.45%.

From this literature review, we can see that fragment recognition methods based on the file type achieve classification accuracies no higher than 75%, so a fine-grained fragment recognition method is essential to improve accuracy. What is more, flash memory has become the primary storage device in today's computer systems, and the file fragments stored in flash memory pages are the smallest fragment recognition targets. Therefore, we propose a fine-grained flash fragment recognition method based on the data type.

III. The Proposed Approach

In this section, we present the proposed approach based on the new concepts.

1. Differences between data type and file type

The data type represents the encoding format of a data segment and the form of its data organization. Data blocks with the same data type use the same data encoding scheme and data organization form. The file type represents the type of the file that a data segment belongs to, indicating how data blocks are organized within a file. For example, GZ and ZIP are two different types of compressed files, but both use the compression algorithm called "deflate" in their data blocks; thus, their data types belong to "deflate." In general, the file type describes a complete file, while an entire file may contain one or more data types (see Fig. 1). For a file fragment, we can always infer its data type, but in most cases we cannot deduce its file type.

[Fig. 1: Data type and file type]
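The GZ/ZIP example can be demonstrated concretely. The following sketch (our own illustration, not from the paper) compresses the same payload into both containers, strips each container's metadata, and decodes both with the same raw-deflate decoder:

```python
# GZ and ZIP are different *file* types, yet both store their payload as a
# raw "deflate" stream, i.e., the same *data* type. Illustrative sketch only.
import gzip, io, zipfile, zlib

payload = b"the same payload, repeated " * 256

gz_bytes = gzip.compress(payload)
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("payload.bin", payload)
zip_bytes = zip_buf.getvalue()

# GZ: 10-byte basic header, deflate stream, 8-byte CRC/size trailer.
gz_deflate = gz_bytes[10:-8]
# ZIP: 30-byte local file header, then file name and extra field, then data.
name_len = int.from_bytes(zip_bytes[26:28], "little")
extra_len = int.from_bytes(zip_bytes[28:30], "little")
zip_deflate = zip_bytes[30 + name_len + extra_len:]

for stream in (gz_deflate, zip_deflate):
    # wbits=-15 selects a headerless (raw) deflate decode; trailing container
    # bytes, if any, end up in unused_data and are ignored.
    d = zlib.decompressobj(wbits=-15)
    assert d.decompress(stream) == payload
print("both containers hold deflate data")
```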
2. Definition of the fine-grained flash-memory-page data recognition problem

A page is the smallest storage unit of flash memory[18], so recognizing the flash-memory-page data type is the critical problem in file recovery and file reorganization for flash memory. Flash-memory-page data must belong to a file[19]; that is, a page is a file fragment. File fragment recognition identifies the type of a file piece of unknown format. Traditional methods try to infer the file type of a file fragment, so their accuracy is not high in practice. Based on the difference between the file type and the data type described above, fine-grained flash-memory-page data recognition selects the page as the smallest file fragment and identifies its data type.

3. Data type recognition based on decision tree and support vector machine

At present, the standard machine learning algorithms for classification problems include naive Bayes, support vector machines[20], neural networks[21], decision trees[22], k-nearest neighbors, AdaBoost, and bagging. Each machine learning algorithm has its own characteristics[23]. For example, the k-nearest neighbor algorithm takes the shortest time to build a classification model but a long time to produce an output. The support vector machine classifier has the longest construction time, because it spends a lot of time calculating the optimal hyperplane, but its classification accuracy is high. Bagging[24] is an ensemble algorithm; it can avoid the over-fitting problem of weak classifiers, but its training time is relatively long, its classification results depend on parameter selection, and the parameter adjustment process is complicated. Because of the high accuracy of the support vector machine and the interpretability of the decision tree[25], this paper combines the decision tree and the support vector machine to classify and identify the data types of flash-memory-page data.

1) Building the data sets

The familiar user files in smartphones mainly include picture files, application files, Web files, audio files, Office files, and SQLite database files. The file types involved are JPG, PNG, BMP, JAVA, XML, MP3, RTF, CSV, ZIP, PPT, and DOC, among others. Identification of SQLite database file fragments was discussed in our earlier paper[26], so it is not covered here. Each of the seven file types JPG, BMP, JAVA, XML, MP3, RTF, and CSV corresponds to a single data type, written in lowercase: jpg, bmp, java, xml, mp3, rtf, and csv. The remaining PNG and ZIP[27] files adopt the "deflate" compression algorithm, so their data types correspond to "deflate." DOC and PPT files may contain a large number of jpg and png images[28] and some OLE objects (whose actual data type is also "deflate"), so their data types include jpg and "deflate." Besides, a DOC document contains many characters, so wdt is used as a data type, and the standard meta-information in a DOC is represented as wdm. Similarly, the characters of a PPT document are represented as pdt, and the standard control information (meta-information) in a PPT document is represented as pdm. Fig. 2 describes the correspondence between common file types and data types in mobile storage devices.

[Fig. 2: The correspondence between 11 file types and 12 data types]

When building a data set, we need to consider the size of the file fragments. Generally speaking, the larger the fragment, the more comprehensive the information it contains, and the higher the probability that it is correctly identified. In the field of smartphone forensics, the flash memory page is the smallest data unit. At present, the page size of mainstream flash memory is 4096 or 8192 bytes. Considering that the minimum allocation unit of the standard mobile phone file systems EXT4 and HFS+ is usually 4096 bytes, the size of the file fragments in this paper is 4096 bytes.
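The Fig. 2 correspondence can be written out as a small mapping. The sketch below is our own summary of the text above (not the authors' code): simple file types map to one data type, while compressed and composite types map to one or more shared data types.

```python
# The 11 file types and 12 data types of Fig. 2, as described in the text.
FILE_TO_DATA_TYPES = {
    "JPG": ["jpg"], "BMP": ["bmp"], "JAVA": ["java"], "XML": ["xml"],
    "MP3": ["mp3"], "RTF": ["rtf"], "CSV": ["csv"],
    "PNG": ["deflate"], "ZIP": ["deflate"],            # deflate-compressed
    "DOC": ["wdt", "wdm", "jpg", "deflate"],           # composite documents
    "PPT": ["pdt", "pdm", "jpg", "deflate"],
}
DATA_TYPES = sorted({t for types in FILE_TO_DATA_TYPES.values() for t in types})
assert len(FILE_TO_DATA_TYPES) == 11 and len(DATA_TYPES) == 12
```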
2) Feature extraction

Data type recognition of file fragments is a standard classification problem, and feature selection is significant for it: the selected features directly affect the construction and separability of the feature space, and hence the performance of the classifier[29]. Only features that discriminate between data types can achieve good classification results. Combinations of features such as byte frequency distribution (BFD), rate of change (RoC), multiple-byte vectors (MBV), entropy, and Lempel-Ziv (LZ) complexity[31] have been validated in a great deal of research work, for example by Sportiello and Zanero[30]. In this paper, we use three features: byte frequency distribution, Shannon entropy, and LZ complexity. This selection is mainly based on the following considerations:

a) The byte frequency distribution is widely used. Most researchers use it in their experiments, and it distinguishes file fragment types well;

b) LZ complexity counts the number of distinct substrings in a binary data stream. It distinguishes compressed data from uncompressed data well;

c) The byte frequency distribution and Shannon entropy can be computed cheaply, which improves the efficiency of feature extraction.
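A minimal sketch of this 258-dimensional feature vector follows (our own implementation; the paper gives no code, and the exact normalizations and the LZ78-style phrase count are our assumptions):

```python
# 258-dimensional feature vector: 256 BFD components, Shannon entropy, and
# LZ complexity (distinct substrings in a greedy left-to-right parse).
import numpy as np

def shannon_entropy(fragment: bytes) -> float:
    counts = np.bincount(np.frombuffer(fragment, dtype=np.uint8), minlength=256)
    p = counts[counts > 0] / len(fragment)
    return float(-(p * np.log2(p)).sum())          # bits per byte, in [0, 8]

def lz_complexity(fragment: bytes) -> int:
    """Count distinct phrases in a greedy LZ78-style parse."""
    phrases, current = set(), b""
    for b in fragment:
        current += bytes([b])
        if current not in phrases:
            phrases.add(current)
            current = b""
    return len(phrases) + (1 if current else 0)

def feature_vector(fragment: bytes) -> np.ndarray:
    bfd = np.bincount(np.frombuffer(fragment, dtype=np.uint8),
                      minlength=256) / len(fragment)
    return np.concatenate([bfd, [shannon_entropy(fragment)],
                           [lz_complexity(fragment) / len(fragment)]])

page = bytes(np.random.default_rng(0).integers(0, 256, 4096, dtype=np.uint8))
assert feature_vector(page).shape == (258,)
```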
3) Classifier construction

Using these three features, a 258-dimensional vector is generated (256 BFD components, the Shannon entropy, and the LZ complexity), and these 258 feature components are used to build the classifier. In this paper, a machine learning algorithm combining a decision tree and SVMs is used to classify the data types of flash-memory fragments. First, a binary C4.5 decision tree is built as a first-level classifier over all training samples, as follows:

Step 1: The information gain ratio of each feature component $f_i$ ($i = 1, 2, \dots, 258$) at a candidate split value $v$ is calculated by

$$\mathrm{IGR}(f, v) = \frac{I(S) - \frac{|S_{f<v}|}{|S|} I(S_{f<v}) - \frac{|S_{f \ge v}|}{|S|} I(S_{f \ge v})}{-\left( p_{S_{f<v}} \log p_{S_{f<v}} + p_{S_{f \ge v}} \log p_{S_{f \ge v}} \right)}$$

where $S$ is the set of training samples reaching a given node $E$, $f$ is a feature component and $v$ is one of its values, $S_{f<v}$ and $S_{f \ge v}$ are the subsets of $S$ with $f < v$ and $f \ge v$, $p_{S_{f<v}} = |S_{f<v}|/|S|$ (and analogously for $p_{S_{f \ge v}}$), $I(S)$ is the information entropy of a data set $S$, and $|S|$ is the size of $S$.

Step 2: Select the component $f$ with the maximum $\mathrm{IGR}(f, v)$ as the best classification feature of the split node $E$, and divide the training samples flowing to $E$ into two branches accordingly: samples with $f < v$ are sent to the left branch node, and samples with $f \ge v$ are sent to the right branch node. Here, $v$ is the classification feature value corresponding to the maximum $\mathrm{IGR}(f, v)$.

Step 3: Remove the feature component selected in the previous step, and repeat Steps 1-3 on the sample set flowing to each branch node until one of the following conditions is met, at which point the branch ends:

a) The number of samples reaching a branch node is less than the threshold θ. The value of θ is determined through data training, as described below.

b) All samples reaching a node have the same label. In this case, the information gain of these samples is 0.

c) All feature components have been used up; that is, none remain after removing the selected components.
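A small sketch of Step 1's computation (our own illustration; base-2 logarithms are an assumption):

```python
# Gain ratio for a binary split f < v, as in Step 1. `feature` holds one
# feature component over the samples in S, `labels` their data-type labels.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature: np.ndarray, labels: np.ndarray, v: float) -> float:
    left = feature < v
    p_left = left.mean()
    if p_left in (0.0, 1.0):                 # degenerate split: no information
        return 0.0
    gain = entropy(labels) - p_left * entropy(labels[left]) \
                           - (1 - p_left) * entropy(labels[~left])
    split_info = -(p_left * np.log2(p_left) + (1 - p_left) * np.log2(1 - p_left))
    return gain / split_info
```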
After generating the decision tree, we use the data segment samples that flow to each leaf node as the training set of an SVM classifier attached to that leaf (Fig. 3). In the early stage of the training phase, we set the same parameter values for each leaf node's SVM classifier; after multiple iterations, the parameter values of each SVM classifier become optimal, as detailed below. In the validation phase, we first feed the test data into the decision tree. If a given test object $X$ reaches a leaf node whose samples all carry the label $y_i$, $X$ is classified as $y_i$; otherwise, we classify it with the SVM classifier associated with that leaf node.

[Fig. 3: Mixed classifier architecture of decision tree and SVM]

The SVM is a common binary classifier that separates samples by constructing a hyperplane function[20]. The specific steps are as follows:

a) Suppose a known training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\} \in (X \times Y)^m$, in which $x_i \in X = R^n$ is a feature vector and $y_i \in Y = \{1, -1\}$ ($i = 1, 2, \dots, m$).

b) Take an appropriate kernel function $K$ and parameter $C$, and construct the optimization problem

$$\min_a \; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j a_i a_j K(x_i, x_j) - \sum_{j=1}^{m} a_j$$
$$\text{s.t.} \; \sum_{i=1}^{m} y_i a_i = 0, \quad 0 \le a_i \le C, \; i = 1, 2, \dots, m$$

whose optimal solution is $a^* = (a_1^*, a_2^*, \dots, a_m^*)^T$.

c) Select a positive component $a_j^*$ of $a^*$ ($0 < a_j^* < C$) and calculate the threshold

$$b^* = y_j - \sum_{i=1}^{m} y_i a_i^* K(x_i, x_j)$$

d) Construct the decision function

$$F(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} a_i^* y_i K(x, x_i) + b^* \right)$$

In this paper, $n = 258$; that is, each feature vector $x_i$ has 258 components.

In the training stage of the whole mixed classifier, two kinds of parameters must be optimized: the leaf-node threshold θ of the decision tree, and the SVM parameters, which vary with the chosen kernel function $K$. Let σ represent a parameter set of the SVM. Their optimal values are determined as follows. First, set $t = 0$, where $t$ indexes the training iteration. Train a decision tree with the initial threshold $\theta_0$, then use the sample sets of its leaf nodes to train the SVM parameters $\sigma \in \Phi$, where Φ is the set of all candidate SVM parameter sets. As a result, we obtain the validation accuracy $p(\theta_0, \sigma)$ of the mixed classifier. Next, we build a mixed classifier with a larger θ value. The training process is as follows[32]:

a) Increase $t$ by 1 and set $\theta_t = 2\theta_{t-1}$; that is, regenerate the decision tree using the threshold $\theta_t$ instead of $\theta_{t-1}$, which means that in each iteration the former leaf nodes are removed and their parent nodes are retained, so the parent nodes at time $t-1$ become the new leaf nodes at time $t$. Then train the SVM parameters σ on the sample sets of the new leaf nodes, obtaining the validation accuracy $p(\theta_t, \sigma)$ of the mixed classifier at time $t$.

b) If $p(\theta_t, \sigma) - p(\theta_{t-1}, \sigma) \ge 0.005$ and $\theta_t$ is less than the total number of training samples, go to step a).

c) If $p(\theta_t, \sigma) - p(\theta_{t-1}, \sigma) < 0.005$, set $\theta_\sigma = \theta_{t-1}$; or, if $\theta_t$ is greater than or equal to the total number of training samples, set $\theta_\sigma = \theta_t$.

When the training of the mixed classifier is complete, we take $\sigma_{opt} = \arg\max_\sigma p(\theta_\sigma, \sigma)$ and $\theta_{opt} = \theta_{\sigma_{opt}}$. Finally, a mixed classifier with SVM parameters $\sigma_{opt}$ and threshold $\theta_{opt}$ is output.
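The following sketch outlines this training schedule. It is our own approximation, not the authors' code: scikit-learn's CART tree with min_samples_leaf stands in for the paper's binary C4.5 with threshold θ, and X_tr, y_tr, X_val, y_val are assumed 258-dimensional feature matrices and label vectors.

```python
# Mixed decision-tree + SVM classifier with the theta-doubling schedule.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def train_mixed(X_tr, y_tr, X_val, y_val, theta, svm_params):
    tree = DecisionTreeClassifier(min_samples_leaf=theta).fit(X_tr, y_tr)
    leaf_ids = tree.apply(X_tr)
    svms = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        if len(np.unique(y_tr[mask])) > 1:          # impure leaf: attach an SVM
            svms[leaf] = SVC(**svm_params).fit(X_tr[mask], y_tr[mask])
    preds = tree.predict(X_val)                     # pure-leaf predictions
    for i, leaf in enumerate(tree.apply(X_val)):
        if leaf in svms:                            # overridden by the leaf SVM
            preds[i] = svms[leaf].predict(X_val[i : i + 1])[0]
    return tree, svms, float(np.mean(preds == y_val))   # p(theta, sigma)

def tune_theta(X_tr, y_tr, X_val, y_val, svm_params, theta0=1500):
    theta = theta0
    _, _, prev = train_mixed(X_tr, y_tr, X_val, y_val, theta, svm_params)
    while True:
        theta *= 2                                  # theta_t = 2 * theta_{t-1}
        if theta >= len(y_tr):                      # rule c), second case
            return theta
        _, _, acc = train_mixed(X_tr, y_tr, X_val, y_val, theta, svm_params)
        if acc - prev < 0.005:                      # rule c), first case
            return theta // 2                       # i.e., theta_{t-1}
        prev = acc                                  # rule b): keep doubling
```

Repeating tune_theta for each candidate σ ∈ Φ and keeping the best p(θ_σ, σ) would then yield σ_opt and θ_opt.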
IV. Experimental Results

In this section, we describe the data sets used in the experiments and the design of the grouped experiments; the experimental results are then given and discussed.

1. The data sets

All the files used in this experiment come from the GovDocs1 data set, an open data set released by Simson Garfinkel et al.[33] in 2009. The reduced GovDocs1 contains a total of 21,000 files covering a variety of common file types, and the downloaded compressed archive is as large as 19.8 GB. When a file is selected, data fragments are extracted from it. The fragment length is defined as 4096 bytes, since a flash page typically holds 4096 bytes, so the starting addresses of all pieces within a file are integral multiples of 4096. The following rules[33] are adhered to while extracting the data fragments (a sketch of the extraction appears after the list):

a) Fragments are not taken from the head or tail of a file, because magic number information is generally located in a file's head and the last piece at a file's tail usually has fewer than 4096 bytes. Both cases would affect the classifier, so such fragments are discarded.

b) At least 10,000 fragments of each data type are collected, drawn from at least 20 different files. If the number of fragments is too small, or they all come from the same file, the samples may not be representative, and the resulting classification model may have an unpredictable bias.

c) For composite documents, the document format is analyzed first; their data fragments are extracted from the different components and assigned the corresponding data types.
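A minimal sketch of rules a) and b) (our own illustration; composite-document handling from rule c) is omitted, and the file list is hypothetical):

```python
# Extract 4096-byte, page-aligned fragments, skipping a file's head and tail.
import os

PAGE = 4096

def fragments(path):
    """Yield full pages of `path`, excluding the first page (magic numbers)
    and the last piece (usually shorter than one page)."""
    n_pages = os.path.getsize(path) // PAGE       # complete pages only
    with open(path, "rb") as f:
        for i in range(1, n_pages - 1):           # rule a): skip head and tail
            f.seek(i * PAGE)
            yield f.read(PAGE)

def build_samples(paths, limit=10_000):
    """Rule b): gather at least `limit` fragments drawn from many files."""
    samples = []
    for path in paths:                            # e.g., >= 20 GovDocs1 files
        samples.extend(fragments(path))
        if len(samples) >= limit:
            break
    return samples
```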
2. Design of grouping experiments

Three groups of experiments were designed to test the effectiveness of the proposed method. The first experiment finds the optimal kernel function and related parameters of the SVM; in the SVM algorithm the kernel function usually has four choices (the linear kernel, the polynomial kernel, the radial basis function (RBF), and the sigmoid kernel), and each kernel uses a different number of parameters. The second experiment compares our mixed algorithm with three other machine learning algorithms (C4.5, SVM, and bagging) in terms of accuracy, recall, and training time. Finally, the third experiment compares flash fragment recognition based on the data type with recognition based on the file type. We analyzed the advantages and disadvantages of the various algorithms in different situations.

3. Results of different kernel functions used in the mixed classifiers

For the SVM in the mixed classifiers, the linear kernel applies to linearly separable classification problems, while the other three kernels (polynomial, radial basis, and sigmoid) are used for non-linearly separable ones. The fragment classification problem is generally not strictly linearly separable, so a penalty factor C must be added. A small C means the algorithm tolerates classification errors; a large C means classification errors are penalized heavily. Table 1 shows the test accuracies of the mixed classifier when applying the four kernel functions to the GovDocs1 data sets in the first experiment. The initial threshold θ0 is set to 1500 heuristically.

Table 1. Test accuracy of classification under different parameter values and kernel functions

C    | γ     | r    | d | Precision (%) | Kernel function
-----|-------|------|---|---------------|----------------
32   | –     | –    | – | 87.52         | Linear
64   | –     | –    | – | 88.63         | Linear
128  | –     | –    | – | 89.48         | Linear
256  | –     | –    | – | 89.58         | Linear
128  | 2     | 0    | 3 | 90.37         | Polynomial
256  | 0.008 | 0    | 3 | 43.83         | Polynomial
256  | 2     | 1    | 1 | 90.71         | Polynomial
128  | 0.125 | –    | – | 88.75         | RBF
256  | 0.5   | –    | – | 90.63         | RBF
512  | 2     | –    | – | 90.97         | RBF
256  | 0.001 | 0.0  | – | 66.73         | Sigmoid
256  | 0.008 | 0.05 | – | 77.2          | Sigmoid
256  | 2     | 0.1  | – | 57.9          | Sigmoid

From the results in Table 1, it can be seen that for the linear kernel the test accuracy increases with C, but stabilizes once C grows large enough. The radial basis function gives the highest classification accuracy among the four kernels. For the polynomial kernel, the parameter d has little influence on the results, but γ has a significant impact: the test accuracy increases rapidly with γ. For the sigmoid kernel, the test accuracy is low; even when the parameters r (the kernel coefficient) and γ are adjusted, it does not exceed 80%, so this kernel is evidently unsuitable for data fragment classification. In summary, the optimal kernel for the mixed classifiers is the radial basis function, with large values of C and γ, e.g., C = 512 and γ = 2.

4. Results of four different machine learning methods

Applying the four algorithms (mixed classification, C4.5, bagging, and SVM) to the GovDocs1 data sets yields the results shown in Fig. 4. Fig. 4(a) shows four performance indexes (validation accuracy, recall, F-value, and ROC area), while Fig. 4(b) displays the training times separately because their values vary over a large range.

[Fig. 4: Comparison of (a) performance and (b) training time of the four machine learning algorithms based on the data type]

To make these methods comparable, we select the SVM as the base learner in bagging. Each SVM is trained on 1,500 training samples chosen randomly, and all training is conducted sequentially; we stop when the validation accuracy of n SVMs does not exceed that of n − 1 SVMs by more than 0.005. Besides, we use the radial basis function and the same parameter values in each SVM classifier, and the threshold θ is fixed at 1500. The experiments were performed on an Intel Core i7 CPU at 2.0 GHz with 8 GB of RAM, using 10-fold cross-validation.

From the results in Fig. 4, it can be seen that the C4.5 algorithm has the shortest training time and the lowest validation accuracy; that our mixed algorithm has the highest validation accuracy, with a training time between those of C4.5 and SVM; and that the validation accuracy of bagging (ensemble SVM) is higher than that of SVM, although its training time is nearly six times longer. As a result, the mixed algorithm is the most suitable for data type classification of flash fragments because of its high performance and low cost.

5. Results of flash fragment recognition based on the file type method

In this experiment, we select eleven file types as the targets of flash fragment recognition: JPEG, PNG, BMP, JAVA, XML, MP3, RTF, CSV, PDF, PPT, and DOC, as described in Section III.3.1). The test steps are similar to those in the first experiment, so they are not repeated here. However, the rules for extracting the file fragments are not exactly the same as in the first experiment. They are as follows: a) The siz
