Text segmentation of health examination item based on character statistics and information measurement

Article, open access, peer reviewed

2018; Institution of Engineering and Technology; Volume: 3; Issue: 1; Language: English

10.1049/trit.2018.0005

ISSN

2468-6557

Authors

Hui An, Dahui Wang, Zhigeng Pan, Meiling Chen, Xinting Wang

Topic(s)

Traditional Chinese Medicine Studies

Abstract

CAAI Transactions on Intelligence Technology, Volume 3, Issue 1, pp. 28-32
Research Article, Open Access

Hui An: DigitalMedia & Interaction Research Center, Hangzhou Normal University, Wenzhou People's Hospital, Wenzhou, 325000 People's Republic of China; Department of Health Examination, Hangzhou Normal University, Hangzhou, People's Republic of China
Dahui Wang (contributed equally to this paper): DigitalMedia & Interaction Research Center, Hangzhou Normal University, Wenzhou People's Hospital, Wenzhou, 325000 People's Republic of China
Zhigeng Pan (corresponding author, 443922077@qq.com): DigitalMedia & Interaction Research Center, Hangzhou Normal University, Wenzhou People's Hospital, Wenzhou, 325000 People's Republic of China; Institute of Industrial VR, Foshan University, Guangdong, People's Republic of China
Meiling Chen: DigitalMedia & Interaction Research Center, Hangzhou Normal University, Wenzhou People's Hospital, Wenzhou, 325000 People's Republic of China
Xinting Wang: DigitalMedia & Interaction Research Center, Hangzhou Normal University, Wenzhou People's Hospital, Wenzhou, 325000 People's Republic of China

First published: 29 March 2018
https://doi.org/10.1049/trit.2018.0005
Citations: 9

Abstract

This study explores a segmentation algorithm for item text data in health examination, especially single long text items. In the implementation, a large amount of historical health examination data is analysed. Using character statistics, the connection tightness value T_AB between each pair of adjacent characters is calculated. Three parameters are set: the candidate number N, the best position BP, and the balance weight BW. The total segmentation indexes SI are then calculated, which determine the segmentation position Pos. The optimal parameter values are determined by the method of information measurement.
Experimental results show that the accuracy rate is 78.6%, reaching 82.9% for the most frequently occurring text. The complexity of the algorithm is O(n). Because it uses no existing domain knowledge, it is very simple and fast. Executed repeatedly, it conveniently yields the characteristics of each single item of text data and, furthermore, distinguishes the expression preferences of different physicians for the same item. The assumption is verified that, without professional domain knowledge, a large amount of historical data can provide valuable clues for text understanding. The results of this research are being applied and verified in follow-up research in the field of health examination.

1 Introduction

Health information collection is the first step in the trilogy of health management and preventive treatment of disease in traditional Chinese medicine (TCM), and is the basis of subsequent health risk assessment and health intervention [1]. Health examination data is the most important source of health information and plays a pivotal role in the health management industry chain in China. A large amount of health examination data has already been obtained [2], but the valuable unstructured text data within it is difficult to use for automatic health assessment. Up to now, text data analysis and evaluation have mainly relied on manual string matching, which lacks automation and intelligence because comprehensive and meticulous matching is difficult, and which still requires manual checking, leading to low efficiency.

In China, health examination became popular after SARS in 2003. With the development of the social economy, the improvement of living standards, and people's increasing attention to their own health, the health examination industry has developed rapidly. Health examination is not only medical work but is also closely tied to commercial operation.
A large number of records have been accumulated over the past 10 years. These records, especially the text data, are not as strict or formal as clinical medical files. Mixed use and misuse of traditional Chinese medicine and Western medicine terminology, colloquial expression, vague concepts, and so on lead to poor-quality health examination records. It is difficult to analyse and utilise these text data, and little research has been carried out specifically on them. However, these physical examination records capture changes in people's health, especially for those who have regular annual examinations, and so have important potential value.

Our team is carrying out several studies related to health examination: construction of a knowledge graph in the field of health examination, development of a special input method for health examination results, design of an intelligent and automated method for evaluating health examination results, visualisation of health examination results, and so on. All of these studies need to analyse health examination results of text type. In previous attempts, we found that the tools and methods of clinical text analysis are not very applicable. Although health examination data is not standardised and there are a large number of individual categories and items, each specific item has its own characteristics, and the information it expresses is confined to a very limited range. What we need are the characteristics of each single item of text data in health examination and, furthermore, the expression preferences of different physicians for the same item. No relevant research on characterising the text item data in health examination has been found. A starting-point algorithm for the above studies is therefore needed, and it should be as simple as possible.
Existing domain knowledge is deliberately not used for the time being, to avoid over-constraining the algorithm and its results, because the purpose of the algorithm is the discovery of text features and knowledge. Clues are sought instead in the similarities and differences within a large sample of item data. The algorithm must be simple enough to be executed repeatedly: it will run again and again over a large number of existing and continuously emerging item data, and the personal data of different physicians may need to be analysed in real time. The algorithm does not pursue a perfect result in one pass; it will be continuously verified and improved through use and interaction with doctors, and will later be upgraded to use the verified knowledge to improve its text analysis ability.

This study presents such a simple starting algorithm. It analyses a large amount of historical health examination data using character statistics and information measurement. The goal is to search for the inherent regularities of the jargon of this specific field, and to explore appropriate algorithms and tools for encoding and analysing text data in health examination. It provides a basis for follow-up research.

1.1 Related work

A large amount of health information exists in natural language form, which is difficult to analyse and utilise. The analysis of medical texts for information extraction and knowledge discovery has been a focus of research. Spasić reported KneeTex, a system for extracting information on knee pathology from MRI reports, which is modelled by a set of sophisticated lexico-semantic rules with minimal syntactic analysis in combination with an ontology [3]. Nguyen assessed the utility of Medtex for automating cancer registry notifications from pathology HL7 messages [4]. Koopman automatically extracted ICD-10 classification information of cancers from free-text death certificates [5].
Yepes used machine learning to improve the performance of MeSH keyword indexing programs such as MTI [6]. Chard leveraged cloud-based approaches to address the poor accessibility, scalability, and flexibility of natural language processing (NLP) systems for medical text [7]. Botsis demonstrated a multilevel text mining approach for automatic rule-based text classification of adverse event reports that could potentially reduce human workload [8]. Li reported research on information extraction based on a domain ontology, which can improve the computer's ability to extract information and discover knowledge from electronic medical records in Chinese [9]. Nishimoto constructed a medical dictionary for ChaSen from the unified medical language system (UMLS), on the belief that retrieving transitional probabilities would improve the accuracy of parsing compound medical terms [10]. Zhou proposed a method and a prototype system for discovering implicit temporal assertions in medical text by applying discourse analysis as well as semantic and syntactic analysis, and by generating heuristic rules that encode the discovered domain and linguistic knowledge [11]. Yetisgenyildiz improved the efficiency of MEDLINE document classification by extracting medical phrases based on a medical knowledge base and NLP [12]. Niu treated the analysis of polarity information of clinical outcomes as a classification problem that could be solved by NLP and supervised machine learning [13]. Travers evaluated an emergency medical text processor, a system for cleaning chief complaint text data [14]. There are many similar studies in China, in which Chinese word segmentation methods are used [15]–[17], and the research field has been extended to traditional Chinese medicine [18]–[22].

As mentioned above, current research and applications in medical text processing are based on NLP, such as lexical, syntactic, and semantic analysis.
Ontologies, knowledge bases, and other medical expertise in specific areas are often used. The goal is usually to extract a small amount of specific information. Medical NLP is difficult to use, comprehensive domain knowledge is difficult to obtain and maintain, and such specific studies are difficult to extend to related fields. Reports on the analysis of text data in health examination are rare. The related works use specific domain knowledge to extract a small amount of special-purpose information from a large amount of raw data. The information obtained is limited and may omit important content, so this approach is not suitable for analysing health examination data or for discovering unknown knowledge and rules.

1.2 Data source

The data used in this paper came from the health examination department of a top-level first-grade hospital in Wenzhou, Zhejiang, China. Health examination has been carried out there for 20 years. Health examination software was introduced at the end of 2009, and electronic data has been saved for more than 7 years since then, at about 20,000 people per year. The software was developed by a medical software company in Hangzhou that has a relatively high market share. The data therefore reflects the common condition of health examination data in China.

1.3 Data status

Health examination results of 130,028 people are stored in the database. There are 11,380,790 rows in the detailed data table, and 599 items are involved. The items can be divided into three types according to the examination method: laboratory test, physical examination, and instrument check. The results are saved as numeric or textual data, as shown in Table 1.

Table 1. Data types, length, and freedom for different types of health examination items

Exam type    Number  Text  Subtotal  Text length    Choice freedom
laboratory   169     84    253       2.3 ± 1.41     6.0 ± 7.00
physical     81      168   249       4.0 ± 4.21     141.6 ± 566.37
instrument   2       95    97        35.6 ± 52.78   813.8 ± 2499.84
total        252     347   599       12.3 ± 31.18   292.8 ± 1397.42

Laboratory results are mainly numerical, and their text data is very short, with an average length of 2.3 characters and all within 5 characters. They also have a strictly limited range of input choices, with an average of only six kinds. Two-thirds of the physical examination results are of text type. They are also mainly short, but their input freedom varies greatly: often no more than ten kinds, but sometimes very high. Instrument check results are mainly of text type, and their length and input freedom increase significantly, as shown in Table 1 and Figs. 1 and 2.

Fig. 1. Length distribution of text data for different result types

Fig. 2. Freedom degrees for different result types

1.4 Problems

The difficulty of analysing and utilising health item data varies greatly with the data type. Numerical results are the easiest to use, because they always have reference ranges against which a given result can be judged normal or not, and even its degree of abnormality obtained. Most laboratory test results and some physical examination indicators are in this category. Text results of short length and limited input freedom are not so difficult, because the possible results can easily be listed and assessed separately. All the laboratory test results and many physical and instrument ones are of this type. The most difficult case is text data of long length and high input freedom, since there are no strict format specifications and the text can be input arbitrarily.
Current measures for analysing and utilising long text data include the following: (i) the data is simply ignored and not used; (ii) in addition to the original data, physicians are required to input an abbreviated copy that can be assessed relatively easily, which duplicates work and increases the burden on medical staff; (iii) manual reading and analysis; (iv) keyword matching, for which natural language is too flexible and complex, and for which it is difficult to list all the keywords comprehensively without strict input constraints, so that manual review remains necessary. Regular expressions have the same problem as keywords. These methods lack automation and intelligence, resulting in low efficiency.

In order to make better use of these texts, it is necessary to analyse the structures and rules of the data. The large amount of historical data accumulated in the physical examination system can play an important role. In this study, we explore methods for analysing long text data and, based on the historical health examination data, provide methods and tools for encoding, compression, structuring, analysis, and assessment, thus achieving more automatic and intelligent health assessment.

2 Data processing algorithm

Natural languages, especially Chinese, have a very high degree of freedom of expression. However, when applied to a specific context, the degree of freedom is limited. A health examination item describes a single physiological or test outcome, so its degree of freedom is clearly more restricted. Among the 347 text-type health examination items, those with input freedoms of 4, between 5 and 64, and more than 256 account for 32.3, 79.3, and 9.5%, respectively. A higher degree of freedom results in longer text; even so, there must exist contextual domain constraints and unique language fingerprints such as character frequencies, word frequencies, and their connection rules.
To support analysis and evaluation, the long unstructured texts should first be segmented, encoded, and structured. The information in a long unstructured text consists of its short sentences and the order in which they appear. First of all, the short sentences need to be analysed and segmented. Each sentence can be regarded as a piece of basic information comprising an item name and the corresponding value. Take the sentence 'Intrahepatic light spots are thickening and disorder' as an example: 'Intrahepatic light spots' should be regarded as its item name and '(are thickening and disorder)' as its value. After segmentation, the sentence is easier to encode and classify, and is ready for analysis and evaluation.

Based on the assumptions above, this study uses the large amount of historical health examination data to construct a text analysis algorithm based on character statistics and information measurement. The algorithm is implemented in C# and is illustrated below with the B ultrasound results of the liver.

2.1 Data preparation

To avoid affecting online medical services, the 11,380,790 rows of data are exported into a Microsoft LocalDB database, in a table named 'ExaminItemResults'. The main columns are shown in Table 2. Liver B ultrasound data is one of the most common types of long text, with the examination item number '050001' and a total of 82,772 rows.

Table 2. Column information for the table ExaminItemResults

Column name    Data type  Length  Description
CustomID       varchar    16      customer number
ItemID         varchar    12      examination item number
ItemName       varchar    60      examination item name
ItemResult     nvarchar   max     examination item result
ExaminDoctor   char       7       doctor number

2.2 Data loading and numerical substitution

In order to merge identical results, a structured query language (SQL) aggregation statement is used (code 1).
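An aggregation of the kind described can be sketched as follows. This is a minimal illustration in Python, assuming the column names of Table 2 and using the standard-library `sqlite3` module as a stand-in for Microsoft LocalDB; the authors' actual statement may differ. The helper names `count_distinct_results` and `demo_connection`, and the toy rows, are invented for the example.

```python
import sqlite3


def count_distinct_results(conn, item_id):
    """Merge identical result texts for one item and count each of them.

    Returns (result_text, count) pairs, most frequent first.
    """
    cur = conn.execute(
        "SELECT ItemResult, COUNT(*) AS n "
        "FROM ExaminItemResults "
        "WHERE ItemID = ? "
        "GROUP BY ItemResult "
        "ORDER BY n DESC, ItemResult",
        (item_id,),
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory table with a few invented rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ExaminItemResults ("
        "CustomID TEXT, ItemID TEXT, ItemName TEXT, "
        "ItemResult TEXT, ExaminDoctor TEXT)"
    )
    rows = [
        ("c001", "050001", "liver B ultrasound", "肝脏大小正常，包膜光滑", "d01"),
        ("c002", "050001", "liver B ultrasound", "肝脏大小正常，包膜光滑", "d02"),
        ("c003", "050001", "liver B ultrasound", "肝脏增大，肝静脉变细", "d01"),
        ("c001", "010001", "another item", "正常", "d03"),
    ]
    conn.executemany(
        "INSERT INTO ExaminItemResults VALUES (?, ?, ?, ?, ?)", rows
    )
    return conn
```

Calling `count_distinct_results(demo_connection(), "050001")` returns the two distinct liver results with their counts, the repeated one first. Grouping in the database rather than in application code keeps the amount of data loaded proportional to the number of distinct results.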
12,941 distinct results are returned from the database, of which the default normal result occurs most frequently, with a count of 41,383.

(Code 1)

The texts contain many measured values, such as the size of the liver or of a liver cyst, and these figures affect the classification. A regular expression is therefore used to identify all figures and replace them with the placeholder '┻', after which the number of result kinds reduces to 7438. The regular expression is as follows:

(Code 2)

Using code 3 below, all the text results are divided into 4518 kinds of sentences:

(Code 3)

2.3 Character frequency counting and segmentation

The connection tightness value T_AB between two adjacent characters A and B is calculated as follows. First, three frequencies are counted over all pairs of adjacent characters: F_A* is the frequency of pairs that start with A, F_*B the frequency of pairs that end with B, and F_AB the frequency of pairs that start with A and end with B. Three candidate formulas (1a)–(1c) are shown below, and the best-performing one is used.

(1a)
(1b)
(1c)

By appending an end tag to each sentence, a sentence yields as many T_AB values as it has characters. The T_AB values are then sorted in ascending order, and the N smallest are chosen as candidate positions at which to segment the sentence. The front parts produced by all candidate splits are counted and sorted by count in descending order, so that each candidate T_AB receives a front part order FO. With a new parameter BP, the best position, a balance index BI can also be calculated for each candidate position as follows:

(2)

where Pos is the split position in the sentence and Len is the length of the sentence. With another parameter BW, the balance weight, the sum split index SI of each candidate position can be calculated:

(3)

In each sentence, the candidate position with the largest SI is finally chosen as the segmentation position.
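The steps of Sections 2.2 and 2.3 can be sketched in Python as follows. The formulas for T_AB, BI, and SI are not available here, so the expressions in `tightness` and `segment` are assumptions chosen only to make the pipeline concrete, not the authors' equations. The two regular expressions, the end-tag symbol, the treatment of BP as a fraction of sentence length, and all function names are likewise hypothetical. The original implementation was written in C#.

```python
import math
import re
from collections import Counter

END = "\u0003"     # end-of-sentence tag (symbol chosen for this sketch)
PLACEHOLDER = "┻"  # numeric placeholder named in Section 2.2


def normalise(text):
    """Replace every integer or decimal number with the placeholder."""
    return re.sub(r"\d+(?:\.\d+)?", PLACEHOLDER, text)


def split_sentences(text):
    """Split a result into short sentences (assumed punctuation set)."""
    return [s for s in re.split(r"[，。；,;]+", text) if s]


def pair_counts(sentences):
    """Single pass over the corpus: count F_AB, F_A* and F_*B."""
    f_ab, f_a, f_b = Counter(), Counter(), Counter()
    for s in sentences:
        tagged = s + END
        for a, b in zip(tagged, tagged[1:]):
            f_ab[a, b] += 1
            f_a[a] += 1
            f_b[b] += 1
    return f_ab, f_a, f_b


def tightness(a, b, counts):
    """ASSUMED formula: F_AB / sqrt(F_A* x F_*B)."""
    f_ab, f_a, f_b = counts
    denom = math.sqrt(f_a[a] * f_b[b])
    return f_ab[a, b] / denom if denom else 0.0


def candidates(sentence, counts, n):
    """Split positions (1..len) after the n loosest adjacent pairs."""
    tagged = sentence + END
    scored = [
        (tightness(a, b, counts), pos)
        for pos, (a, b) in enumerate(zip(tagged, tagged[1:]), start=1)
    ]
    return [pos for _, pos in sorted(scored)[:n]]


def segment(sentences, n=2, bp=0.5, bw=1.0):
    """Return {sentence: (front part, latter part)}.

    ASSUMED scoring: BI = 1 - |Pos/Len - BP| with BP as a fraction,
    and SI = (1 - FO/N) + BW * BI, where FO is the 0-based rank of the
    candidate's front part by corpus-wide frequency.
    """
    counts = pair_counts(sentences)
    cands = {s: candidates(s, counts, n) for s in set(sentences)}
    # Count how often each front part is proposed across the corpus.
    front = Counter(s[:p] for s in sentences for p in cands[s])
    result = {}
    for s, positions in cands.items():
        ranked = sorted(positions, key=lambda p: -front[s[:p]])

        def si(p):
            fo = ranked.index(p)
            bi = 1.0 - abs(p / len(s) - bp)
            return (1.0 - fo / n) + bw * bi

        best = max(positions, key=si)
        result[s] = (s[:best], s[best:])
    return result
```

On the toy corpus `["肝静脉变细", "肝静脉清晰", "肝静脉正常"]` with `n=2`, `segment` splits the first sentence into `("肝静脉", "变细")`: the pairs inside 肝静脉 always co-occur and so are tight, while the pair that crosses into the value is loose. Counting the three frequency tables in one pass is what keeps the cost linear in the amount of text.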
2.4 Determination of the optimal value of parameters N, BP, and BW

After segmentation, each sentence can be classified by its front part, which represents the problem (KEY) that the sentence describes; the latter part represents the CHOICE that the sentence makes about that KEY. A dictionary is then built that stores all N KEYs and, for each KEY, all of its M CHOICEs. The storage space of the dictionary, SD, is calculated as follows:

(4)

Encoding and storing each sentence requires two parts: the first for the KEY code and the second for the CHOICE code. The storage space for all detailed sentences, SS, and the total storage space, ST, are calculated as follows:

(5)
(6)

Different values of SD, SS, and ST result from different values of the parameters N, BP, and BW. By sorting SD and ST in ascending order, their order ranks OSD and OST are obtained. OAVG is the average of OSD and OST. The optimal parameter values, 2, 9, and 0.8, are those with the minimum OAVG, as shown in Table 3.

Table 3. Storage space and order rank for different parameters

N  BW  BP   ST         SD       OST  OSD   OAVG
2  9   0.8  7,804,237  657,296  97   39    68
2  9   0.7  7,837,962  655,968  106  31    68.5
2  8   0.7  7,836,214  657,200  105  36    70.5
7  9   1    7,807,007  658,320  98   47    72.5
5  9   0.9  7,898,229  653,760  130  17.5  73.75
5  9   1    7,917,209  651,840  143  5     74
2  7   0.7  7,803,308  659,840  96   55.5  75.75
2  9   0.6  7,852,477  657,760  112  41    76.5
2  8   0.6  7,851,103  657,984  111  43    77
2  8   0.8  7,800,967  660,416  95   60    77.5
8  9   1    7,793,990  660,656  93   65    79

3 Experimental results and analysis

3.1 Segmentation results

The experimental results show that, for sentences that occur 10 or more times, the segmentation accuracy rate is 78.6%. As shown in Table 4, the weighted accuracy rate is 80.3%, reaching 82.9% for the most frequently occurring (more than 100 times) long text.

Table 4. Segmentation results for the most frequently appeared text

Short sentences                     Key (value)
包膜[a]光滑                          capsule (smooth)
呈散射状[a]回声                      scattering (echo)
分布[a]欠均匀                        distribution (uneven)
肝静脉[a]变细                        hepatic vein (thinning)
肝静脉[a]稍变细                      hepatic veins (slight thinning)
肝静脉[a]显示清晰                    hepatic veins (clear)
肝静脉[a]显示尚清                    hepatic veins (relatively clear)
肝内胆管[a]未见扩张                  intrahepatic bile duct (non-expansion)
肝内管道系统[a]显示欠清晰            intrahepatic duct system (display less clear)
肝内管道[a]显示欠清晰                intrahepatic duct (display less clear)
肝内外胆管[a]不扩张                  intra and extra hepatic bile duct (non-dilatation)
肝内血管[b]网络[a]显示清晰           intrahepatic vascular network (clear)
肝内血管[b]网络[a]显示尚清           intrahepatic vascular network (relatively clear)
肝实质[a]回声均匀                    liver parenchyma (uniform echo)
肝实质回声[a]稍细密增强              liver parenchyma echo (slightly enhanced)
肝实质回声[a]细密增强                liver parenchyma echo (enhanced)
肝脏大小[b]形态[a]正常               liver size and shape (normal)
肝脏大小[a]正常                      liver size (normal)
肝脏切面形态、大小[a]未见异常        liver size and shape (non-abnormalities)
肝脏[a]增大                          liver (enlargement)
管腔显示[a]清晰                      lumen (clear)
后方未见明显[a]声衰减                no obvious posterior (sound attenuation)
后声[a]衰减不明显                    back acoustic attenuation (not obvious)
门静脉[a]未见异常                    portal vein (non-abnormalities)
门脉系统[a]未见扩张                  portal vein system (no expansion)
内部回声[a]弥漫性增强增[b]密呈散射状 internal echo (diffusely enhanced, dense, scattering)
实质回声[a]较粗强                    substantial echo (relatively coarse and strong)
实质回声[a]均匀                      substantial echo (uniformity)
实质回声[a]细密增强                  substantial echo (fine reinforcement)
随深度增加[b]后方[a]回声逐渐衰减     rear with increase depth (gradually attenuated echo)
随着深度的[b]增加[a]逐渐衰减         with increase depth (gradually attenuated)
未见明显[a]占位                      no obvious (occupying mass)
右肝内见一枚[a][c] mm的强回声斑伴声影 right liver ([c] mm strong echo spot with acoustic shadow)
右肝斜径[a]约[c] mm                  right hepatic oblique diameter (about [c] mm)
右叶斜径[a][c] mm                    right hepatic oblique diameter ([c] mm)

[a] Accurate segmentation position.
[b] Not accurate segmentation position.
[c] Digital placeholder.

3.2 Algorithm efficiency

The algorithm is highly efficient: its complexity is O(n) in the number of data rows n.
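Most of the running time goes to the parameter selection of Section 2.4, which repeats the segmentation for every parameter combination and then ranks the outcomes. The ranking step can be sketched in Python as follows. The fractional ranks in Table 3 (for example 17.5) suggest that tied values share an averaged rank; that tie rule, the layout of the `runs` list, and the names `average_ranks` and `select_parameters` are assumptions of this sketch rather than details given in the text.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def select_parameters(runs):
    """Pick the parameter set with the smallest OAVG.

    runs: list of (params, ST, SD) tuples, one per parameter combination.
    Returns (params, OAVG) for the best combination.
    """
    ost = average_ranks([st for _, st, _ in runs])
    osd = average_ranks([sd for _, _, sd in runs])
    oavg = [(a + b) / 2 for a, b in zip(ost, osd)]
    best = min(range(len(runs)), key=lambda i: oavg[i])
    return runs[best][0], oavg[best]
```

The averaging can be checked against the first row of Table 3, where (97 + 39) / 2 = 68 matches the OAVG column. Ranking ST and SD separately before averaging keeps the much larger ST values from dominating the choice.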
A demo was developed in C# with a WPF interface in the VS.NET 2015 development environment. Excluding the time to load data from the database, the first run takes 170 ms in a debugging environment with x64 Windows 10, an i5-4590 CPU, and 4 GB of memory, and later runs take only 90 ms. Determining the optimal values of N, BP, and BW requires up to 810 executions of the algorithm, consuming about 46,912 ms.

3.3 Limitations and further improvement

This algorithm segments historical text data in health examination based only on character statistics and information measurement, without manual intervention. It runs fast and efficiently, and achieves the expected results. However, the algorithm has limitations, because its accuracy still needs improvement. The possible reasons include: (i) the results are input arbitrarily, causing irregularities and errors; (ii) some sentences occur too rarely to provide the language clues needed by the algorithm; (iii) some sentences do not match the assumed KEY-CHOICE pattern; (iv) Chinese syntax and semantics are too complicated; (v) the algorithm only measures and compares the connection tightness between two characters.

In order to improve the segmentation accuracy, future work may proceed as follows: (i) introducing professional Chinese word segmentation and other NLP tools; (ii) maintaining a custom dictionary to correct abnormal T_AB values; (iii) standardising physician input and screening for high-quality data; (iv) considering connection tightness across more than two characters.

3.4 Practical application

This algorithm has achieved the expected goals, laying a good foundation for follow-up work and research. Based on this algorithm, several studies in our team are progressing smoothly.
With this algorithm, we obtain the structural characteristics of all the individual text item data and construct a mini knowledge graph for each item. Physicians can use these mini graphs to input text item data. Applying the text segmentation results greatly reduces the degree of input freedom, so input can be done by sliding a finger on a touch screen. Because the algorithm can analyse and handle a physician's personal preferences, it can greatly improve the convenience and speed of Chinese character input. When this input method is in use, accurately segmented results are touched often, while poor ones are touched seldom or never. The accuracy of segmentation can therefore be judged through use and interaction with physicians. Later, we will develop a new algorithm that judges the segmentation results using the physicians' interaction information and helps this algorithm improve its text segmentation ability.

With this algorithm, the original unstructured textual data can also be structured. This reduces the degree of freedom of text data in health examination and thus greatly reduces the difficulty of analysing it. The algorithm therefore also contributes substantially to the work of setting up an intelligent and automated method for evaluating health examination results. The above studies will be reported later.

4 Conclusion

This study uses historical health examination data to segment long text in health examination based on character statistics and information measurement. The assumption is verified that, without professional domain knowledge, a large amount of historical data can provide valuable clues for text understanding.
The toolkit can be used in automatic data analysis, encoding, lossless compression, encryption, structured storage, and information classification, and can thus make health assessment more automatic and intelligent. The results of this research are being applied and verified in the work of our team, such as the construction of a knowledge graph in the field of health examination, the development of a special input method for health examination results, the design of an intelligent and automated method for evaluating health examination results, the visualisation of health examination results, and so on.

Possible applications of the algorithm include: (i) automatic encoding and compression of text data. In the experiment above, each liver B ultrasound result needs an average of only 5.89 bytes of storage, a significant reduction from the original 56 bytes. The compression is lossless and faithful to the physician's input, so the data can be completely recovered. Compression can greatly reduce the load on the network and the database system; (ii) with this encoding, a certain degree of encryption can be achieved to improve the safety of medical information; (iii) with this encoding, the texts are better structured and their freedom is greatly reduced, which can lead to better information classification, evaluation, and analysis.

5 Acknowledgments

This project is supported by the Health and Family Planning Commission of Zhejiang Province, Wenzhou Science & Technology Bureau, and Wenzhou People's Hospital.

6 References

[1] Guo Q., Wang P.Y., Wen D.L., et al.: 'Health management' (People's Medical Publishing House, Beijing, 2015)
[2] Guo Q., Ren J.P., Meng F.L., et al.: 'Report on health service development in China 2015' (People's Medical Publishing House, Beijing, 2016)
[3] Spasić I., Zhao B., Jones C.B., et al.: 'Kneetex: an ontology-driven system for information extraction from MRI reports', J. Biomed. Semant., 2015, 6, (1), pp. 1–26 (doi: 10.1186/s13326-015-0033-1)
[4] Nguyen A.N., Moore J., O'Dwyer J., et al.: 'Automatic cancer registry notifications: validation of a medical text analytics system for identifying patients with cancer from a state-wide pathology repository'. AMIA Symp., Chicago, USA, 2016, pp. 964–973
[5] Koopman B., Zuccon G., Nguyen A., et al.: 'Automatic ICD-10 classification of cancers from free-text death certificates', Int. J. Med. Inform., 2015, 84, (11), pp. 956–965 (doi: 10.1016/j.ijmedinf.2015.08.004)
[6] Yepes A.J.J., Mork J.G., Demnerfushman D., et al.: 'Comparison and combination of several MeSH indexing approaches'. 2013 AMIA Annual Symp. Proc., 2013, pp. 709–718
[7] Chard K., Russell M., Lussier Y.A., et al.: 'A cloud-based approach to medical NLP'. 2011 AMIA Symp., AMIA Annual Symp. Proc., Chicago, USA, 2011, (3), p. 207
[8] Botsis T., Nguyen M.D., Woo E.J., et al.: 'Text mining for the vaccine adverse event reporting system: medical text classification using informative feature selection', J. Am. Med. Inf. Assoc., 2011, 18, (5), p. 631 (doi: 10.1136/amiajnl-2010-000022)
[9] Li Y., Bao P., Xue W.: 'Research on information extraction of electronic medical records in Chinese', J. Biomed. Eng., 2010, 27, (27), pp. 757–762
[10] Nishimoto N., Terae S., Uesugi M., et al.: 'Development of a medical-text parsing algorithm based on character adjacent probability distribution for Japanese radiology reports', Methods Inf. Med., 2008, 47, (6), pp. 513–521
[11] Zhou L., Parsons S., Hripcsak G.: 'Handling implicit and uncertain temporal information in medical text'. AMIA Annual Symp. Proc./AMIA Symp., AMIA Symp., Chicago, USA, 2006, p. 1158
[12] Yetisgenyildiz M., Pratt W.: 'The effect of feature representation on MEDLINE document classification', AMIA Annual Symp. Proc., 2005, vol. 2005, pp. 849–853
[13] Niu Y., Zhu X., Li J., et al.: 'Analysis of polarity information in medical text', Proc. Amia Annual Symp., Washington, USA, 2005, pp. 570–574
[14] Travers D.A., Haas S.W.: 'Evaluation of emergency medical text processor, a system for cleaning chief complaint text data', Acad. Emerg. Med., Off. J. Soc. Acad. Emerg. Med., 2004, 11, (11), pp. 1170–1176 (doi: 10.1111/j.1553-2712.2004.tb00701.x)
[15] Li G.L., Chen X.L., Xia D., et al.: 'Research on Chinese medical records segmentation method', Chin. J. Biomed. Eng., 2016, 35, (4), pp. 477–481
[16] Liu Q.Q.: 'Design and implementation of structured processing system for pathological examination text' (Donghua University, Shanghai, China, 2016)
[17] Li X.Z.: 'Research and application of case-based reasoning system for medical diagnosis and treatment based on text mining' (Guangdong University of Technology, Guangzhou, Guangdong, China, 2011)
[18] Cheng S.P., Jia D.M., Zhou P.S., et al.: 'Medication law of bone destruction in rheumatoid arthritis treated with Chinese medicine based on text mining', J. Tradit. Chin. Med., 2016, 57, (11), pp. 970–974
[19] Ye H., Ji D.H.: 'Research on symptom and medicine information abstraction of TCM book Jin Gui Yao Lue based on conditional random field', Chin. J. Libr. Inf. Sci. Tradit. Chin. Med., 2016, 40, (5), pp. 14–17
[20] Feng L.Z.: 'Research on construction method of large scale TCM clinical medical record corpus based on named entity extraction' (Beijing Jiaotong University, Beijing, China, 2015)
[21] Wen T.C., Li P.: 'The structuring index system for famous TCM doctors medical record based on XML', China Digit. Med., 2013, 8, (7), pp. 22–24
[22] Shen S.S., Zheng G., Zhan J.P., et al.: 'To investigate the rules of symptom, syndrome, strategy and formula in the treatment of rheumatoid arthritis on date mining', Rheum. Arthritis, 2013, 10, pp. 5–9
