At present, the biomedical literature is enormous in quantity and continues to grow rapidly, so automatic tools for processing and analyzing it are urgently needed. With current methods, model training time increases sharply when dealing with large-scale training samples, so improving the efficiency of named entity recognition over biomedical big data has become one of the key problems in biomedical text mining. To improve recognition performance and reduce training time, this paper implements the model training process on MapReduce and, based on the above study of time-space Hadoop MapReduce scheduling algorithms, proposes an optimized two-phase recognition method using conditional random fields (CRFs) with new feature sets. It is a case study in biomedical big data mining. Compared with traditional methods and general MapReduce data mining, our approach makes an originally inefficient algorithm tolerable in training time by integrating the above scheduling algorithms.

In the past several years, massive data have been accumulated and stored in different forms, whether in business enterprises, scientific research institutions, or government agencies. Faced with ever more rapidly expanding databases, however, people cannot readily obtain and understand the valuable knowledge hidden within big data. The same situation has arisen in the biomedical field. As one of the areas of greatest concern, especially after the Human Genome Project (HGP), the biomedical literature has grown in large numbers, reaching an average of 600,000 or more publications per year. Meanwhile, the completion of the HGP has produced large amounts of human gene sequence data. In addition, with the fast development of science and technology in recent years, more and more large-scale biomedical experimental techniques, which can reveal the laws of life activities at the molecular level, must use big data from the entire genome or the entire proteome, resulting in a huge amount of biological data. These massive biological data contain a wealth of biological information, including significant gene expression patterns and protein-protein interactions. What is more, disease networks, which contain hidden information associated with diseases and give biomedical scientists a basis for hypothesis generation, are constructed by mining disease relationships in these biomedical data.

However, even the most basic requirements of biomedical big data processing are difficult to meet efficiently. For example, keyword searching over biomedical big data or the Internet can only return long lists of loosely relevant documents, and its accuracy is not high, so much of the valuable information contained in the text cannot be shown to people directly.

Biomedical named entity recognition (Bio-NER) is the first and a critical step in biomedical big data mining. It locates atomic elements with special significance in biomedical text and classifies them into predefined categories, helping molecular biologists recognize and classify professional instances and terms such as protein, DNA, RNA, cell_line, and cell_type. A Bio-NER system takes an unannotated block of text and produces an annotated block that highlights where the biomedical named entities are. However, because of many properties unique to the biomedical domain, such as the unstable quantity of names, non-unified naming rules, complex forms, and ambiguity, Bio-NER is not yet mature; in particular, it is time-consuming. Most current Bio-NER systems are based on machine learning and need multiple iterative calculations over corpus data. They are therefore computationally intensive, which seriously increases recognition time, including both model training and inference. For example, training a CRFs model on the Genia4ER training corpus, which is only about 14 MB, takes almost 5 hours. How, then, do we confront biomedical text data of tens of thousands of times this volume? How do we cope with an unbearably long wait for recognition? It is natural to turn to distributed and parallel computing to solve the problem.
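To make the input/output contract above concrete, the following minimal Python sketch tags a toy sentence with the entity categories mentioned above in the common BIO scheme. The tiny phrase lexicon merely stands in for a trained CRF model; it is purely illustrative and is not the system proposed in this paper.

```python
# Minimal illustration of the Bio-NER contract: unannotated tokens in,
# BIO-tagged tokens out. The lexicon below is a hypothetical stand-in
# for a trained CRF model.
LEXICON = {
    ("IL-2",): "protein",
    ("NF-kappa", "B"): "protein",
}

def toy_bio_ner(tokens):
    """Tag tokens in BIO format: B-<type> begins an entity,
    I-<type> continues it, and O marks tokens outside any entity."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for phrase, etype in LEXICON.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                tags[i] = "B-" + etype
                for j in range(i + 1, i + len(phrase)):
                    tags[j] = "I-" + etype
                i += len(phrase) - 1
                break
        i += 1
    return list(zip(tokens, tags))

print(toy_bio_ner("IL-2 expression requires NF-kappa B activation".split()))
# [('IL-2', 'B-protein'), ('expression', 'O'), ('requires', 'O'),
#  ('NF-kappa', 'B-protein'), ('B', 'I-protein'), ('activation', 'O')]
```

To suggest why the MapReduce model fits the training bottleneck, the sketch below mimics one iteration of CRF training, in which per-sentence statistics are summed over the whole corpus: mappers compute partial sums over data shards and a reducer combines them. The shard layout and the stand-in statistic are assumptions for illustration, not the paper's actual Hadoop job.

```python
from functools import reduce

def map_partial(shard):
    """Mapper: a partial statistic over one shard (here, a token count,
    standing in for per-shard feature expectations in real CRF training)."""
    return sum(len(sentence) for sentence in shard)

def reduce_sum(a, b):
    """Reducer: combine partial results from two mappers."""
    return a + b

# Two hypothetical shards of tokenized sentences spread across workers.
shards = [
    [["IL-2", "expression"], ["NF-kappa", "B", "activation"]],
    [["CD28", "signalling"]],
]
print(reduce(reduce_sum, (map_partial(s) for s in shards)))  # 7 tokens total
```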