Scaling up data curation using deep learning: An application to literature triage in genomic variation resources

Authors / Contributors:
[K. Lee, M.L. Famiglietti, A. McMahon, C.H. Wei, JAL MacArthur, S. Poux, L. Breuza, A. Bridge, F. Cunningham, I. Xenarios, Z. Lu]
Place, publisher, year:
2018
Contained in:
PLoS computational biology, 14/8(2018-08), e1006390
Format:
Article (online)
ID: 528788035
LEADER caa a22 4500
001 528788035
003 CHVBK
005 20190303133655.0
007 cr unu---uuuuu
008 180924e201808 xx s 000 0 eng
024 7 0 |a 10.1371/journal.pcbi.1006390  |2 doi 
035 |a (SERVAL)BIB_916AEEE29E4F 
091 |a 30102703  |b pmid 
091 |a 000443298500034  |b isiid 
245 0 0 |a Scaling up data curation using deep learning: An application to literature triage in genomic variation resources  |h [Electronic data]  |c [K. Lee, M.L. Famiglietti, A. McMahon, C.H. Wei, JAL MacArthur, S. Poux, L. Breuza, A. Bridge, F. Cunningham, I. Xenarios, Z. Lu] 
520 3 |a Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases. 
700 1 |a Lee  |D K.  |4 aut 
700 1 |a Famiglietti  |D M.L.  |4 aut 
700 1 |a McMahon  |D A.  |4 aut 
700 1 |a Wei  |D C.H.  |4 aut 
700 1 |a MacArthur  |D JAL  |4 aut 
700 1 |a Poux  |D S.  |4 aut 
700 1 |a Breuza  |D L.  |4 aut 
700 1 |a Bridge  |D A.  |4 aut 
700 1 |a Cunningham  |D F.  |4 aut 
700 1 |a Xenarios  |D I.  |4 aut 
700 1 |a Lu  |D Z.  |4 aut 
773 0 |t PLoS computational biology  |g 14/8(2018-08), e1006390  |q 14:8|1 2018  |2 14 
908 |D 1  |a article  |2 serval 
950 |B SERVAL  |P 700  |E 1-  |a Lee  |D K.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Famiglietti  |D M.L.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a McMahon  |D A.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Wei  |D C.H.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a MacArthur  |D JAL  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Poux  |D S.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Breuza  |D L.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Bridge  |D A.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Cunningham  |D F.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Xenarios  |D I.  |4 aut 
950 |B SERVAL  |P 700  |E 1-  |a Lu  |D Z.  |4 aut 
950 |B SERVAL  |P 773  |E 0-  |t PLoS computational biology  |g 14/8(2018-08), e1006390  |q 14:8|1 2018  |2 14 
898 |a BK010053  |b XK010053  |c XK010000 
949 |B SERVAL  |F SERVAL  |b SERVAL  |j article