<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
  <leader>     nad a22        4500</leader>
  <controlfield tag="001">555163199</controlfield>
  <controlfield tag="005">20190202120452.0</controlfield>
  <controlfield tag="007">cr unu---uuuuu</controlfield>
  <controlfield tag="008">190202s2018    xx      s     000 0 eng  </controlfield>
  <datafield tag="024" ind1="7" ind2="0">
   <subfield code="a">10.5167/uzh-162394</subfield>
   <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
   <subfield code="a">(ZORA)oai:www.zora.uzh.ch:162394</subfield>
  </datafield>
  <datafield tag="084" ind1=" " ind2=" ">
   <subfield code="a">000</subfield>
   <subfield code="2">ddc</subfield>
  </datafield>
  <datafield tag="084" ind1=" " ind2=" ">
   <subfield code="a">410</subfield>
   <subfield code="2">ddc</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
   <subfield code="a">Amrhein</subfield>
   <subfield code="D">Chantal</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
   <subfield code="a">Supervised OCR Error Detection and Correction Using Statistical and Neural Machine Translation Methods</subfield>
   <subfield code="h">[Electronic data]</subfield>
   <subfield code="c">[Chantal Amrhein, Simon Clematide]</subfield>
  </datafield>
  <datafield tag="506" ind1=" " ind2=" ">
   <subfield code="a">openAccess</subfield>
   <subfield code="2">eu-repo</subfield>
  </datafield>
  <datafield tag="520" ind1="3" ind2=" ">
   <subfield code="a">Optical character recognition (OCR) errors hamper the indexing of digitized historical texts. To explore the effectiveness of new strategies for OCR post-correction, this article focuses on character-based machine translation methods, specifically neural machine translation (NMT) and statistical machine translation (SMT). Using the ICDAR 2017 data set on OCR post-correction for English and French, we experiment with different strategies for error detection and error correction. We analyze how OCR post-correction with NMT can profit from additional information and show that SMT and NMT can benefit from each other for these tasks. An ensemble of our models achieved the best performance in the ICDAR 2017 error correction subtask and performed competitively in error detection. However, our experimental results also suggest that tuning supervised learning for OCR post-correction of texts from different sources, text types (periodicals and monographs), time periods and languages is a difficult task: the data on which the MT systems are trained have a large influence on which methods and features work best. Conclusive and generally applicable insights are hard to achieve.</subfield>
  </datafield>
  <datafield tag="690" ind1=" " ind2="7">
   <subfield code="a">Institute of Computational Linguistics</subfield>
   <subfield code="2">zora</subfield>
  </datafield>
  <datafield tag="690" ind1=" " ind2="7">
   <subfield code="a">OCR post-correction; Machine Learning; Neural Machine Translation; Statistical Machine Translation</subfield>
   <subfield code="2">zora</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2=" ">
   <subfield code="a">Clematide</subfield>
   <subfield code="D">Simon</subfield>
   <subfield code="e">joint author</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2=" ">
   <subfield code="t">Journal for Language Technology and Computational Linguistics (JLCL)</subfield>
   <subfield code="g">33(1):49-76</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="0">
   <subfield code="u">https://www.zora.uzh.ch/id/eprint/162394/1/AmrheinClematide2018.pdf</subfield>
   <subfield code="q">application/pdf</subfield>
   <subfield code="z">WWW backlink to the repository</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="2">
   <subfield code="z">Online access via WWW</subfield>
   <subfield code="q">application/pdf</subfield>
   <subfield code="u">https://jlcl.org/content/2-allissues/1-heft1-2018/jlcl_2018-1_3.pdf</subfield>
   <subfield code="B">ZORA</subfield>
  </datafield>
  <datafield tag="908" ind1=" " ind2=" ">
   <subfield code="D">1</subfield>
   <subfield code="a">Journal Article</subfield>
   <subfield code="z">PeerReviewed</subfield>
   <subfield code="2">zora</subfield>
  </datafield>
  <datafield tag="909" ind1=" " ind2="7">
   <subfield code="a">SNSF/Projectfunding/CRSII5_173719/CH</subfield>
   <subfield code="2">zora grantAgreement</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="P">856</subfield>
   <subfield code="E">40</subfield>
   <subfield code="u">https://www.zora.uzh.ch/id/eprint/162394/1/AmrheinClematide2018.pdf</subfield>
   <subfield code="q">application/pdf</subfield>
   <subfield code="z">WWW backlink to the repository</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="P">856</subfield>
   <subfield code="E">42</subfield>
   <subfield code="z">Online access via WWW</subfield>
   <subfield code="q">application/pdf</subfield>
   <subfield code="u">https://jlcl.org/content/2-allissues/1-heft1-2018/jlcl_2018-1_3.pdf</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="P">100</subfield>
   <subfield code="E">1-</subfield>
   <subfield code="a">Amrhein</subfield>
   <subfield code="D">Chantal</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="P">700</subfield>
   <subfield code="E">1-</subfield>
   <subfield code="a">Clematide</subfield>
   <subfield code="D">Simon</subfield>
   <subfield code="e">joint author</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="P">773</subfield>
   <subfield code="E">0-</subfield>
   <subfield code="t">Journal for Language Technology and Computational Linguistics (JLCL)</subfield>
   <subfield code="g">33(1):49-76</subfield>
  </datafield>
  <datafield tag="898" ind1=" " ind2=" ">
   <subfield code="a">BK010053</subfield>
   <subfield code="b">XK010053</subfield>
   <subfield code="c">XK010000</subfield>
  </datafield>
  <datafield tag="949" ind1=" " ind2=" ">
   <subfield code="B">ZORA</subfield>
   <subfield code="F">ZORA</subfield>
   <subfield code="b">ZORA</subfield>
   <subfield code="j">Journal Article</subfield>
   <subfield code="c">openAccess</subfield>
  </datafield>
 </record>
</collection>
