<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
  <leader>     caa a22        4500</leader>
  <controlfield tag="001">445379685</controlfield>
  <controlfield tag="003">CHVBK</controlfield>
  <controlfield tag="005">20180317143013.0</controlfield>
  <controlfield tag="007">cr unu---uuuuu</controlfield>
  <controlfield tag="008">170323e20110601xx      s     000 0 eng  </controlfield>
  <datafield tag="024" ind1="7" ind2="0">
   <subfield code="a">10.1007/s10032-010-0133-5</subfield>
   <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
   <subfield code="a">(NATIONALLICENCE)springer-10.1007/s10032-010-0133-5</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
   <subfield code="a">Reynaert</subfield>
   <subfield code="D">Martin</subfield>
   <subfield code="u">Tilburg Centre for Cognition and Communication, Tilburg University, Kamer D 342, P.O. Box 90153, 5000 LE, Tilburg, Netherlands</subfield>
   <subfield code="4">aut</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
   <subfield code="a">Character confusion versus focus word-based correction of spelling and OCR variants in corpora</subfield>
   <subfield code="h">[Elektronische Daten]</subfield>
   <subfield code="c">[Martin Reynaert]</subfield>
  </datafield>
  <datafield tag="520" ind1="3" ind2=" ">
   <subfield code="a">We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (ticcl) is compared to its focus word-based counterpart and evaluated on 6years' worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk' show that the system is not sensitive to domain variation.</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
   <subfield code="a">The Author(s), 2010</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2=" ">
   <subfield code="t">International Journal on Document Analysis and Recognition (IJDAR)</subfield>
   <subfield code="d">Springer-Verlag</subfield>
   <subfield code="g">14/2(2011-06-01), 173-187</subfield>
   <subfield code="x">1433-2833</subfield>
   <subfield code="q">14:2&lt;173</subfield>
   <subfield code="1">2011</subfield>
   <subfield code="2">14</subfield>
   <subfield code="o">10032</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="0">
   <subfield code="u">https://doi.org/10.1007/s10032-010-0133-5</subfield>
   <subfield code="q">text/html</subfield>
   <subfield code="z">Onlinezugriff via DOI</subfield>
  </datafield>
  <datafield tag="908" ind1=" " ind2=" ">
   <subfield code="D">1</subfield>
   <subfield code="a">research-article</subfield>
   <subfield code="2">jats</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">NATIONALLICENCE</subfield>
   <subfield code="P">856</subfield>
   <subfield code="E">40</subfield>
   <subfield code="u">https://doi.org/10.1007/s10032-010-0133-5</subfield>
   <subfield code="q">text/html</subfield>
   <subfield code="z">Onlinezugriff via DOI</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">NATIONALLICENCE</subfield>
   <subfield code="P">100</subfield>
   <subfield code="E">1-</subfield>
   <subfield code="a">Reynaert</subfield>
   <subfield code="D">Martin</subfield>
   <subfield code="u">Tilburg Centre for Cognition and Communication, Tilburg University, Kamer D 342, P.O. Box 90153, 5000 LE, Tilburg, Netherlands</subfield>
   <subfield code="4">aut</subfield>
  </datafield>
  <datafield tag="950" ind1=" " ind2=" ">
   <subfield code="B">NATIONALLICENCE</subfield>
   <subfield code="P">773</subfield>
   <subfield code="E">0-</subfield>
   <subfield code="t">International Journal on Document Analysis and Recognition (IJDAR)</subfield>
   <subfield code="d">Springer-Verlag</subfield>
   <subfield code="g">14/2(2011-06-01), 173-187</subfield>
   <subfield code="x">1433-2833</subfield>
   <subfield code="q">14:2&lt;173</subfield>
   <subfield code="1">2011</subfield>
   <subfield code="2">14</subfield>
   <subfield code="o">10032</subfield>
  </datafield>
  <datafield tag="900" ind1=" " ind2="7">
   <subfield code="a">Metadata rights reserved</subfield>
   <subfield code="b">Springer special CC-BY-NC licence</subfield>
   <subfield code="2">nationallicence</subfield>
  </datafield>
  <datafield tag="898" ind1=" " ind2=" ">
   <subfield code="a">BK010053</subfield>
   <subfield code="b">XK010053</subfield>
   <subfield code="c">XK010000</subfield>
  </datafield>
  <datafield tag="949" ind1=" " ind2=" ">
   <subfield code="B">NATIONALLICENCE</subfield>
   <subfield code="F">NATIONALLICENCE</subfield>
   <subfield code="b">NL-springer</subfield>
  </datafield>
 </record>
</collection>
