Multilingual document analysis & recognition

  1. Vast diversity of document types:
    • languages & writing systems
    • typefaces and handwriting styles
    • text sizes, e.g. more than one size per page
    • page layouts
    • overlapping regions, e.g. handwritten (HW) notes on machine print (MP)
    • orientation, e.g. right side up, upside down
    • skew, shear, & other geometric deformations
    • image degradations: blur, thresholding, additive noise, etc.
  2. Variety of image types:
    • color, grey, black-and-white
    • compressed vs uncompressed; and if compressed, lossy or lossless
    • carefully scanned with controlled illumination vs carelessly snapped
    • scanned flat vs in perspective, curved, etc.
  3. Time and space optimization:

    Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative training sets, which can be dauntingly large. In investigating document content extraction systems, we have demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest neighbor (kNN) classifiers sped up using hashed K-d trees. Building on hashed K-d trees, we have developed:

    • Online bin-decimation, for coping with training sets that are too big to fit in main memory, which enforces an approximate upper bound on the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to be accomplished online and in linear time, while reading the training data exactly once.
    • An active learning algorithm, which selects a small subset of the training data, matched to each particular test set, with the aim of improving speed without loss of accuracy.
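
    As a rough illustration of the ideas above, the following Python sketch stores training samples in hash bins addressed by a few axis-aligned cuts, caps each bin while reading the data once, and classifies by searching only the bin a test sample falls into. It is a toy under stated assumptions, not the system described here: the class HashedBins, its methods, the cut positions, and the labels are all hypothetical. The cap is enforced with plain per-bin reservoir sampling, which is swapped in for the adaptive statistical technique (whose details are not given here) and yields an exact rather than an approximate bound. The bins_for helper is only one naive reading of "a subset matched to each test set".

```python
import random
from collections import Counter, defaultdict


class HashedBins:
    """Toy hashed-bin store (hypothetical): one bit per (dimension, threshold) cut."""

    def __init__(self, cuts, cap, seed=0):
        self.cuts = cuts                # list of (dimension index, threshold)
        self.cap = cap                  # maximum samples kept per bin
        self.bins = defaultdict(list)   # bin key -> [(features, label), ...]
        self.seen = Counter()           # bin key -> samples offered so far
        self.rng = random.Random(seed)

    def key(self, x):
        # Bin address: which side of each cut the sample falls on.
        return tuple(x[d] > t for d, t in self.cuts)

    def add(self, x, label):
        # Assumption: reservoir sampling per bin (one pass, O(1) per sample).
        k = self.key(x)
        self.seen[k] += 1
        bucket = self.bins[k]
        if len(bucket) < self.cap:
            bucket.append((x, label))
        else:
            j = self.rng.randrange(self.seen[k])
            if j < self.cap:
                bucket[j] = (x, label)

    def classify(self, x, k=3):
        # Approximate kNN: search only the bin that x hashes to.
        bucket = self.bins.get(self.key(x), [])
        if not bucket:
            return None
        nearest = sorted(
            bucket,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)),
        )[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    def bins_for(self, test_points):
        # Assumption: keep only the bins that some test sample lands in.
        keys = {self.key(x) for x in test_points}
        return {k: self.bins[k] for k in keys if k in self.bins}


# Synthetic 2-D data with made-up labels, streamed exactly once.
store = HashedBins(cuts=[(0, 0.5), (1, 0.5)], cap=50)
data_rng = random.Random(1)
for _ in range(10000):
    x = (data_rng.random(), data_rng.random())
    store.add(x, "machine_print" if x[0] > 0.5 else "handwriting")

print(max(len(b) for b in store.bins.values()))  # never exceeds cap
print(store.classify((0.9, 0.2)))
```

    Searching a single bin is what makes the lookup approximate: a true nearest neighbor lying just across a cut is missed. The per-bin cap trades some of that accuracy for bounded memory, since each bin keeps at most cap samples no matter how many are streamed past it.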