Document Analysis Algorithm Contributions in End-to-End Applications

Short Description

This contest aims to provide a metric that indicates the influence of individual document analysis tools on overall end-to-end applications. Contestants are provided with a full, working pipeline that operates on a raw document page image and proceeds to extract some final information. The pipeline is built from clearly identified analysis stages (e.g. binarization, skew detection, layout analysis, OCR ...) that have a formalized input and output. Contestants are invited to contribute their own algorithms as an alternative to one or more of the initially provided stages, and evaluation will be based on the overall impact of the contributed algorithm on the final (end-of-pipeline) result.
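
Conceptually, the pipeline can be pictured as a chain of functions, each consuming the output of the previous stage. The following Python sketch is purely schematic: the actual contest pipeline is a Taverna workflow of web services (described below), and the function names and placeholder bodies here are assumptions made only to illustrate how a single stage can be swapped for a contestant's implementation.

    # Schematic sketch only; the actual contest pipeline is distributed as a
    # Taverna workflow of web services. All stage bodies are placeholders.

    def binarize(image_bytes: bytes) -> bytes:
        """Stage 1: raw page image -> bitonal image."""
        return image_bytes                                   # placeholder

    def locate_text(bitonal_bytes: bytes) -> bytes:
        """Stage 2: bitonal image -> image keeping only the text regions."""
        return bitonal_bytes                                 # placeholder

    def ocr(text_image_bytes: bytes) -> str:
        """Stage 3: text image -> plain-text transcription."""
        return ""                                            # placeholder

    def tag_named_entities(text: str) -> str:
        """Stage 4: plain text -> text enriched with named-entity labels."""
        return text                                          # placeholder

    def run_pipeline(image_bytes: bytes) -> str:
        """End-to-end run; the contest measures how replacing one stage
        changes this final output."""
        return tag_named_entities(ocr(locate_text(binarize(image_bytes))))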

  1. How to Participate

    Participants will need to provide access to one (or more) of the following algorithm implementations for this competition. Multiple entries for the same or for different categories are allowed:

    1. Document Binarization:

      programs contributed in this category should take any image as input and produce a binarized version of this image as output. The output file should be in an open and commonly used bitonal format (.tif, .pbm, ...); a minimal sketch of this input/output contract is given after this list.

    2. Layout Analysis/Text Localization:

      programs contributed in this category should take any image as input (or require that the input image be bitonal) and produce as output either a single image, stripped of images, illustrations and, generally speaking, anything that is not text, or a set of images corresponding to the text blocks of the original image.

    3. OCR:

      programs contributed in this category should take any image as input (or require that the input image be bitonal) and produce a transcription of the text contained in the image. The output should be a plain text file without any formatting characters other than whitespace.

    4. Named Entity Detection:

      programs contributed in this category should take a plain text file as input and produce a tagged version of the same file as output, enriched with labels identifying the named entities found.
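
    As a concrete illustration of the input/output contract for category 1, the following is a minimal Python sketch using the Pillow library. The fixed global threshold is an arbitrary assumption; this is not intended as a competitive binarization method, only as a template for the expected behaviour.

        # Minimal illustration of the Document Binarization I/O contract:
        # any input image in, a bitonal image out, saved in an open format
        # (here TIFF or PBM, depending on the output filename). The global
        # threshold of 128 is an arbitrary choice for illustration only.
        import sys
        from PIL import Image

        def binarize(input_path: str, output_path: str, threshold: int = 128) -> None:
            gray = Image.open(input_path).convert("L")       # any image -> grayscale
            bitonal = gray.point(lambda p: 255 if p > threshold else 0, mode="1")
            bitonal.save(output_path)                        # e.g. page.tif or page.pbm

        if __name__ == "__main__":
            binarize(sys.argv[1], sys.argv[2])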

    Eligible Software

    Software will only be considered for the contest if it is accessible as a web service. Although this may seem restrictive, there are very simple ways to transform existing code into a web service. One option is for contestants to download and use the wrapper code and set up their own hosting service as described here; in that case, they should be aware that, in order to compete in the contest, their service must be publicly accessible over the Internet. A second option is to agree to have their contribution hosted on the DAE platform. Guidelines for doing so will be provided soon and involve the same wrapper code mentioned before. Interested contestants can contact the contest organizers.
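
    Purely to illustrate the general idea of wrapping existing code as a web service (contest entries should use the wrapper code linked above), the sketch below exposes a binarization routine over HTTP with Flask. The /binarize endpoint, the "image" form field and the port are assumptions made for this example, and the binarizer itself is a placeholder.

        # Illustration only; actual entries should use the provided wrapper code.
        import io
        from flask import Flask, request, send_file
        from PIL import Image

        app = Flask(__name__)

        def my_binarizer(image_bytes: bytes) -> bytes:
            # Placeholder standing in for the contestant's actual algorithm.
            gray = Image.open(io.BytesIO(image_bytes)).convert("L")
            out = io.BytesIO()
            gray.point(lambda p: 255 if p > 128 else 0, mode="1").save(out, format="TIFF")
            return out.getvalue()

        @app.route("/binarize", methods=["POST"])
        def binarize_endpoint():
            image_bytes = request.files["image"].read()
            return send_file(io.BytesIO(my_binarizer(image_bytes)), mimetype="image/tiff")

        if __name__ == "__main__":
            # Self-hosted services must be reachable over the Internet, not only on localhost.
            app.run(host="0.0.0.0", port=8080)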

    Executing and Testing Software

    The contest organizers provide a basic document analysis pipeline, composed of off-the-shelf document analysis web services, and a set of documents for testing. This pipeline successively applies the steps mentioned previously: binarization - text segmentation - OCR - named entity detection.

    The pipeline is available for download here and requires the Taverna engine to be installed. Instructions on how to use Taverna are available here. IMPORTANT: please note that the provided pipelines require valid, public URLs as image input parameters, and not, for instance, local filenames.

    It is understood that contestants may modify and extend this initial pipeline to suit their needs while testing and developing their contribution to the contest. However, only pipelines respecting the I/O constraints described above (between the binarization, segmentation, OCR and NLP steps) will be considered valid entries to the contest; the sketch below illustrates this data flow.
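
    To make these I/O constraints concrete, the following Python sketch shows the data handed from one step to the next. The service URLs and parameter names are invented for illustration; the real pipeline is the Taverna workflow provided above, and the only point being made is the shape of each intermediate result and the fact that the initial input is a public image URL.

        # Schematic data flow between the four steps; URLs and parameter
        # names are hypothetical.
        import requests

        IMAGE_URL = "http://example.org/test-page.png"   # must be publicly reachable

        # binarization: image in, bitonal image out
        bitonal = requests.post("http://example.org/binarize",
                                data={"image_url": IMAGE_URL}).content

        # text segmentation: image in, image stripped of non-text content out
        text_image = requests.post("http://example.org/segment",
                                   files={"image": bitonal}).content

        # OCR: image in, plain-text transcription out
        transcript = requests.post("http://example.org/ocr",
                                   files={"image": text_image}).text

        # named entity detection: plain text in, tagged text out
        tagged = requests.post("http://example.org/ner",
                               data={"text": transcript}).text
        print(tagged)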

    Contributions Required for Entry

    Contestants hosting their software by their own means only need to provide a valid Taverna workflow description file (instructions and tools for verifying validity will be posted soon).

    Contestants requesting that their software be hosted must provide their software, bundled as a web service as described here, together with the same valid Taverna workflow description file.

    Tools and Data Sets

    The data used for this contest is the UNLV dataset, hosted on the DAE platform. In order to help contestants tailor their applications, the dataset contains the full original OCR text interpretations.
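
    One possible way of using these reference transcriptions while tuning an OCR contribution (a suggestion only, not part of the contest protocol) is to compute a simple character error rate against them, as in the sketch below; the sample strings are made up for illustration.

        # Character error rate based on a plain Levenshtein edit distance,
        # for comparing a candidate OCR output against a reference transcription.

        def edit_distance(a: str, b: str) -> int:
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, start=1):
                cur = [i]
                for j, cb in enumerate(b, start=1):
                    cur.append(min(prev[j] + 1,                 # deletion
                                   cur[j - 1] + 1,              # insertion
                                   prev[j - 1] + (ca != cb)))   # substitution
                prev = cur
            return prev[-1]

        def character_error_rate(candidate: str, reference: str) -> float:
            return edit_distance(candidate, reference) / max(len(reference), 1)

        if __name__ == "__main__":
            reference = "Named entities include Lehigh University."
            candidate = "Narned entities include Lehigh Universlty."
            print(character_error_rate(candidate, reference))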

    The only tools provided to the contestants are:

  2. Full Contest Pipeline

    The final full pipeline as used for the contest, including all participating contributions, is available as a Taverna workflow.

  3. Important Dates

    • Public Announcement - Call for Participation: March 15th 2011
    • Availability of Datasets and Tools: May 15th 2011
    • Deadline for Contributions: June 1st 2011
    • Technical Verifications and Test Runs: June 1st-3rd 2011
    • Execution of Competing Algorithms and Compilation of Results: June 3rd-5th 2011

  4. Contact Information

For any further information, questions or concerns, please contact the contest organizers, Bart Lamiroy and Daniel Lopresti at DAREContest2011@cse.lehigh.edu.