Optical Character Recognition -_-_- Document Understanding -_-_- Text Searching _-_-_ Digital Libraries _-_-_-
We provide consulting and analysis for Document Imaging, Recognition and Digital Library creation projects. Based on over 15 years of dedicated experience in the field of scanning and OCR, and hands-on knowledge of hundreds of implementations, we provide uncommonly deep and broad insight on document digitization, high volume capture and conversion, and LAN/WAN and Intranet search and retrieval systems.
1996 Clients: Adobe, Bell Atlantic, Canon, Excalibur, Reed Technology & Information Systems, Silicon Biology, University of Pennsylvania, Xerox, Fujitsu, Zylab, ... and you?
Remember to use the BACK Button on your Browser! You can easily open up these articles, read 'em or print 'em, and quickly pop right back to this page by clicking on the BACK Button on your Browser!
Adobe Acrobat 3 offers advances in every aspect, from the newly 'optimized' PDF file format itself, to beefed up versions of Distiller, PDFWriter and the all new Capture Plug-in. Verity's new SearchPDF makes Catalog collections searchable over the web, with highlighted words in the text - and it's free. UNBELIEVABLE!
Full Text Retrieval Engines promise access to the vast and growing universe of information on the Web. Spamdex gunks up the system, but the big engines have already defeated spamdexing. Here's a Seeker's-eye-view of a few of the big engines.
Page & Character Recognition comparison of leading OCR products and Acrobat Capture. Native RTF output of OCR printed to PDF by PDFWriter. All results are untouched. OCR results are designed to be edited; Capture results are designed to be published. The differences are dramatic!
OCR Comparison: Word Accuracy on 10 Documents, from memos to magazines. Commentary on Format Recognition and HTML Output.
Top level brief on Adobe Acrobat Capture and the role it may play in the new application of digital documents on Intranets, including comparisons to alternative means of document storage, access and distribution.
OCR Comparison: Word Accuracy on 27 Documents, from memos to magazines.
There are four ways to VIEW DOCUMENTS: 1. Convert to HTML 'on the fly';
2. Use a Viewer to look at native files; 3. Download files for a special
application; 4. Net-centric files.
Text retrieval, Web-enabled document imaging and Internet document
management and philosophy from many vendors, including Adobe, Excalibur, ZyLAB,
Verity, Open Text, Fulcrum and many others.
Beyond OCR: Direct Path from Paper to Rich Electronic Format - PDF (Portable Document Format), readable by Windows, Mac, Unix GUI and DOS users.
Get Paper into OCR: Large volume scanning and indexing of paper documents. Kofax has always been a leader in scanning and imaging software performance.
Get Paper into OCR: Large volume scanning and indexing of paper documents. End-to-end document digitization requires systematic tools and procedures, high end Image Processing.
Get Paper into OCR: Large volume scanning and indexing of paper documents. These guys have tons of experience, including DocEx, a document exploitation project conducted during the Gulf War.
Beyond OCR: Digital Library creates a Virtual Agency
More than OCR: Digital Library Success Stories. This document includes Web Links in the PDF file for instant access to the inspirational sources of the Digital Library story. This story is a guided tour of ongoing SGML, TEI, PDF, SUPRA and other efforts to create digital libraries.
The free Adobe Acrobat Reader latest versions for Windows, Mac, Unix and DOS are always available at http://www.adobe.com or ftp.adobe.com.
The above files were created in Microsoft Word for Windows and converted in two ways. The documents were "printed" via PDFWriter as Acrobat Portable Document Format files, and they were also imported and saved through SoftQuad HoTMetaL Pro 2.0. The documents themselves are Test Articles for anybody thinking of building a Digital Library. The HTML versions of the OCR Test Reports include tables that collapsed during conversion - HoTMetaL Pro 3 is on the WAY!
Evaluating OCR: WordScan 4.0, the first technology offspring of the new Caere. WordScan 4 is 43% more accurate than WordScan 3 and 58% more accurate than OmniPage Pro 5.
Evaluating OCR: Xerox TextBridge Pro 3, ExperVision TypeReader Pro 3 strong on formatting. Productive packages provide quick payoff by capture of format and content.
Beyond OCR: SGML Encoding to Preserve and Provide Deep Access. "We need to think of pages in books on shelves in libraries, not pages in documents in folders on desktops!"
Pioneering OCR: And other paths to electronic documents. One of the world's largest law firms uses all forms of office automation and document processing strategies, and the NETWORK is the Primary Advantage technology.
Early adopters of Internet technology enjoy outstanding business edge. Intelligent agents, video conferencing, and the Web can all confer substantial competitive leverage.
If you would like to know more about the author, see the Resume of Tony McKinley: tmresume.htm
For the past 15 years I have been dedicated to the Noble Task of turning paper documents into digital form. My Hot Rods were the Compuscan, Hendrix and Dest, the Kurzweil Intelligent Scanning System and the Calera Compound Document Processor. They were all OCR monsters of their time. Today's OCR software finally gets us there. And now that the machines have caught up to the task, the World Wide Web is here to publish all the world's libraries on the Internet.
The Inspiration for this page comes from Buckminster Fuller, in his 1962 book Education Automation. In that typically freewheeling talk Bucky proposed a universally accessible digital library that would enable anyone, anywhere to study, learn and grow. Bucky figured that this intellectual freedom of the masses would bring humanity's best ideas to Reality.
This page is full of my field notes, including test images and results from the on-line OCR Lab, as published in Imaging Magazine, Imaging World, and Work Process Improvement TODAY. The goal of all my work is to transform paper documents into digital documents, from paper to bitmaps to SGML immortality in future electronic libraries, universal fonts of accessible knowledge.
A suite of documents is available here for independent testing and review. These are the images that were used in testing for the above articles on Xerox TextBridge, WordScan, OmniPage, TypeReader and Acrobat Capture. These documents were selected to show relative text and page format recognition on a wide variety of applications, from simple fax memos to complicated lists, newspaper and magazine pages. These pages were chosen because they illustrate the particular challenges to any scanning and recognition system, and they provide a basis for comparison among the leading programs. The original images are here for any interested individual to re-create the experiments and compare the results.
Subjects of OCR and Document Understanding Analysis: Independent Testing and Feedback Requested. ocrimage.htm
Results of OCR and Document Understanding Tests: Independent Feedback Requested. ocresult.htm
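For anyone re-creating the experiments, here is a minimal sketch of how a word accuracy figure could be scored against a hand-corrected reference text. It is not the method used in the published articles. The function name word_accuracy and the sample strings are made up for illustration, and the alignment relies on Python's difflib, which gives a reasonable but not guaranteed-optimal match of the two word sequences.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth words that reappear, in order, in the OCR output.

    Hypothetical helper for illustration; real test reports may count
    errors differently (for example, penalizing inserted words).
    """
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    if not truth_words:
        raise ValueError("ground truth is empty")
    # Align the two word lists and add up the lengths of the matching runs.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


# Made-up sample: two of nine words are misrecognized.
truth = "The quick brown fox jumps over the lazy dog"
ocr = "The quick hrown fox jumps ovcr the lazy dog"
print(f"word accuracy: {word_accuracy(truth, ocr):.1%}")
```

Scoring by whole words rather than characters is the stricter measure: a single wrong letter costs the entire word, which is closer to what a full text search user actually experiences.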
OCR 70,000 pages per week at 99.985% Accuracy. This article describes a high volume OCR scanning production system, including multiple scanners and ExperVision RTK engines on a network.
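To put a figure like 99.985% in perspective, the arithmetic below turns it into an error count. This is only a sketch under two assumptions that are not taken from the article: that the accuracy is measured per character, and that a page averages 2,000 characters.

```python
from fractions import Fraction

PAGES_PER_WEEK = 70_000                  # from the article title
ACCURACY_PERCENT = Fraction("99.985")    # from the article title
CHARS_PER_PAGE = 2_000                   # ASSUMPTION: average characters per page

# ASSUMPTION: accuracy is character-level, so the remainder is the error rate.
error_rate = 1 - ACCURACY_PERCENT / 100
errors_per_page = CHARS_PER_PAGE * error_rate
errors_per_week = PAGES_PER_WEEK * errors_per_page

print(f"errors per 100,000 characters: {float(error_rate * 100_000):g}")
print(f"errors per page: {float(errors_per_page):g}")
print(f"errors per week: {float(errors_per_week):g}")
```

Under those assumptions, even a very high accuracy rate leaves a steady stream of errors at production volume, which is why high volume systems still need a quality control step.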
Scientific journals scanned and converted to highly structured citation and abstract database. A network based Calera M-Professional scanning and OCR system, and the Quality Management concerns in building scientific secondary publications.
On April 1, 1991, funnily enough, the author and John Solomon installed the initial suite of OCR engines on the Network at UNLV to evaluate ALL of the best OCR in the World. Besides the OCR Lab here, one of the only other organizations dedicated to the test and evaluation of true OCR performance offers a ton of research papers for FTP access at ftp.isri.unlv.edu
All of the test images were processed with Acrobat Capture, and are available here in uncorrected PDF output. In addition to demonstrating the performance of Acrobat Capture, the freely distributed Acrobat Reader viewer allows users to easily see what the pages look like.
A laser printed (HP LJ/4P, 600 dpi) three column document, in a tiny Times Roman font. The one kind of document that some OCR programs actually recognize at 100% accuracy.
Of course, the output of computers should be recognizable by computers.
Aviation Week & Space Technology, B-2 color photo. People often say: "Why did they pick this page as an example?" Part of the answer is the perfect example of magazine layout that this page offers, but a big part of the answer involves the way binary images of color magazine pages turn out, and how they make the B-2 virtually indistinguishable in the photo. Stealth via b/w scanning of high quality color image.
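The disappearing B-2 can be sketched in a few lines. The example below assumes a simple fixed-threshold conversion to a binary image, with made-up pixel colors standing in for the aircraft, the background and the white page. The actual scanner settings and photo colors are not known, so the names and values here are hypothetical.

```python
def luminance(rgb):
    """Approximate brightness of an (R, G, B) pixel using the common Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(pixels, threshold=128):
    """Map each pixel to 1 (black) or 0 (white) with a fixed threshold.

    ASSUMPTION: a plain global threshold; real scanners may use
    adaptive thresholding or dithering instead.
    """
    return [[1 if luminance(p) < threshold else 0 for p in row] for row in pixels]


# Hypothetical colors: a dark gray aircraft against a dark blue background,
# with white page margin at both ends of the row.
PAPER = (250, 250, 250)
BACKGROUND = (30, 50, 110)
AIRCRAFT = (60, 60, 70)

row = [PAPER, BACKGROUND, AIRCRAFT, AIRCRAFT, BACKGROUND, PAPER]
print(binarize([row])[0])
```

The aircraft and its background are clearly different colors, yet both fall below the threshold and come out as the same solid black, which is the effect described above.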
A Scientific Journal article, complete with Citation, Title, Authors, Abstract and other special conventions. The text itself contains a lot of Latin and italic, which tests OCR in a linguistic light.
A simple page representative of financial reports and prospectuses. The top half of the page is simple text, the bottom half of the page is what "should be" a simple chart.
Three faxes: a quote, a price list, a memo.
Okay, here's a full page span article from the New York Times. How does it look to you? This is a very important question.
An example of very complex page layout. Illustrates the fact that info thrown in your face now is difficult to access on an ongoing basis.
A common Statement, a pre-printed form filled in by a high speed printer. It is possible to build systems that read these at close to 100% accuracy.
PDF "Portable Document Format" Versions of Published Articles
The Best Site on the Web for all things PDF is EMERGE, where you will find The PDF Zone, the PDF-L mailing list archive (a treasure trove of questions and answers on PDF), the Capture-L mailing list, and all of the powerful Plug-Ins for Acrobat.
If you are interested in learning more about PDF or even trying out the Capture Evaluation program, you should consider browsing to EMERGE.
Originally appeared in Imaging World, 6/95
Caere and Calera, Software Offspring
Originally appeared in Imaging World, 7/95
From Character Recognition to Page Recognition
TextBridge from Xerox and TypeReader from ExperVision
Originally appeared in Imaging World, 10/95
End to End Document Digitization
Adobe Acrobat Capture
All feedback on this page should be directed to Tony McKinley via e-mail to tonymck@imagebiz.com. This page is continuously under construction. More previously published articles on recognition and e-docs will soon be added to this page. This page produced by Intelligent Imaging.
Private, limited access, do not download w/out password.
Alternative site: http://sunsite.unc.edu/elvis/elvishom.html, The Elvis Home Page.
Elvis Navigation: Coming Soon! tonymck.gif
There's a very cool fake Elvis GIF attached here, and as soon as the Web gets faster, and we all get ISDN at home, Elvis will be an Image Map beyond compare.
The greatest recorded song ever: Elvis doing "Unchained Melody" live in Vegas.
The second greatest recorded song ever: Whitney Houston doing the Star Spangled Banner at the Super Bowl.
IMHO