Document image and unstructured full text capture
Once you have captured the image of a page of a document, the next step up is to attempt to recognise all the text on that page using recognition software.
There are several reasons why users need to do this. First, they may be in a publishing environment where, having captured a document, they want to edit and reformat it before publishing.
Secondly, users may want to search the full text of the document, so they need to recognise the text and load it into a search engine.
The key feature of this application is that the text being captured is unstructured: it could be in any format. The software does not use a document template to recognise specific document formats; it simply examines images and tries to identify text characters.
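As an illustration of the full text search use, the sketch below loads recognised page text into a very small inverted index and queries it. It is a minimal example under stated assumptions, not a description of any particular search engine: the page identifiers, the sample text and the function names (tokenize, build_index, search) are all hypothetical, and the recognition step itself is assumed to have already produced plain text. Note that no template is involved; each page is treated simply as a bag of words.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose recognised text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical recognised text for two scanned pages.
pages = {
    "page-1": "Annual report on water quality",
    "page-2": "Quarterly report on staff training",
}
index = build_index(pages)
```

Calling search(index, "water report") returns only the pages that contain both words. A production search engine would add ranking, stemming and tolerance of recognition errors, none of which is attempted here.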
Document image capture and structured form processing
The next major application is structured form processing. Traditionally, the user scans one or more standard forms and then captures data from defined positions on each form, where the people completing it have been asked to write in variable data.
A form may range from a simple cheque or credit card voucher, through double-sided A4 forms, to 20-page forms such as Census forms.
Key to this traditional forms processing application is the fact that the forms are designed for data capture and are printed as turnaround documents. The user knows that they will be sending a form or forms out to customers, who then complete them and send them back for processing. The user wants to reduce the cost and increase the accuracy of the data capture process by designing the form to facilitate scanning and automated data capture.
The user is therefore prepared to design the form so that it is optimised for automated data capture, and the form design can be pre-defined to the document and data capture system. Because the form template is defined to the system, the system knows that when it scans an image with those characteristics it must look at various boxes and other areas on the form and capture whatever variable data is recorded in those areas of the image.
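The idea of a form template can be sketched as a list of named zones, each giving the position and size of one area to capture. The example below is illustrative only. The Zone fields, the capture_form function and the recogniser passed to it are assumptions made for this sketch, and the "image" is a stand-in grid of characters rather than real pixels, so that the example runs without any imaging or recognition library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """One area of the form, in row/column units of the stand-in image."""
    name: str
    top: int
    left: int
    height: int
    width: int


def crop(image, zone):
    """Return the part of the image that falls inside the zone."""
    rows = image[zone.top:zone.top + zone.height]
    return [row[zone.left:zone.left + zone.width] for row in rows]


def capture_form(image, template, recognise):
    """Run the recogniser on each zone and collect the results by field name."""
    return {zone.name: recognise(crop(image, zone)) for zone in template}


def fake_recognise(region):
    """Stand-in for recognition software: the 'pixels' are already characters."""
    return " ".join(line.strip() for line in region).strip()


# Hypothetical two-line form; each character stands for one image cell.
page = [
    "NAME:  J SMITH      |",
    "DATE:  01/02/03     |",
]
template = [
    Zone("name", top=0, left=7, height=1, width=13),
    Zone("date", top=1, left=7, height=1, width=13),
]
captured = capture_form(page, template, fake_recognise)
```

Because every zone is tied to a fixed position, the sketch also shows why a change to the printed form design forces a change to the template.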
The benefits of such systems are that, when well designed, they can deliver fast, low-cost and accurate data capture, plus archive images so that the paper forms can be destroyed.
The disadvantage of such systems is that it takes a long time to design the forms, and any change to the form design leads to major system changes which are time-consuming and costly. Such systems only work with well-defined turnaround documents printed in high volume.
Document image and semi-structured data capture
The next application is a newer area, designed to address the requirement for users to capture data from a range of forms which they do not produce and hence do not control.
This market is trying to address applications where all the documents to be processed share common characteristics but differ in detail. An example is supplier invoice processing, where invoices share common characteristics but differ in exactly where the supplier number or the line item data is held on each invoice. Another example is direct debit forms from different gas or electricity regions, where there will be some similarities and some differences.
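One possible way to handle such documents, shown here purely as an assumption and not as a description of how any particular product works, is to look for a field by its printed label rather than by a fixed position. The sketch below works on lines of already recognised text. The label lists, field names and sample invoices are all hypothetical.

```python
import re

# Hypothetical label variants; a real system would hold many more per field.
FIELD_LABELS = {
    "supplier_number": ["supplier no", "supplier number", "vendor no"],
    "invoice_total": ["total due", "invoice total", "amount due"],
}


def find_field(lines, labels):
    """Return the text following the first matching label, or None if absent."""
    for line in lines:
        for label in labels:
            # \b stops "supplier no" from matching inside "supplier number".
            pattern = re.escape(label) + r"\b\s*[:.]?\s*(.+)"
            match = re.search(pattern, line, re.IGNORECASE)
            if match:
                return match.group(1).strip()
    return None


def capture_invoice(lines):
    """Look up every known field wherever its label appears on the page."""
    return {field: find_field(lines, labels)
            for field, labels in FIELD_LABELS.items()}


# Two hypothetical invoices with the same data in different places.
invoice_a = ["ACME Ltd", "Supplier No. 4471", "Widgets x 10", "Total due: 120.50"]
invoice_b = ["AMOUNT DUE 98.00", "Bolt Co", "Vendor No: B-200"]
```

Both invoices yield the same two fields even though the labels, their wording and their order on the page differ. A field whose label is not found is returned as None, so it can be referred for manual keying.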


