La Trobe home Digital Colonial Documents

How to Make Digital Documents to Go on the Web

Introduction

The DCD project welcomes proposals to contribute further documents to its site, provided they relate to its aims and objectives. If you are thinking of making such a contribution, please get in contact with us. Do bear in mind that the site is not a money-making venture: we cannot pay for documents to go on it, nor charge people to access them. Also, because of copyright legislation, the documents digitised need to be out-of-copyright originals, and it seems simplest when you also own the materials yourself.

How to Prepare a Document for Digitisation

The following is a brief description of how to digitise a document based on what we have learned during the project.

The best place to start is an original copy of a book, not a photocopy or a reprint. Both of these may look fine to the naked eye, but their print is less sharply defined than in the original.

The OCR (Optical Character Recognition) process works best when the images are saved as plain black and white images (greyscale is not an advantage) at 300 dpi in TIFF format. Some scanner software also includes an option called 'background removal', or some such phrase, for text that is intended for OCR work, and it's a good option to choose.

Flatbed scanners generally give clear, well-focused images and work well as long as you, or your library's conservators, are not concerned about the book being fully opened. Some scanners are much quicker than others at scanning a page, so it's worth trying different ones out to see not only how long a scan takes, but also how long it is before you can make the next scan, something the specifications never seem to tell you.

If your book is tightly bound, another option is to find somebody with an overhead book scanner. However, the focus on these is often hard to get right, and poorly focused images cause problems when making OCR versions of the texts.

We can easily make the PDF images to go on the web: it's just a matter of you supplying us with the TIFF files and us running them through a batch process in Photoshop. If you have Photoshop or Photoshop Elements, you can do it yourself. Note that the image files are easiest to handle when they are named in forms like 000, 001, 002, etc., so that they sort correctly when viewed in folders on computers.
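As an illustration of the naming scheme, here is a minimal Python sketch that renames a folder of scans to 000.tif, 001.tif, 002.tif and so on. It is not part of the project's own workflow. The function names, the three-digit width and the temporary `tmp_` prefix are all assumptions made for the example, and it assumes the existing names already sort into page order once embedded numbers are compared numerically.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares embedded numbers as numbers (scan2 before scan10)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def padded_names(filenames, width=3, suffix=".tif"):
    """Map each filename, in the order given, to a zero-padded name."""
    return {name: f"{index:0{width}d}{suffix}"
            for index, name in enumerate(filenames)}


def rename_scans(folder, width=3, suffix=".tif"):
    """Rename every TIFF in `folder` to 000.tif, 001.tif, ... (hypothetical helper)."""
    folder = Path(folder)
    scans = sorted(
        (p.name for p in folder.iterdir() if p.suffix.lower() in {".tif", ".tiff"}),
        key=natural_key,
    )
    plan = padded_names(scans, width, suffix)
    # First pass: move everything to a temporary name so that an existing
    # 001.tif is never overwritten part-way through the renaming.
    for old in scans:
        (folder / old).rename(folder / ("tmp_" + old))
    # Second pass: give each file its final zero-padded name.
    for old, new in plan.items():
        (folder / ("tmp_" + old)).rename(folder / new)
    return plan
```

Try it on a copy of the scans first, and check a few pages against the book before discarding the original names.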

The OCR Process

The OCR process works best when done in a software package like Fine Reader, which costs around $250 (Australian) for an academic license. The best OCR package keeps changing, though: the previous year the best ordinary product was Omniscan. There are also very much more expensive packages, but these are not essential for most purposes.

Experience suggests that proofreading the text in the OCR package is what takes the most time, and it is the most costly element in making a digital version of a document. Quite how long it takes is unpredictable. A big factor is the quality of the printing and the type of paper in the original. An early 19th century rag paper is generally much better than a mid 19th century wood pulp paper. If the ink has bled into the paper, making the characters slightly fuzzy, it's a problem for the OCR software. Some fonts are also easier to OCR than others: in some fonts, character combinations such as 'fi' are printed very close together and are hard to recognise. Finally, it's apparent that lots of tables slow the process down a lot; plain text is much easier to OCR.

Most OCR packages also offer the possibility of saving in a number of formats. The best format for us is a plain HTML file with no fonts or other formatting in it, and without preserving the original line breaks. You also need to save the text as one HTML file per original page.

If you save the files as Word documents, they will need a lot of extra processing to remove Microsoft-specific code and formatting, so it's best to avoid that if you can.
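To give an idea of what "plain HTML" means here, the following Python sketch reduces one page of OCR output to bare paragraphs, dropping font tags, style blocks, class attributes and line breaks. It is only an illustration under stated assumptions, not the processing we actually run. The class and function names are hypothetical, and it does not rejoin words that were hyphenated across line breaks.

```python
from html import escape
from html.parser import HTMLParser

SKIPPED = {"style", "script", "title"}           # content thrown away entirely
BLOCKS = {"p", "div", "h1", "h2", "h3", "li"}    # tags that start a new paragraph


class PlainPageBuilder(HTMLParser):
    """Collect the text of each paragraph, ignoring all other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = []
        self._skip_depth = 0

    def flush(self):
        # Collapse runs of whitespace, including the original line breaks.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in BLOCKS:
            self.flush()
        elif tag == "br":
            self._current.append(" ")   # a line break becomes a plain space

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCKS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)


def plain_html(source):
    """Return a minimal HTML page containing only <p> paragraphs of text."""
    builder = PlainPageBuilder()
    builder.feed(source)
    builder.close()
    builder.flush()
    body = "\n".join(f"<p>{escape(p, quote=False)}</p>" for p in builder.paragraphs)
    return f"<html>\n<body>\n{body}\n</body>\n</html>\n"
```

Because every tag other than the paragraph-level ones is ignored, tables and other structured layout are flattened to running text, so pages with tables would still need handling by hand.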

The best way to check that the documents read okay in a web browser is then to edit them in an HTML editor like Dreamweaver. That way you can often pick up issues such as line breaks that have been left in the text and need to be removed, or odd characters that need to be resolved.
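A rough automated pre-check can narrow down which pages need a closer look in the editor. The Python sketch below flags leftover `<br>` tags and a few kinds of character that usually indicate an OCR or encoding problem. What counts as "odd" here (control characters, private-use characters, unassigned code points and the Unicode replacement character) is an assumption made for the example. Accented letters are deliberately not flagged, since they are often legitimate.

```python
import re
import unicodedata

BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def is_odd(ch):
    """True for characters that rarely belong in a proofread page."""
    if ch == "\t":
        return False
    return ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Co", "Cn"}


def find_issues(html_text):
    """Return (line_number, description) pairs worth checking by eye."""
    issues = []
    # Split on "\n" only, so that stray control characters stay inside
    # their line and are reported rather than treated as line endings.
    for number, line in enumerate(html_text.split("\n"), start=1):
        line = line.rstrip("\r")
        if BR_TAG.search(line):
            issues.append((number, "leftover <br> line break"))
        for ch in line:
            if is_odd(ch):
                issues.append((number, f"odd character U+{ord(ch):04X}"))
    return issues
```

A check like this only narrows down where to look. It does not replace reading the page in a browser or editor.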

Turning the HTML Pages into a Database

At this point you should be able to make a CD containing one folder of TIFF images of the original pages and another folder of HTML files of the OCR versions of the pages. If you supply that to us, we can run it through some software I wrote which takes the separate pages of HTML and combines them into a database table. Alternatively, I would be happy to let you have a copy of the software so that you can make the table yourself. The pages in the form of a database table can then be read by many database programs. In our case they are imported into the FileMaker database we run on our server, and it's then possible to set up the fields needed in the database: basically just the text of the page, the page number, the title and other bibliographic details, and some housekeeping fields that direct users to where the image files are stored on the server.
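The idea of combining per-page HTML files into a single table can be sketched in a few lines of Python. This is not the software mentioned above, only a hypothetical illustration. The field names, the CSV output format and the assumption that each image shares its page's file name (000.html alongside 000.tif) are all choices made for the example.

```python
import csv
from pathlib import Path

FIELDS = ["page_number", "title", "image_file", "page_html"]


def build_rows(html_folder, title, image_suffix=".tif"):
    """Build one table row per HTML page, in file-name order."""
    rows = []
    for page in sorted(Path(html_folder).glob("*.html")):
        rows.append({
            "page_number": page.stem,                 # e.g. "000"
            "title": title,                           # same for every page
            "image_file": page.stem + image_suffix,   # housekeeping field
            "page_html": page.read_text(encoding="utf-8"),
        })
    return rows


def write_table(rows, csv_path):
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

CSV is used here only because most database programs can import it. Further bibliographic fields could be added as extra columns in the same way as the title.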

Conclusion

If you are interested in doing some or all of this for a particular document, we would be happy to discuss whether we could host the document on the DCD website, and to give you more details of how to prepare it.

Dr Peter G. Friedlander
Asian Studies
La Trobe University, VIC 3086
Australia
Tel: +61 3 9479 2064
Fax: +61 3 9479 1880
Email: p.friedlander@latrobe.edu.au

 


Page maintained by: Project Officer
Last Updated: 25 July, 2006



Project Partners
Curtin University
University of New England
La Trobe University
University of Sydney
