Introduction
The DCD project welcomes proposals to contribute further documents
to its site, provided they relate to its aims and objectives.
If you are thinking of making such a contribution, please get in
contact with us. Do bear in mind that the site is not a money-making
venture: we cannot pay for documents to go on it, nor charge
people to access them. Also, because of copyright legislation,
the documents digitised need to be out-of-copyright originals,
and it is simplest if you also own the materials yourself.
How to Prepare a Document for Digitisation
The following is a brief description of how to digitise a document
based on what we have learned during the project.
The best place to start is an original copy of a book, not
a photocopy or a reprint. Both may look fine to the naked eye,
but their print is not as sharply defined as in the original.
The OCR (Optical Character Recognition) process works best when
the images are saved as plain black and white images (greyscale
is not an advantage) at 300 dpi in TIFF format. Some scanner software
also includes an option called 'background removal', or some such
phrase, for text intended for OCR work, and it's a good option
to choose.
Flatbed scanners generally give good, clear, well-focused images
and work well as long as you, or your library's conservators,
are not concerned about the book being fully opened. You will find
that some scanners are much quicker than others at scanning
a page. It's worth trying different ones to see not only how long
a scan takes, but also how long it is before you can make
the next scan, something the specifications never seem to tell you.
If your book is tightly bound, another option is to find somebody
with an overhead scanner for books. However, the focus on these
is often hard to get right, which causes problems when
making OCR versions of the texts.
We can easily make the PDF images to go on the web. It's just a
matter of you supplying us with the TIFF files and us running them
through a batch process in Photoshop; or, if you have Photoshop or
Photoshop Elements, you can do it yourself. Note that the image
files are easiest to handle when they are named in forms like 000,
001, 002, etc., so that they sort correctly when viewed in folders
on computers.
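As an illustration of the naming scheme, here is a minimal Python sketch
that renames a folder of scans to 000.tif, 001.tif, 002.tif and so on.
It is not part of the project's own tooling. The function names are
invented for this example, and it assumes that the existing file names
already reflect the page order once the numbers in them are compared
numerically (so that scan2 comes before scan10).

```python
import re
from pathlib import Path


def natural_key(name):
    # Split "scan10.tif" into ["scan", 10, ".tif"] so that 2 sorts before 10.
    parts = re.split(r"(\d+)", name)
    # Odd positions hold the digit runs captured by the pattern.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def sequential_names(filenames, width=3, suffix=".tif"):
    """Map each original file name to a zero-padded name: 000.tif, 001.tif, ..."""
    ordered = sorted(filenames, key=natural_key)
    return {old: f"{index:0{width}d}{suffix}" for index, old in enumerate(ordered)}


def rename_scans(folder, width=3, suffix=".tif"):
    """Rename every TIFF in `folder` (a hypothetical path) to a sequential name."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir()
             if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}]
    mapping = sequential_names(names, width, suffix)
    # Go through temporary names first, so that a file is never overwritten
    # when a target name such as 001.tif is already in use.
    for old in mapping:
        (folder / old).rename(folder / (old + ".renaming"))
    for old, new in mapping.items():
        (folder / (old + ".renaming")).rename(folder / new)
    return mapping
```

It is safest to try this on a copy of the folder first, and to check
the returned mapping against the book before sending the files on.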
The OCR Process
OCR works best when done in a software
package like FineReader, which costs around $250 (Australian) for
an academic licence. The best OCR packages keep changing, though:
the previous year the best ordinary product was Omniscan. There are
also much more expensive packages, but these are not
essential for most purposes.
Experience suggests that it's actually the time somebody takes
to proofread the text in the OCR package that is the longest
and most costly element in the process of making a digital
version of a document. Quite how long it takes is unpredictable.
A big factor is the quality of the printing and the type of paper
in the original. An early 19th century rag paper
is generally much better than a mid 19th century wood pulp paper.
If the ink has bled into the paper, making the characters slightly
fuzzy, it's a problem for the OCR software. Also, some fonts are
easier to OCR than others: in some, character combinations
such as 'fi' are printed very close to each other and are hard
to recognise. Finally, it's apparent that lots
of tables slow the process down a lot; plain text is much
easier to OCR.
Most OCR packages also offer the possibility of saving in a number
of formats. The best format for us is a plain HTML file with no
fonts or other formatting in it, and without the original
line breaks preserved. You also need to save the text as one HTML
file per original page.
If you save the files as Word documents they will then need lots
of extra processing to remove Microsoft-specific code and formatting
from them, so it's best to avoid that if you can.
The best way to check that the documents then read properly in a web
browser is to edit them in an HTML editor like Dreamweaver. That way
you can often pick up issues such as line breaks which
have been left in the text and need to be removed, or odd characters
which need to be resolved.
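Before opening each page in an editor, a rough first pass over the
files can point to the pages most likely to need attention. The
following Python sketch is only an illustration, not something the
project uses: the function name is invented here, and it assumes that
leftover line breaks show up as <br> tags and that any character
outside plain ASCII is worth a second look (which will also flag
legitimate accented letters).

```python
import re

# Matches <br>, <BR>, <br/> and <br /> as saved by different packages.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def check_page(html):
    """Return a list of warnings for one page of OCR output saved as HTML."""
    warnings = []
    br_count = len(BR_TAG.findall(html))
    if br_count:
        warnings.append(f"{br_count} <br> tag(s) found")
    # Flag control characters and anything beyond printable ASCII.
    odd = sorted({ch for ch in html
                  if ord(ch) > 126 or (ord(ch) < 32 and ch not in "\n\r\t")})
    if odd:
        codes = " ".join(f"U+{ord(ch):04X}" for ch in odd)
        warnings.append("unusual characters: " + codes)
    return warnings
```

A page that returns an empty list still needs proofreading; the check
only catches the two kinds of leftover mentioned above.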
Turning the HTML Pages into a Database
So hopefully at this point you can make a CD with one folder of
TIFF images of the original pages and another folder of HTML files
of the OCR versions of the pages. If you supply that to us, we can
then run it through some software I wrote which takes
separate pages of HTML and combines them into a database table.
Alternatively, I would be happy to let you have a copy of the software
so you could make the table yourself. The pages in the form of a
database table can then be read by many database programs. In our case
they can be imported into the FileMaker database we run on our server,
and it's then possible to make up the fields which are needed in the
database: basically just the text page, the page number, the title and
other bibliographic details, and some housekeeping fields to
direct users to where the image files are stored on the server.
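To give an idea of what combining the pages into a table involves,
here is a minimal Python sketch. It is not the software mentioned
above, whose workings are not described here. It assumes the HTML
files are named 000.html, 001.html and so on to match the images,
and the column names and the CSV output format are choices made for
this example only.

```python
import csv
from pathlib import Path

# Column names are illustrative, not the project's actual field names.
FIELDS = ["title", "sequence", "image_file", "text"]


def build_rows(pages, title, image_suffix=".tif"):
    """Turn {file stem: HTML text} into one table row per page, in order."""
    rows = []
    for stem in sorted(pages):
        rows.append({
            "title": title,
            "sequence": int(stem),              # assumes numeric names like 000
            "image_file": stem + image_suffix,  # matching scan of the page
            "text": pages[stem],
        })
    return rows


def write_table(html_folder, csv_path, title):
    """Read every .html file in a folder and write the rows to a CSV file."""
    pages = {p.stem: p.read_text(encoding="utf-8")
             for p in Path(html_folder).glob("*.html")}
    rows = build_rows(pages, title)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A CSV file is used here only because most database programs can import
one. Note that the sequence number is the position of the scan, which
will usually differ from the page number printed in the book.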
Conclusion
If you are interested in doing some or all of this for
a particular document, we would be happy to discuss with you whether
we could host the document on the DCD website,
and to give more details of how to prepare it.
Dr Peter G. Friedlander
Asian Studies
La Trobe University, VIC 3086
Australia
Tel: +61 3 9479 2064
Fax: +61 3 9479 1880
Email: p.friedlander@latrobe.edu.au