I have many A4 pads of handwritten notes, which I would like to convert into Microsoft Word documents. To type them all in would take a very long time. I’ve noticed that Google’s ability to read text from photos has vastly improved in recent months. Are you aware of a tool from Google or anyone else that can do a good job of this, please? Michael
The idea of converting written or printed text into digital text is generally called OCR for optical character recognition, and it has similar problems to speech recognition. That is to say, if the input is close to perfect, the output can also be close to perfect.
But in practice, it works best when dealing with restricted inputs and/or limited domains. For example, it’s possible to recognise the English names for numbers and the names of major UK cities, especially if you can get people to write each letter in its own little box. The same software wouldn’t have the domain expertise to cope with a Russian-speaking coroner who liked to include Sanskrit quotations in his handwritten autopsies.
OCR works best with high-quality printed materials and worst of all with handwriting, so you’re not starting from the best position. In my experience, you can only get handwriting recognition to work well enough by doing it in real time. That enables you to train the software to recognise your input, while the software also trains you to write characters in ways that it can understand. I’ve had some success with this approach, starting more than a decade ago with Microsoft OneNote (which can also record your voice in sync) running on Windows XP Tablet Edition, and more recently with a Livescribe Echo digital pen and MyScript software. However, all this has more to do with keyboard replacement strategies than with OCR.
It’s generally agreed that the best OCR programs are Abbyy FineReader (£99) and Nuance’s OmniPage 18 (£79.99) and OmniPage Ultimate (£169.99), though neither company’s software is suitable for cursive handwriting recognition. Both companies offer free trial versions so you can test them before you splash out. There’s also CharacTell’s SoftWriting ($49.95), which the company says is for students taking notes in class and professionals taking notes in meetings. But it also says it is designed “for recognising non-connected handwriting and machine-printed text” (their emphasis) so I wouldn’t bet on it reading your handwritten notes.
Like most if not all the programs in this field, SoftWriting has to be trained to recognise your handwriting. When it is processing a document, it will present you with words it doesn’t recognise, so that you can tell it what they are. If you have 250 words on a page and the program miraculously gets 90% of them right, you will still have to correct 25 words.
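To see how that correction workload grows over a whole pad, here is a rough sketch of the arithmetic. The 250 words per page and 90% accuracy come from the example above; the 200-page total is purely a hypothetical figure for illustration, and the function name is my own.

```python
def words_to_correct(words_per_page: int, accuracy: float, pages: int = 1) -> int:
    """Estimate how many words would need correcting by hand.

    accuracy is the fraction of words the OCR program gets right (0.0 to 1.0).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    total_words = words_per_page * pages
    return round(total_words * (1.0 - accuracy))


# One page, using the figures from the text.
print(words_to_correct(250, 0.90))

# A hypothetical 200-page set of notes at the same accuracy.
print(words_to_correct(250, 0.90, pages=200))
```

Even at an accuracy that would be remarkable for handwriting, the corrections run into the thousands once you multiply by a realistic number of pages.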
If you want to try a few pages as an experiment, then you can download FreeOCR for Windows, though be careful not to install any crapware that may be included. FreeOCR is based on the widely used Tesseract OCR engine, which was originally developed by Hewlett-Packard in England in the 1980s. HP made it open source in 2005, and Google now maintains the source code.
You can also use FreeOCR online by uploading PDF files to free-ocr.com. Google Docs and various other services also use the same Tesseract OCR engine.
Wikipedia warns that “Tesseract’s output will be very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots) must be scaled up such that the text x-height is at least 20 pixels, any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract’s binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters.”
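The scaling requirement in that warning is easy to check before you start. The sketch below assumes you have measured the x-height (the height of a lower-case "x") in pixels yourself, for example in an image editor; the function names and the sample measurements are hypothetical. It only works out how much to enlarge the image, and does not do the resizing, deskewing, filtering or border removal.

```python
import math

# Minimum text x-height, in pixels, taken from the guidance quoted above.
MIN_X_HEIGHT_PX = 20


def upscale_factor(measured_x_height_px: float,
                   target_px: float = MIN_X_HEIGHT_PX) -> float:
    """Return the factor by which to enlarge an image so that the text
    x-height reaches the target. Never suggests shrinking (minimum 1.0)."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be a positive number of pixels")
    return max(1.0, target_px / measured_x_height_px)


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return the new (width, height) in pixels, rounded up."""
    return math.ceil(width * factor), math.ceil(height * factor)


# Hypothetical example: a 1000 x 1400 pixel image whose text x-height is 8 pixels.
factor = upscale_factor(8)
print(factor, scaled_size(1000, 1400, factor))
```

In practice it is usually better to rescan at a higher resolution than to enlarge a small image afterwards, since enlarging cannot add detail that was never captured.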
PDFs and scanners
Your handwritten notes would be more useful in Microsoft Word format because you could do lots of things with them. For example, you could change the typeface, size and spacing, correct and amend your notes, add illustrations, and so on. But unless you have extremely neat, clear and very consistent handwriting, that probably won’t be possible. Instead, think about converting them to high-quality, scanned PDF files that you can store on a hard drive or in the cloud.
You can feed these PDF files to OCR software and hope that it will recognise enough words to make your notes searchable. If not, you will probably have to tag them manually. Either way, if someone does come up with an OCR program that can read your handwriting – not impossible, though I’ve already waited 30 years for one – you will be ready with sharp PDF files, rather than curling originals where the paper has aged and the ink has faded.
Of course, if you are going to scan your notes then you must already have a scanner, or be prepared to buy one. A cheap Epson or Canon flat-bed scanner should give good results, though it is time-consuming to scan a lot of pages. If you intend to do a lot of scanning, consider a sheet-fed model like the Brother ADS-2100 (from £222). You can also get scanners that include OCR, such as the Fujitsu ScanSnap iX500 Duplex (from £352), which scans both sides of the paper at once. (The scanner’s OCR software usually runs on your PC.)
If you have to buy a decent scanner and perhaps good quality OCR software for a one-off project, add up the cost and divide it by the number of pages of notes to find the cost per page. It’s a boring job, so perhaps you should add the cost of your time. The result might prompt you to abandon the whole idea, or start looking for a company to do it for you.
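As a worked example of that sum, the sketch below uses two prices mentioned earlier (the Brother ADS-2100 at £222 and Abbyy FineReader at £99). The 500-page total, the five hours of work and the £10 hourly rate are hypothetical figures, so substitute your own.

```python
def cost_per_page(scanner_cost: float, software_cost: float, pages: int,
                  hours: float = 0.0, hourly_rate: float = 0.0) -> float:
    """Return the cost per page in pounds for a do-it-yourself scanning job.

    hours and hourly_rate are optional, for putting a price on your own time.
    """
    if pages <= 0:
        raise ValueError("pages must be a positive number")
    total = scanner_cost + software_cost + hours * hourly_rate
    return total / pages


# Equipment only: £222 scanner plus £99 software over a hypothetical 500 pages.
print(f"£{cost_per_page(222, 99, 500):.2f} per page")

# The same job with five hours of your time valued at £10 an hour (both assumed).
print(f"£{cost_per_page(222, 99, 500, hours=5, hourly_rate=10):.2f} per page")
```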
Most of the companies that provide scanning services cater for businesses that need to clear away large volumes of paper records. However, some cater for low-volume and home users. One example is Oxford-based Scanning Geeks, which charges 25p per page for documents up to A3 in size. (One page means one side of a page.) They can do OCR (“Textual Data Capture”) as well. Ideally, find a good local company where you can drop off your notes securely and collect them afterwards.
It’s an expensive route if you have lots of paper: it could cost £3,000 to scan the contents of a four-drawer filing cabinet. But if you only have 100 to 500 pages of notes to scan, it could be the best option.
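To compare the two routes, here is a rough sketch that prices a scanning service at the 25p per page quoted above and finds the page count at which buying your own equipment would cost the same. It assumes a flat per-page rate with no volume discount or extra charge for OCR, ignores the value of your time, and uses the £222 scanner and £99 software prices mentioned earlier as the do-it-yourself outlay. Amounts are in pence to keep the arithmetic exact.

```python
import math

SERVICE_RATE_PENCE = 25  # per page, as quoted above


def service_cost_pence(pages: int, rate_pence: int = SERVICE_RATE_PENCE) -> int:
    """Return the cost in pence of sending the given number of pages to a service."""
    if pages < 0:
        raise ValueError("pages cannot be negative")
    return pages * rate_pence


def break_even_pages(equipment_cost_pence: int,
                     rate_pence: int = SERVICE_RATE_PENCE) -> int:
    """Return the smallest page count at which the service costs at least
    as much as buying the equipment outright."""
    return math.ceil(equipment_cost_pence / rate_pence)


# The 100 to 500 page range mentioned above.
for pages in (100, 500):
    print(pages, "pages:", f"£{service_cost_pence(pages) / 100:.2f}")

# Assumed DIY outlay: £222 scanner plus £99 software, expressed in pence.
print("Break-even at", break_even_pages(22200 + 9900), "pages")
```

On those assumptions, a service is clearly cheaper for a few hundred pages, and buying equipment only starts to pay for itself well beyond a thousand.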
guardian.co.uk © Guardian News & Media Limited 2010