All about problems with PDF Files

Moderator: Jim Bretti



Post by Jim Bretti

In general, a few things can cause problems with PDF files, and there are a few things you can do to work around them. The general problems are:

1) The PDF file's author has turned on copy protection or DRM to prevent the text from being extracted, so that people can't copy it. This has the side effect of preventing TextAloud from getting the text to be spoken. However, you may be able to use the speech functions built into Adobe Reader along with our voices to have the document spoken.

In some cases with PDF files like these, no other program is allowed to open them, but from within Adobe Reader you can copy text to the clipboard and then paste it into TextAloud as an article. This is commonly the case with PDFs from online universities.
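
If you want a quick way to tell whether a file falls into this category, here is a rough sketch in Python. It is not how TextAloud checks anything; it only relies on the fact that an encrypted PDF carries an /Encrypt entry in its trailer, and it simply searches the raw bytes for that name. The function name has_encrypt_entry is made up for this example, and the check can give a false positive if the same characters happen to appear elsewhere in the file.

```python
def has_encrypt_entry(pdf_bytes: bytes) -> bool:
    """Heuristic: report whether the raw PDF bytes mention /Encrypt.

    Assumption: an encrypted PDF names its encryption dictionary with an
    /Encrypt key in the trailer. This is a plain byte search, not a parser,
    so treat the answer as a hint rather than a verdict.
    """
    return b"/Encrypt" in pdf_bytes
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example has_encrypt_entry(open("document.pdf", "rb").read()), where document.pdf is a placeholder name. A True result only says the file is encrypted in some way; it does not tell you which permissions the author allowed.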

2) Some PDF files don't contain text at all, but images scanned from a book or document. The only way TextAloud can handle this kind of file is if you first convert the images to a text document. You'll need OCR (Optical Character Recognition) software to do this. TextAloud does not have any OCR capability, but it is available in some document conversion utilities. The full version of Adobe Acrobat (not the Adobe Reader product) includes OCR capability. OCR features are also available in Solid OCR (http://www.soliddocuments.com/solid-ocr.htm), Nuance Omnipage (http://www.nuance.com/for-individuals/b ... /index.htm) and others. There are also OCR cloud services available if you search for them.

In general, if you import a PDF document into TextAloud and get no text at all, or just 'garbage' characters, there is a good chance that the document is actually an image, and you'll need to use OCR to convert it to text.
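
For anyone who wants to automate that "no text or garbage" check on text they have already extracted, here is a minimal sketch in Python. It is only an illustration, not TextAloud's logic: the function name classify_extracted_text, the set of punctuation counted as readable, and both thresholds are assumptions you would want to tune for your own documents.

```python
def classify_extracted_text(text, min_chars=20, min_readable_ratio=0.7):
    """Return 'empty', 'garbage', or 'ok' for text pulled from a PDF.

    min_chars and min_readable_ratio are arbitrary example thresholds.
    """
    # Ignore whitespace so blank pages count as empty.
    visible = [ch for ch in text if not ch.isspace()]
    if len(visible) < min_chars:
        return "empty"

    # Count letters, digits, and common punctuation as readable.
    readable = sum(
        1 for ch in visible if ch.isalnum() or ch in ".,;:!?'\"()-"
    )
    if readable / len(visible) < min_readable_ratio:
        return "garbage"
    return "ok"
```

A result of 'empty' or 'garbage' suggests the PDF is probably a scanned image and needs OCR first. Note that accented or other non-English letters count as readable here, so some kinds of garbled text may still slip through as 'ok'.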

3) Depending on how the PDF document was actually created, it is possible that TextAloud may not do a good job of extracting the text. For example, spaces between words may be lost, or letters / words in the document may be out of sequence. If this happens, the only alternative is to try another PDF text extraction utility and see if you have better luck. You may find one that handles the text extraction better and saves the text as a .txt / .doc / .html file, which you can then import into TextAloud.

If you need to try another PDF text extraction tool we recommend trying Solid PDF to Word, at http://www.soliddocuments.com/products. ... dPDFtoWord. We've had good luck with this product and there is a trial version available. If you happen to have Office 2013 or later installed, another option is to load the PDF document into Word. Starting with Office 2013, Microsoft Word is able to convert PDF documents. Save the converted Word document, and see if TextAloud is able to load it.
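
If you are comparing the output of several extraction tools, a rough check like the following Python sketch can flag the "lost spaces" symptom described above. It is only a heuristic under stated assumptions: the name looks_like_spaces_lost and both cutoff values are invented for this example, and it does nothing to detect letters or words that are out of sequence.

```python
def looks_like_spaces_lost(text, max_word_len=25, max_long_share=0.1):
    """Guess whether word spacing was lost during text extraction.

    Flags the text when more than max_long_share of its whitespace-separated
    tokens are longer than max_word_len characters. Both limits are
    arbitrary example values.
    """
    words = text.split()
    if not words:
        return False
    long_words = sum(1 for w in words if len(w) > max_word_len)
    return long_words / len(words) > max_long_share
```

Long URLs and file paths also count as long tokens, so a document full of links may be flagged even though its spacing is fine.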

One other thing to note: if the PDF is text but has such tight copy protection that there is no way to get the text out to TextAloud, Adobe Reader does have some limited speaking capabilities called "Read Out Loud" under the View menu. This function uses the default voice in Windows, but you can upgrade that voice with any of the ones we sell at http://nextup.com/TextAloud/SpeechEngine/voices.html
Jim Bretti
NextUp.com