Image File Formats
Media Server can process most file formats. However, the file format can affect the accuracy of the OCR.
Images with Lossy Compression
The compression scheme used by JPEG files is generally 'lossy': information is discarded to reduce the file size. The discarded information comes from the fine details of the image. This is usually not a problem for photographs, but in images of text the compression tends to blur the edges of letters significantly. Blurred edges make characters much harder to recognize, which reduces OCR accuracy. Media Server attempts to compensate for blurring; however, where possible it is better to avoid the problem entirely by using a lossless type of image compression.
JPEG compression is also occasionally used inside TIFF files. GIF images can also lose information, because the format is limited to a palette of at most 256 colors; however, only precise color information is lost, which is unlikely to affect OCR accuracy.
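Because JPEG compression can be hidden inside a TIFF container, it can be useful to check source images before sending them for OCR. The following is a minimal sketch, not part of Media Server: the function names tiff_compression and lossy_risk are hypothetical. It assumes the file signature identifies the format and, for TIFF, reads only the Compression tag (259) of the first image directory, where the values 6 and 7 indicate JPEG compression.

```python
import struct

TIFF_COMPRESSION_TAG = 259
TIFF_JPEG_VALUES = {6, 7}  # 6 = old-style JPEG, 7 = JPEG


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a TIFF.

    Returns None if the data is not a TIFF or is truncated.
    Hypothetical helper, for illustration only.
    """
    if data[:4] == b"II*\x00":
        endian = "<"  # little-endian byte order
    elif data[:4] == b"MM\x00*":
        endian = ">"  # big-endian byte order
    else:
        return None
    if len(data) < 8:
        return None
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    if ifd_offset + 2 > len(data):
        return None
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        if entry + 12 > len(data):
            return None
        tag, _field_type, _count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == TIFF_COMPRESSION_TAG:
            # The SHORT value sits in the first two bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # tag absent: the TIFF default is no compression


def lossy_risk(data):
    """Classify image bytes by the kind of information loss described above.

    Returns "jpeg", "tiff-jpeg", "gif", or None when no risk is detected.
    """
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if tiff_compression(data) in TIFF_JPEG_VALUES:
        return "tiff-jpeg"
    return None
```

To check a file, read it in binary mode and pass the bytes to lossy_risk. A result of "jpeg" or "tiff-jpeg" suggests re-exporting the original scan with lossless compression if the source material is still available. A result of "gif" is unlikely to matter for OCR.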
Document File Formats
Document file formats, such as PDF, can contain both image objects and text elements. Image objects are images (such as photos, or scans of printed documents) that a PDF viewer simply displays. Text elements are stored internally as text (for example, ASCII or UTF-8), and a PDF viewer renders them as readable text, just as a word processor does. Media Server runs OCR only on image objects, because it uses KeyView to extract the text from text elements.
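To get a rough idea of whether a PDF contains image objects that OCR would apply to, you can scan the file for image dictionaries. The following is a crude heuristic sketch, not how Media Server or KeyView inspects documents: the function name count_image_objects is hypothetical. It assumes that image objects appear as "/Subtype /Image" entries in unencoded parts of the file, and it does not detect images embedded inline in page content.

```python
import re

# Image objects in a PDF carry a "/Subtype /Image" dictionary entry.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_objects(pdf_bytes):
    """Return the number of image dictionary markers found in the raw bytes.

    Hypothetical helper: a byte-level scan, not a real PDF parser.
    """
    return len(IMAGE_XOBJECT.findall(pdf_bytes))
```

A count of zero suggests the document holds only text elements, so OCR would have nothing to process. A nonzero count does not show that the images contain text, only that image objects are present.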