Why are certain file types not available for non-OCR'able languages?

Asked by Christine Wagner

A library partner noticed that two of their items did not show all of the file types available with all of their other items:
http://www.archive.org/details/nir1955yeshiva
http://www.archive.org/details/nir1957yeshiva

The items seemed to have derived okay, but when I went into each item to look at the files I noticed that the .djvu and _djvu.txt files were not there. I asked Paul about this, and he told me: "The reason these are missing the djvu files is because the language is not currently OCR'able. Only items which are OCR'd get those derivatives." Both items are listed as having only Hebrew as a language.

I told the library partner this on November 18 and they sent an email today with the following:

"I would just like to make sure I fully understand the response that file types other
than 'read online' or PDF aren't available for Hebrew texts since they include
OCR. I'm a little confused because I thought (perhaps incorrectly) since I'm able
to search the contents of the PDF and online file types (though not in Hebrew),
these are also OCR'd. If so, then so I can explain this to my boss when she asks,
why would a Hebrew-only volume not be just viewable (though not searchable) on a
Daisy or Kindle reader or through the very nifty DJVu software...what am I missing?"

I would like to provide them with an answer so that they feel comfortable that they understand why those file types aren't available.

Thanks!

Question information

Language: English
Status: Solved
For: Internet Archive - Tech Support
Solved by: Hank Bromley
Best Hank Bromley (hank-archive) said :
#1

I don't understand what kind of searching the partner is able to do in the PDF or BookReader for these items - with no OCR'd text, there's nothing to search.

Those items can't be made into the eBook formats (EPUB, Daisy, MOBI) because those are text-based formats. If all we have are images, there's nothing to make the eBook out of.

As for DjVu, in principle, we could make a DjVu with no embedded text (just like a PDF with no text layer), but as a policy matter we haven't been doing that. We've had relatively little interest from our users in the DjVu format, and the main reason we make it at all now is that it's an intermediate step in generating the djvu.txt "plain text" format. Once we start generating djvu.txt directly from djvu.xml, bypassing the .djvu file, we'll probably stop making .djvu files altogether.
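
To make the dependency concrete, here is a minimal Python sketch of the rule described above. It is not the Archive's actual derive code: the function names, the format labels, and the `make_textless_djvu` flag are all hypothetical and only restate what this answer says (text-based formats need OCR'd text; a textless DjVu is possible in principle but is not made as a matter of policy).

```python
# Illustrative model only; names and labels are assumptions, not real derive code.

# Formats that can be built from page images alone.
IMAGE_BASED = ["PDF", "BookReader"]
# Formats that need OCR'd text to exist at all.
TEXT_BASED = ["EPUB", "Daisy", "MOBI", "djvu.txt"]


def expected_formats(has_ocr_text, make_textless_djvu=False):
    """Return the formats one would expect to see for an item.

    make_textless_djvu models the policy choice mentioned above: a DjVu
    with no embedded text could be made, but currently is not.
    """
    formats = list(IMAGE_BASED)
    if has_ocr_text:
        # The .djvu file is the intermediate step toward djvu.txt.
        formats += ["DjVu"] + TEXT_BASED
    elif make_textless_djvu:
        formats.append("DjVu")
    return formats


def missing_formats(has_ocr_text):
    """Formats that are absent compared with a fully OCR'd item."""
    present = set(expected_formats(has_ocr_text))
    return [f for f in expected_formats(True) if f not in present]


# A Hebrew-only item, for which no OCR text is produced:
print(missing_formats(has_ocr_text=False))
```

Under these assumptions, an item without OCR text keeps only the image-based formats, which matches what the partner saw for the two Hebrew items.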

Christine Wagner (christine-w) said :
#2

Thanks Hank Bromley, that solved my question.