Internet Archive - Tech Support

Why are certain file types not available for non-OCR'able languages?

Asked by Christine Wagner on 2010-11-30

A library partner noticed that two of their items did not show all of the file types available with all of their other items:
http://www.archive.org/details/nir1955yeshiva
http://www.archive.org/details/nir1957yeshiva

The items seemed to have derived okay, but when I looked at the files in each item I noticed that the .djvu and _djvu.txt files were not there. I asked Paul about this, and he told me: "The reason these are missing the djvu files is because the language is not currently OCR'able. Only items which are OCR'd get those derivatives." Both items list Hebrew as their only language.

I told the library partner this on November 18 and they sent an email today with the following:

"I would just like to make sure I fully understand the response that file types other
than 'read online' or PDF aren't available for Hebrew texts since they include
OCR. I'm a little confused because I thought (perhaps incorrectly) since I'm able
to search the contents of the PDF and online file types (though not in Hebrew),
these are also OCR'd. If so, then so I can explain this to my boss when she asks,
why would a Hebrew-only volume not be just viewable (though not searchable) on a
Daisy or Kindle reader or through the very nifty DJVu software...what am I missing?"

I would like to provide them with an answer so that they feel comfortable that they understand why those file types aren't available.

Thanks!

Question information

Language: English

Status: Solved

For: Internet Archive - Tech Support

Assignee: No assignee

Solved by: Hank Bromley

Solved: 2010-12-13

Last query: 2010-12-13

Last reply: 2010-12-09

Hank Bromley (hank-archive) said on 2010-12-09:

I don't understand what kind of searching the partner is able to do in the PDF or BookReader for these items - with no OCR'd text, there's nothing to search.

Those items can't be made into the eBook formats (EPUB, Daisy, MOBI) because those are text-based formats. If all we have are images, there's nothing to make the eBook out of.

As for DjVu, in principle, we could make a DjVu with no embedded text (just like a PDF with no text layer), but as a policy matter we haven't been doing that. We've had relatively little interest from our users in the DjVu format, and the main reason we make it at all now is that it's an intermediate step in generating the djvu.txt "plain text" format. Once we start generating djvu.txt directly from djvu.xml, bypassing the .djvu file, we'll probably stop making .djvu files altogether.
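
The rules described in this reply can be summarized in a minimal Python sketch. This is only an illustration of the reasoning, not the Archive's actual derive code: the function name `expected_derivatives`, the format labels, and the single `has_ocr_text` flag are all assumptions made for the example.

```python
# Hypothetical model of the derive rules described in this thread.
# Not the Internet Archive's real derive code; all names are illustrative.

# eBook formats that are built from text, so they need OCR output.
TEXT_BASED_FORMATS = ("EPUB", "Daisy", "MOBI")


def expected_derivatives(has_ocr_text):
    """Return the formats one would expect for a scanned item."""
    # Image-based outputs only need the page scans.
    formats = ["PDF", "read online"]
    if has_ocr_text:
        # Text-based eBook formats require the OCR'd text.
        formats.extend(TEXT_BASED_FORMATS)
        # By policy, .djvu is only made as a step toward djvu.txt,
        # so both appear only when OCR succeeded.
        formats.extend(["DjVu", "djvu.txt"])
    return formats
```

Under these assumptions, `expected_derivatives(False)` returns only the PDF and "read online" entries, which matches what the partner sees for the two Hebrew-only items.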

Christine Wagner (christine-w) said on 2010-12-13:

Thanks Hank Bromley, that solved my question.
