Extract PDF keywords

Asked by bouvard

Hey, I just came across Referencer and I think it has the potential to be a really great organizational tool. I'm not in the sciences, but I'm interested in using it as a general-purpose organizational utility for a large archive of documents that have been scanned to PDF. I was wondering about the possibility of extracting the embedded Keywords field from the PDF metadata and using it to auto-generate tags. I was hoping that I could implement this as a Python plugin, but it doesn't look like the hooks you have implemented extend that far. I'm afraid I don't have the chops with C/C++ to hack that out myself, but just browsing through the code I notice that your PDF import code uses poppler, which I'm reasonably certain supports extracting the keyword data, and that should make the whole thing fairly trivial to implement. Thanks in advance for any consideration you give this.

Launchpad Janitor (janitor) said :

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

John S (jcspray) said :

Most PDFs that I have come across don't have anything meaningful in the metadata other than "Created with Adobe Distiller" or somesuch, so it is ignored. It sounds like you have quite particular requirements, outside the design goal of streamlining working with academic documents, so I'm afraid you're on your own.
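
For anyone in the same position, the keyword extraction described in the question can be roughed out outside Referencer. The following is only a sketch, not part of Referencer or its plugin API: it assumes poppler's `pdfinfo` command-line tool is installed and on the PATH, that the file has a Keywords field, and that keywords are separated by commas or semicolons. The function names are made up for illustration.

```python
import re
import subprocess


def keywords_from_pdfinfo(output):
    """Return the raw Keywords value from pdfinfo-style output, or ''."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Keywords":
            return value.strip()
    return ""


def split_tags(raw):
    """Split a keywords string on commas or semicolons into unique tags."""
    tags = []
    seen = set()
    for part in re.split(r"[;,]", raw):
        tag = part.strip()
        # Compare case-insensitively so "Land" and "land" become one tag.
        if tag and tag.lower() not in seen:
            seen.add(tag.lower())
            tags.append(tag)
    return tags


def tags_for_pdf(path):
    """Run pdfinfo (assumed to be on PATH) and return tags for one file."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return split_tags(keywords_from_pdfinfo(result.stdout))
```

Calling `tags_for_pdf("scan.pdf")` on a file whose Keywords field is set would return a list of tag strings. Getting those tags into Referencer is a separate problem that this sketch does not address.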
