Searching for a PDF reader web script
adib0313
08-16-2011, 12:40 PM
Hello, I'm not sure whether this is the right place to ask this question, but I have been searching for a script that can do the following:
Load a PDF file and be able to:
select a portion of the PDF file and read the text and images in it.
For example: say I have a PDF file of a newspaper page and I want to select an article from it, grab the text and the images, and save them to the database.
The selection needs to recognize the font size and weight so it can distinguish the title of the article from the body text.
All of this needs to be done in a web interface.
In simple terms, I need a PDF cropping script that can read the text and store it.
Please help and advise on any available script that offers these functions.
I have been searching on the net but have had no luck so far.
I hope I have explained this clearly.
Thanks
djr33
08-16-2011, 01:44 PM
That will be very difficult. I'm not even sure where you'd start. There are PHP extensions available to create PDF files, but I've never seen anything for reading and modifying existing PDF files. PDFs are not designed as an intermediate format: they are designed for final display of information to the user, and sometimes this means the file won't even have the raw information available. It may be password protected, it may be encoded in a strange format, or the text may have been converted to an image.
That said, this may be possible, at least in some cases. However, it probably isn't something you can get a "script" for in any basic sense. Certainly not JavaScript. Maybe with PHP or another server-side language (Java might be a good place to look since it's a more general language than PHP and others), but the approach that might be most useful is finding an independent program that can do this and can be run from the command line. Then use PHP or another language to interface with it and you can do what you need. But you still need to actually find a program that does what you want, and I don't know of any, regardless of the web interface aspect. PDFs aren't meant to be taken apart, but maybe there is a program out there that can.
I hope that helps you get started.
One possible alternative, if the most important aspect is cropping, is a PDF-to-image converter. You can then fairly easily (relatively speaking) crop that image and present it to the user. This still involves some complicated uses of PHP and potentially external programs run through the command line, but it has been done, and there are examples and software available to help. This will not extract the text for you or "deconstruct" the PDF. Theoretically you could then use OCR (image-to-text recognition) to get the text out, but that sounds like a very complicated, probably excessive, approach.
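If you go the image route, one detail to get right is mapping the user's selection onto the rendered image. Here is a minimal Python sketch of that arithmetic, assuming the selection is given in PDF points (1/72 inch) measured from the top-left corner of the page, and that the page was rendered at a known DPI. The `Box` type, the function name, and the sample numbers are hypothetical, just for illustration.

```python
from typing import NamedTuple

POINTS_PER_INCH = 72  # one PDF point is 1/72 of an inch


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def points_to_pixels(box: Box, dpi: int) -> tuple[int, int, int, int]:
    """Scale a selection in PDF points to pixel coordinates at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return (
        round(box.left * scale),
        round(box.top * scale),
        round(box.right * scale),
        round(box.bottom * scale),
    )


# Hypothetical selection: one article on a newspaper page rendered at 144 DPI.
article = Box(left=36, top=72, right=306, bottom=396)
print(points_to_pixels(article, dpi=144))
```

The resulting tuple is in the (left, top, right, bottom) order that most image libraries expect for a crop rectangle.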
adib0313
08-17-2011, 09:07 AM
Let me explain it more simply. All I am looking for is a script that can read a PDF file's text and identify the text size and weight, so it can tell which part is the title and which is the content.
I don't need it converted to an image. It just needs to recognize the portion of the PDF file that I select.
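To make the requirement concrete, here is a rough Python sketch of the logic being asked for. It assumes some extraction tool has already produced a list of text spans with a position, font size, and bold flag. The `Span` structure, the function names, and the sample data are all hypothetical, and the title rule (larger than the most common size, or bold at that size) is only a heuristic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float     # left edge of the span on the page
    y: float     # top edge of the span on the page
    size: float  # font size in points
    bold: bool   # True when the font weight is bold


def spans_in_region(spans, left, top, right, bottom):
    """Keep only spans whose top-left corner falls inside the selected rectangle."""
    return [s for s in spans if left <= s.x <= right and top <= s.y <= bottom]


def split_title_and_body(spans):
    """Split spans into (title, body) using font size and weight as a heuristic."""
    if not spans:
        return "", ""
    # The body size is taken to be the size that covers the most characters.
    weights = Counter()
    for s in spans:
        weights[s.size] += len(s.text)
    body_size = weights.most_common(1)[0][0]
    title, body = [], []
    for s in spans:
        is_title = s.size > body_size or (s.bold and s.size >= body_size)
        (title if is_title else body).append(s.text)
    return " ".join(title), " ".join(body)


# Hypothetical spans from a newspaper page.
page = [
    Span("City Council Approves Budget", 40, 100, 24, True),
    Span("The council voted on Tuesday to approve", 40, 140, 10, False),
    Span("the budget for the coming year.", 40, 154, 10, False),
    Span("Weather: sunny", 400, 100, 10, False),
]
selected = spans_in_region(page, left=30, top=90, right=300, bottom=200)
title, body = split_title_and_body(selected)
print(title)
print(body)
```

The hard part is still producing those spans from a real PDF. The sketch only covers what happens once the font information is available.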
djr33
08-17-2011, 01:47 PM
I explained above: this is very difficult and not the intent of PDFs. They are a visual display format and their code is not meant to be accessible (in fact, it seems just the opposite). I understand that images may not help you, but I don't know of another way to extract the information. Most of my post (except the "alternative" offered at the end) is NOT about images, so please read it carefully first.
Your best bet is to find an external executable file (program) that can do what you need, then use a server-side language like PHP to interface with it. There is no simple answer, but that might work.
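As a rough illustration of that approach, the Python sketch below wraps a command-line tool. It assumes Poppler's `pdftotext` utility is installed. That tool is my example and was not named in this thread. It also assumes the crop options are given at the tool's default resolution of 72 DPI, so the numbers are effectively PDF points. The function names and the file name are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, page, x, y, width, height):
    """Build a pdftotext call that extracts text from one rectangle on one page."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),     # restrict to a single page
        "-x", str(x), "-y", str(y),           # top-left corner of the crop area
        "-W", str(width), "-H", str(height),  # size of the crop area
        "-layout",                            # keep the physical layout of the text
        pdf_path,
        "-",                                  # write the text to stdout
    ]


def extract_region_text(pdf_path, page, x, y, width, height):
    """Run pdftotext and return its output, or None if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    command = build_pdftotext_command(pdf_path, page, x, y, width, height)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# "newspaper.pdf" is a hypothetical file name used only for illustration.
print(build_pdftotext_command("newspaper.pdf", 1, 36, 72, 270, 324))
```

The same pattern applies from PHP or any other server-side language: build the argument list, run the program, and read its output. Plain text output of this kind does not carry font size or weight, so it only covers the "read the text in a region" half of the problem.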
adib0313
08-18-2011, 08:53 AM
I have seen some PHP scripts that grab the text from a PDF. They basically extract the text, but this didn't work for PDF files that contain images.
I have also seen scripts that convert PDF to HTML, which would help me too, since they recognize the fonts and the images, but all the ones I have seen are Windows software.
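For what it's worth, a converter that keeps font information makes the title detection fairly mechanical. The Python sketch below assumes XML shaped like the output of Poppler's `pdftohtml -xml`, with `fontspec` elements that carry a size and `text` elements that reference a font id. The sample XML is hand-written to resemble that output, real files may differ, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the assumed converter output.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="24" family="Times" color="#000000"/>
    <fontspec id="1" size="10" family="Times" color="#000000"/>
    <text top="100" left="40" width="300" height="26" font="0"><b>City Council Approves Budget</b></text>
    <text top="140" left="40" width="260" height="12" font="1">The council voted on Tuesday.</text>
  </page>
</pdf2xml>"""


def read_text_runs(xml_string):
    """Return each text run with its position, font size, and bold flag."""
    root = ET.fromstring(xml_string)
    # Collect font sizes for the whole document, keyed by font id.
    sizes = {f.get("id"): float(f.get("size")) for f in root.iter("fontspec")}
    runs = []
    for element in root.iter("text"):
        runs.append({
            "text": "".join(element.itertext()).strip(),
            "left": int(element.get("left")),
            "top": int(element.get("top")),
            "size": sizes.get(element.get("font")),
            "bold": element.find(".//b") is not None,  # bold runs are wrapped in <b>
        })
    return runs


for run in read_text_runs(SAMPLE_XML):
    print(run)
```

Each run carries a position and a size, which is enough to filter by a selected rectangle and to separate large or bold headline text from body text.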
So it seems it is possible to extract content from a PDF. I would just like to know whether there are scripts I haven't run into while searching.
I have developers who can modify a script, but I need a base that provides the core functionality.
djr33, thanks for your input. It would be great if you could help me find a good solution, because you might have a different technique for finding scripts. Open source scripts are preferred.
Thanks!
djr33
08-18-2011, 10:00 AM
If you have seen these scripts, then that's where to start. I'm guessing they rely on an external program used via PHP with command line controls. That's fine.
If a PHP-only solution exists, I would be skeptical about it. The benefit of PDF is that it displays consistently, and non-professional PDF software may not handle certain PDF features. You can probably get some data out, but you should test thoroughly before you rely on it as a way to read any PDF. This may apply to any solution you find, even a paid one. Read the information about it carefully.
There's also the fact that some encodings may use special embedded fonts or even images so there's no way to be certain that you have extracted all of the information, since PDFs are strictly a visual end-format. But you can probably come close if you find the right methods.
My only experience is with generating PDFs in PHP, though I have casually looked for PDF readers and not found much. Even the tools for generating PDFs in PHP (with or without an external command-line utility) tend to lack certain important features. One in particular had limitations with tables.
I hope this is helping. I don't mean to sound discouraging: this is just hard. It would be great if PDFs were a more open format but unfortunately they're not. (That may be part of why they are so consistent, but it's still frustrating.)