Hi,
I've been trying to write a program which extracts text from pdfs.
Any advice?
How is the information stored in the file exactly?
It could be either in the metadata or in the actual page text. I wrote the JPedal text extraction, so I'm happy to help if you can supply some more details.
Text extraction is a slightly complicated issue. There is an article on the JPedal blog explaining why: "PDF format and style information" (Java PDF Blog).
If you want to use JPedal for text extraction, there are a number of tutorials under "Java PDF Extraction Tutorials" (Java PDF Library Tutorial). If you need any help, please post on the forums and I will try to help you further (https://idrsolutions.fogbugz.com/default.asp?support).
You can also try the PDFBox forums if you want to use PDFBox. I've always found them a friendly, helpful bunch.
If you want to extract text with JPedal, there is a tutorial showing how to use the built-in example (with a link to the source so you can study or change it) at "PDF to text conversion" (Java PDF Library Tutorial). You can also find text on the page, or extract the page as a set of words and locations for indexing; see "pdf to text as wordlist" (Java PDF Library Tutorial).
PDFBox has a whole load of tutorials on the Apache PDFBox site ("Apache PDFBox - Java PDF Library").
The tricky bit is how to isolate the information on the page. Is it always in the same place, or in a certain format? If so, you could look at the XML tagging or the content around it. Otherwise I cannot see how you can automate the process.
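To illustrate the "content around it" idea, here is a rough sketch in plain Java. It assumes you already have the page text as a String from whichever library you use, and that the value you want sits on the same line as a fixed label followed by a colon. The class name, method name and sample text are made up for the example; none of this is JPedal or PDFBox API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JPedal or PDFBox.
class LabelExtractor {

    // Returns the text that follows "label:" on the same line, if any.
    // Assumes pageText was already extracted from the PDF by some library.
    static Optional<String> valueAfterLabel(String pageText, String label) {
        Pattern p = Pattern.compile(
                "(?m)^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
        Matcher m = p.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one extracted page.
        String pageText = "Invoice\nInvoice No: 12345\nDate: 2011-03-01\n";
        System.out.println(valueAfterLabel(pageText, "Invoice No").orElse("(not found)"));
    }
}
```

This only works if the label text is stable from file to file. If the layout varies, you are back to needing either the tagging or fixed co-ordinates.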
If you run the JPedal Viewer (java -jar jpedal.jar, or double-click on the jar), you can use it to get the screen co-ordinates (they are displayed bottom left as you move the cursor). You can then extract the text from that zone.
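As a rough sketch of what extracting from a zone amounts to, the plain Java below filters a list of words by position. It assumes you already have the words with their page co-ordinates (for example from a wordlist extraction) and that they arrive in reading order. The Word class and textInZone method are hypothetical names for the example, not JPedal or PDFBox classes, and the co-ordinates are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JPedal or PDFBox.
class ZoneFilter {

    // A word plus one anchor point on the page (assumed: its bottom-left corner).
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins, in input order, the words whose anchor point lies inside the
    // rectangle. The two corners may be given in either order.
    static String textInZone(List<Word> words, double x1, double y1, double x2, double y2) {
        double minX = Math.min(x1, x2), maxX = Math.max(x1, x2);
        double minY = Math.min(y1, y2), maxY = Math.max(y1, y2);
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            if (w.x >= minX && w.x <= maxX && w.y >= minY && w.y <= maxY) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up words and co-ordinates standing in for one extracted page.
        List<Word> words = Arrays.asList(
                new Word("Total", 72, 700),
                new Word("42.00", 150, 700),
                new Word("Footer", 72, 40));
        System.out.println(textInZone(words, 50, 690, 200, 710));
    }
}
```

Check which corner of the page your co-ordinates are measured from before reusing numbers read off a viewer, since conventions differ between tools.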