Thread: PDFBox Problem
- 11-24-2011, 02:50 PM #1
PDFBox Problem
Hi!
I'm using PDFBox to convert PDF files to plain text. It works, but my task requires keeping some formatting information such as bold, italics, or character size (this information could be stored as a special character in front of the affected words). Do you have any idea how to do this?
Thank you!!
- 11-27-2011, 03:28 AM #2
Re: PDFBox Problem
Are you trying to modify a PDF or are you just losing font information on extracted text? Is the font information embedded? Do you have any samples of your text extraction code or a PDF you're extracting?
- 11-27-2011, 10:33 AM #3
Re: PDFBox Problem
I wonder what the values of those "special characters" would be if you want to handle Unicode characters gracefully; also, putting those "special values" in front of "words" forces you to recognize "words" in arbitrary text. Why not use HTML for this purpose?
kind regards,
Jos
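To make the HTML suggestion concrete, here is a minimal sketch of how styled text could be written out as HTML instead of using ad hoc marker characters. It does not call PDFBox at all: `StyledRun` and `HtmlRuns` are hypothetical names invented for this illustration, and it assumes the bold/italic flags have already been recovered from the PDF by some other step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a piece of text plus the style recovered for it.
final class StyledRun {
    final String text;
    final boolean bold;
    final boolean italic;

    StyledRun(String text, boolean bold, boolean italic) {
        this.text = text;
        this.bold = bold;
        this.italic = italic;
    }
}

final class HtmlRuns {
    // Escape the characters that are significant in HTML text content,
    // so the extracted text can never be mistaken for markup.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap each run in <b> and/or <i> according to its flags.
    static String toHtml(List<StyledRun> runs) {
        StringBuilder sb = new StringBuilder();
        for (StyledRun r : runs) {
            if (r.bold) sb.append("<b>");
            if (r.italic) sb.append("<i>");
            sb.append(escape(r.text));
            if (r.italic) sb.append("</i>");
            if (r.bold) sb.append("</b>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<StyledRun> runs = new ArrayList<StyledRun>();
        runs.add(new StyledRun("a < b is ", false, false));
        runs.add(new StyledRun("important", true, false));
        System.out.println(toHtml(runs));
    }
}
```

Because the tags are ordinary ASCII markup and the text itself is escaped, any Unicode character in the PDF passes through unchanged, and no word detection is needed: a tag opens and closes wherever the style changes.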
- 11-30-2011, 12:23 PM #4
Re: PDFBox Problem
I'm trying to create a .txt file containing both the text and the font information of a PDF file, so that when I read the .txt file I can tell, for example, which words were bold in the PDF. The initial problem is how to use PDFBox to get font information about the words of a PDF file.
The code I use for the PDF-to-text conversion is:
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

// PDFBox 1.x packages (PDFTextStripper moved to org.apache.pdfbox.text in 2.x)
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from PDF Document
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document whether or not extraction succeeded.
            try {
                if (pdDoc != null) {
                    pdDoc.close(); // also closes the underlying COSDocument
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
            System.out.println("Done.");
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
    }

    // Extracts text from a PDF Document and writes it to a text file
    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
            System.exit(1);
        }
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(args[0]);
        if (pdfToText == null) {
            System.out.println("PDF to Text Conversion failed.");
        } else {
            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
            pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
        }
    }
}
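The code above only returns the plain string from `getText`, so the font of each character is already lost by the time the file is written. In PDFBox, a subclass of `PDFTextStripper` can see each character as a `TextPosition`, which exposes `getFont()` and `getFontSize()`; which method to override and how to read the font's name depend on the PDFBox version, so that part is left out here. What follows is only a sketch of the step after that, under the assumption that each chunk of text has already been paired with its font name. `FontStyleMarkers`, `styleOf`, `annotate`, and the `[B]`/`[I]`/`[BI]`/`[N]` markers are all hypothetical, and guessing the style from the font name is a heuristic, not a PDFBox feature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FontStyleMarkers {

    // Heuristic (not a PDFBox API): guess the style from a font name such as
    // "ABCDEF+Arial-BoldMT" or "TimesNewRoman,Italic".
    static String styleOf(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        // Embedded subset fonts carry a six-letter tag followed by '+'.
        if (name.indexOf('+') == 6) {
            name = name.substring(7);
        }
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) return "BI";
        if (bold) return "B";
        if (italic) return "I";
        return "N";
    }

    // Each element of chunks is {text, fontName}.
    // A marker is written only where the style differs from the previous chunk.
    static String annotate(List<String[]> chunks) {
        StringBuilder out = new StringBuilder();
        String current = null;
        for (String[] chunk : chunks) {
            String style = styleOf(chunk[1]);
            if (!style.equals(current)) {
                out.append('[').append(style).append(']');
                current = style;
            }
            out.append(chunk[0]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String[]> chunks = new ArrayList<String[]>();
        chunks.add(new String[] {"Hello ", "ABCDEF+Arial-BoldMT"});
        chunks.add(new String[] {"big ", "Arial-BoldMT"});
        chunks.add(new String[] {"world", "TimesNewRoman,Italic"});
        chunks.add(new String[] {"!", "Helvetica"});
        System.out.println(annotate(chunks));
    }
}
```

Writing a marker at each style change, rather than in front of each word, avoids having to decide where words begin. Font names are not always reliable indicators of weight or slant, so a real implementation would probably also want to consult the font descriptor when one is present. Character size could be handled the same way by comparing the size of consecutive chunks.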