
Thread: PDFBox Problem

  1. #1
    aliki_arg is offline Member
    Join Date
    Nov 2011
    Posts
    2
    Rep Power
    0

    Default PDFBox Problem

    Hi!

    I'm using PDFBox in order to convert pdf files to txt. It's working, but my task requires keeping some format information such as bold, italics or character size (this information can be kept as a special character in front of these words). Do you have any idea how to do this??

    Thank you!!

  2. #2
    SourCookie is offline Member
    Join Date
    Nov 2011
    Location
    Los Angeles, California
    Posts
    5
    Rep Power
    0

    Default Re: PDFBox Problem

    Are you trying to modify a PDF or are you just losing font information on extracted text? Is the font information embedded? Do you have any samples of your text extraction code or a PDF you're extracting?

  3. #3
    JosAH is online now Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,651
    Blog Entries
    7
    Rep Power
    21

    Default Re: PDFBox Problem

    Quote Originally Posted by aliki_arg View Post
    I'm using PDFBox in order to convert pdf files to txt. It's working, but my task requires keeping some format information such as bold, italics or character size (this information can be kept as a special character in front of these words). Do you have any idea how to do this??
    I wonder what the values of those "special characters" would be if you want to handle Unicode characters gracefully; also, putting those "special values" in front of "words" forces you to recognize "words" in arbitrary text. Why not use HTML for this purpose?
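
    As a rough illustration of the HTML idea above (a sketch, not code from this thread): assume the extracted text has already been split into runs that share one style. The `Run` type and the `StyledRunsToHtml` class are hypothetical helpers, not part of PDFBox. Escaping `&`, `<` and `>` keeps the markup unambiguous for any Unicode text, which in-band "special characters" cannot guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns styled text runs into HTML.
class StyledRunsToHtml {

    // One stretch of text that shares the same style (hypothetical type).
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    // Escape the characters that would otherwise be read as markup.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap each run in <b>/<i> tags according to its style flags.
    static String toHtml(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            if (r.bold) sb.append("<b>");
            if (r.italic) sb.append("<i>");
            sb.append(escape(r.text));
            if (r.italic) sb.append("</i>");
            if (r.bold) sb.append("</b>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<Run>();
        runs.add(new Run("Plain, ", false, false));
        runs.add(new Run("bold", true, false));
        runs.add(new Run(" and ", false, false));
        runs.add(new Run("bold italic", true, true));
        // prints: Plain, <b>bold</b> and <b><i>bold italic</i></b>
        System.out.println(toHtml(runs));
    }
}
```

    Character size could be carried the same way, for example as an attribute on a wrapping element, without having to decide where "words" begin and end.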

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  4. #4
    aliki_arg is offline Member
    Join Date
    Nov 2011
    Posts
    2
    Rep Power
    0

    Default Re: PDFBox Problem

    Quote Originally Posted by SourCookie View Post
    Are you trying to modify a PDF or are you just losing font information on extracted text? Is the font information embedded? Do you have any samples of your text extraction code or a PDF you're extracting?

    I'm trying to create a .txt file containing the text and the font information of a PDF file, so that when I read the .txt file I can tell, for example, which words were bold in the PDF. The initial problem is how to use PDFBox to get font information about the words in a PDF file.

    The code I use for the PDF-to-text conversion is:

    // Imports assume the PDFBox 1.x package layout
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.PrintWriter;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {

        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        // PDFTextParser Constructor
        public PDFTextParser() {
        }

        // Extract text from PDF Document
        String pdftoText(String fileName) {

            System.out.println("Parsing text from PDF file " + fileName + "....");
            File f = new File(fileName);

            if (!f.isFile()) {
                System.out.println("File " + fileName + " does not exist.");
                return null;
            }

            try {
                parser = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                System.out.println("Unable to open PDF Parser.");
                return null;
            }

            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                System.out.println("An exception occurred in parsing the PDF Document.");
                e.printStackTrace();
                return null;
            } finally {
                // Release the document whether or not extraction succeeded
                try {
                    if (pdDoc != null) {
                        pdDoc.close();
                    } else if (cosDoc != null) {
                        cosDoc.close();
                    }
                } catch (Exception e1) {
                    // Report the close failure itself, not the earlier exception
                    e1.printStackTrace();
                }
            }
            System.out.println("Done.");
            return parsedText;
        }

        // Write the parsed text from PDF to a file
        void writeTexttoFile(String pdfText, String fileName) {

            System.out.println("\nWriting PDF text to output text file " + fileName + "....");
            try {
                PrintWriter pw = new PrintWriter(fileName);
                pw.print(pdfText);
                pw.close();
            } catch (Exception e) {
                System.out.println("An exception occurred in writing the pdf text to file.");
                e.printStackTrace();
            }
            System.out.println("Done.");
        }

        // Extracts text from a PDF Document and writes it to a text file
        public static void main(String[] args) {

            if (args.length != 2) {
                System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
                System.exit(1);
            }

            PDFTextParser pdfTextParserObj = new PDFTextParser();
            String pdfToText = pdfTextParserObj.pdftoText(args[0]);

            if (pdfToText == null) {
                System.out.println("PDF to Text Conversion failed.");
            } else {
                System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
            }
        }
    }
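
    One possible direction for the font question, offered as a sketch rather than a tested PDFBox solution: `PDFTextStripper` can be subclassed, and each `TextPosition` it processes exposes `getFont()` and `getFontSizeInPt()`. The accessor for the font's name differs between PDFBox versions, so check the Javadoc for the one in use. Once a font name is available, bold and italic can often (not always) be guessed from the name itself, since names like `Helvetica-Bold` or `ABCDEF+TimesNewRomanPS-BoldItalicMT` encode the style by convention. The `FontNameStyle` class below is a hypothetical helper that works on plain strings and needs no PDFBox at all.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: guesses style from a PDF font name.
class FontNameStyle {

    // Subset fonts carry a tag of six uppercase letters plus '+', e.g. "ABCDEF+Arial".
    // Strip it so the tag's letters cannot be mistaken for a style keyword.
    static String stripSubsetPrefix(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName;
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic only: relies on the naming convention, not on font metrics.
    static boolean looksBold(String fontName) {
        String n = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return n.contains("bold");
    }

    // Heuristic only: "Oblique" is treated like "Italic".
    static boolean looksItalic(String fontName) {
        String n = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    // Short tag such as "B", "I", "BI" or "" that could be attached to a text run.
    static String styleTag(String fontName) {
        StringBuilder sb = new StringBuilder();
        if (looksBold(fontName)) sb.append('B');
        if (looksItalic(fontName)) sb.append('I');
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {
            "Helvetica",
            "Helvetica-Bold",
            "Helvetica-Oblique",
            "ABCDEF+TimesNewRomanPS-BoldItalicMT"
        };
        for (String name : names) {
            System.out.println(name + " -> [" + styleTag(name) + "]");
        }
    }
}
```

    The heuristic fails for fonts whose names carry no style keyword, so the flags and weight in the font descriptor, where the PDF provides them, are worth checking as well.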
