    davidoudou (Member, joined Nov 2011)

    PDFBox 1.6.0: Extract text between 2 bookmarks on the same page - SOS!!

    Hi Guys,

    I have a project where I need to extract the text between bookmarks in a PDF file and then copy that text to a .txt file. Several bookmarks can be on the same page.

    As an example:

    [Attached image: pdfbox.jpg]


    I have code that nearly works (see below), but it fails when several bookmarks are on the same page. My guess is that stripper.setStartBookmark sets the start page and not the start coordinate. Could you please confirm?
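    To make that guess concrete, here is a toy model in plain Java. It makes no PDFBox calls and is not how PDFBox is implemented; the Fragment class, samplePage(), and both selector methods are made up for illustration. It only shows the difference between selecting by page and selecting by position on the page, under the assumption that each piece of text has a page number and a distance from the top of the page.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkRangeModel {

    /** Hypothetical model of a piece of text: page number, distance from page top, content. */
    public static final class Fragment {
        public final int page;
        public final double y;
        public final String text;

        public Fragment(int page, double y, String text) {
            this.page = page;
            this.y = y;
            this.text = text;
        }
    }

    /** Page-level selection: keeps every fragment on pages startPage..endPage. */
    public static List<String> byPage(List<Fragment> all, int startPage, int endPage) {
        List<String> out = new ArrayList<String>();
        for (Fragment f : all) {
            if (f.page >= startPage && f.page <= endPage) {
                out.add(f.text);
            }
        }
        return out;
    }

    /** Position-level selection: keeps fragments from (startPage, startY) up to, but not including, (endPage, endY). */
    public static List<String> byPosition(List<Fragment> all,
                                          int startPage, double startY,
                                          int endPage, double endY) {
        List<String> out = new ArrayList<String>();
        for (Fragment f : all) {
            boolean afterStart = f.page > startPage || (f.page == startPage && f.y >= startY);
            boolean beforeEnd = f.page < endPage || (f.page == endPage && f.y < endY);
            if (afterStart && beforeEnd) {
                out.add(f.text);
            }
        }
        return out;
    }

    /** One page holding three bookmarked sections, like the situation in my PDF. */
    public static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<Fragment>();
        page.add(new Fragment(1, 100, "Intro text"));
        page.add(new Fragment(1, 300, "Section A text"));
        page.add(new Fragment(1, 500, "Section B text"));
        return page;
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        // Both bookmarks are on page 1, so the page-level range covers the whole page.
        System.out.println(byPage(page, 1, 1));
        // With coordinates, only the text between the two bookmark positions is kept.
        System.out.println(byPosition(page, 1, 300, 1, 500));
    }
}
```

    If the stripper behaves like byPage, two bookmarks on the same page would give me the whole page, which matches what I see. What I need is something like byPosition.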

    Is there an easy way to solve this problem?

    Is it possible to get the coordinates of my bookmarks, define a rectangle from them, and then use PDFTextStripperByArea?
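    Here is a rough sketch of the rectangle I have in mind, using only the JDK. The class name BookmarkRegion and the method regionBetween are hypothetical. I am assuming that a bookmark's destination gives a "top" value in PDF user space (origin at the bottom-left of the page, which I believe is what PDPageXYZDestination.getTop() returns) and that PDFTextStripperByArea.addRegion(String, Rectangle2D) expects a rectangle measured from the top-left, so the y value has to be flipped. I have not verified either assumption.

```java
import java.awt.geom.Rectangle2D;

public class BookmarkRegion {

    /**
     * Builds the full-width rectangle between two bookmark positions on the same page.
     * startTop and endTop are y values in PDF user space (origin bottom-left);
     * the returned rectangle uses a top-left origin.
     */
    public static Rectangle2D regionBetween(double pageWidth, double pageHeight,
                                            double startTop, double endTop) {
        if (endTop > startTop) {
            throw new IllegalArgumentException("end bookmark must be below start bookmark");
        }
        double y = pageHeight - startTop;   // flip to a top-left origin
        double height = startTop - endTop;  // distance between the two bookmarks
        return new Rectangle2D.Double(0, y, pageWidth, height);
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 points), bookmarks at top=700 and top=500.
        Rectangle2D region = regionBetween(612, 792, 700, 500);
        System.out.println(region.getY() + " " + region.getHeight());
    }
}
```

    This would only work for bookmarks whose destination actually carries a top coordinate. A bookmark that just points at a page (fit-to-page, for example) has no position to build the rectangle from.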

    If PDFBox is not the easiest way to do this, I don't mind changing libraries. What about iText? Something else?


    Here is my code:

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
    import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class ExtractText {
        public static void main(String[] args) throws Throwable {
            PDDocument doc = PDDocument.load("C:/sample.pdf");
            try {
                PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
                if (root == null) {
                    System.out.println("Document has no bookmarks");
                    return;
                }
                PDOutlineItem item = root.getFirstChild();
                int i = 0;
                // Walk the top-level bookmarks, then the direct children of each one.
                while (item != null) {
                    System.out.println("Item:" + item.getTitle());
                    PDOutlineItem child = item.getFirstChild();
                    while (child != null) {
                        System.out.println(" Child:" + i + " Title: " + child.getTitle());
                        System.out.println(child);

                        // Extract the text between this bookmark and its next sibling.
                        if (child.getNextSibling() != null) {
                            File output = new File("C:/sampleExtract" + i + ".txt");
                            BufferedWriter wr = new BufferedWriter(
                                    new OutputStreamWriter(new FileOutputStream(output)));
                            try {
                                PDFTextStripper stripper = new PDFTextStripper();
                                stripper.setStartBookmark(child);
                                stripper.setEndBookmark(child.getNextSibling());
                                stripper.writeText(doc, wr);
                            } finally {
                                wr.close();
                            }
                        }
                        child = child.getNextSibling();
                        i++;
                    }
                    item = item.getNextSibling();
                }
            } finally {
                // Release the PDF even if extraction fails.
                doc.close();
            }
        }
    }
    Last edited by davidoudou; 11-10-2011 at 06:22 PM.
