- 11-10-2011, 06:18 PM #1
PDFBox 1.6.0: Extract text between 2 bookmarks on the same page - SOS!!
Hi Guys,
I have a project where I need to extract the text between bookmarks in a PDF file and copy that text to a .txt file. Several bookmarks can be on the same page.
As an example:

I have code that nearly works (see below), but it fails when several bookmarks are on the same page. My guess is that stripper.setStartBookmark sets only the start page, not the start coordinate. Could you please confirm?
Is there an easy way to solve this problem?
Is it possible to get the coordinates of my bookmarks, define a rectangle from them, and use PDFTextStripperByArea?
If PDFBox is not the easiest way to do this, I don't mind changing. What about iText? Something else?
Here is my code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.util.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        PDDocument doc = PDDocument.load("C:/sample.pdf");
        try {
            PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            int i = 0;
            // Walk the top-level bookmarks
            while (item != null) {
                System.out.println("Item:" + item.getTitle());
                PDOutlineItem child = item.getFirstChild();
                // Walk the second-level bookmarks under this item
                while (child != null) {
                    System.out.println(" Child:" + i + "Title: " + child.getTitle());
                    System.out.println(child);
                    if (child.getNextSibling() != null) {
                        // One output file per pair of consecutive bookmarks
                        File output = new File("C:/sampleExtract" + i + ".txt");
                        BufferedWriter wr = new BufferedWriter(
                                new OutputStreamWriter(new FileOutputStream(output)));
                        try {
                            PDFTextStripper stripper = new PDFTextStripper();
                            stripper.setStartBookmark(child);
                            stripper.setEndBookmark(child.getNextSibling());
                            stripper.writeText(doc, wr);
                        } finally {
                            wr.close();
                        }
                    }
                    child = child.getNextSibling();
                    i++;
                }
                item = item.getNextSibling();
            }
        } finally {
            // Release the PDF file handle even if extraction fails
            doc.close();
        }
    }
}
Last edited by davidoudou; 11-10-2011 at 06:22 PM.
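To make the rectangle idea from the question concrete, here is a minimal sketch of the coordinate arithmetic only. It rests on two assumptions that are not confirmed anywhere in this thread: that each bookmark resolves to a destination exposing a "top" value in PDF user space (measured up from the bottom edge of the page, as PDPageXYZDestination.getTop() would give), and that PDFTextStripperByArea expects regions measured from the top-left corner with y growing downward. The class BookmarkRegion and its method between are hypothetical names, not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of PDFBox.
class BookmarkRegion {

    /**
     * Builds the full-width rectangle lying between two bookmarks on one page.
     * startTop and endTop are distances from the BOTTOM edge of the page
     * (PDF user space). The returned rectangle is measured from the TOP-LEFT
     * corner with y growing downward (assumed convention for area extraction).
     */
    static Rectangle2D.Float between(float pageWidth, float pageHeight,
                                     float startTop, float endTop) {
        if (endTop > startTop) {
            throw new IllegalArgumentException(
                    "end bookmark must not be above the start bookmark");
        }
        float y = pageHeight - startTop;   // flip the vertical axis
        float height = startTop - endTop;  // vertical gap between the bookmarks
        return new Rectangle2D.Float(0f, y, pageWidth, height);
    }

    public static void main(String[] args) {
        // 612 x 792 points is a US Letter page; the bookmark positions are made up.
        Rectangle2D.Float region = between(612f, 792f, 700f, 500f);
        System.out.println(region);
    }
}
```

If those assumptions hold, the resulting rectangle could be registered with PDFTextStripperByArea.addRegion, and the text read back with getTextForRegion after calling extractRegions on the page. Bookmarks whose destination carries no top coordinate would need a different approach.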