  1. #1
    user101 is offline Member

    Extract text

    Hi,

    I've been trying to write a program that extracts text from PDFs.

    Any advice?
    Last edited by user101; 09-09-2010 at 01:22 PM.

  2. #2
    markee174 is offline Member

    Text extraction

    How is the information stored in the file exactly?

    It could be either in the document metadata or in the actual page text. I wrote the JPedal text extraction, so I'm happy to help if you can supply some more details.

    Text extraction is a slightly complicated issue. There is an article on the JPedal blog explaining why: "PDF format and style information" (Java PDF Blog).

    If you want to use JPedal for text extraction, there are a number of tutorials under "Java PDF Extraction Tutorials" (Java PDF Library Tutorial). If you need any help, please post on the forums and I will try to help you further (https://idrsolutions.fogbugz.com/default.asp?support).

    You can also try the PDFBox forums if you want to use PDFBox. I've always found them a friendly, helpful bunch.

  3. #3
    markee174 is offline Member

    If you want to extract text with JPedal, there is a tutorial showing how to use the built-in example (with a link to the source, so you can study or change it) at "PDF to text conversion" (Java PDF Library Tutorial). You can also find text on the page, or extract the page as a set of words and their locations for indexing; see "pdf to text as wordlist" (Java PDF Library Tutorial).

    PDFBox has a whole load of tutorials on the Apache PDFBox site (Apache PDFBox - Java PDF Library).

    The tricky bit is how to isolate the information on the page. Is it always in the same place, or in a certain format? You could look at the XML tagging or at the content around it. Otherwise I cannot see how you can automate the process.
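
    To illustrate the "content around it" idea, here is a minimal sketch that assumes the page text has already been extracted into a plain String by whichever library you pick, and that the value you want sits on the same line as a fixed label. The class and method names (LabelExtractor, valueAfterLabel) and the sample invoice text are hypothetical and are not part of JPedal or PDFBox.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of JPedal or PDFBox.
final class LabelExtractor {

    /**
     * Returns the text that follows "label:" on the same line of the
     * already-extracted page text, or an empty Optional if no line starts
     * with that label.
     */
    static Optional<String> valueAfterLabel(String pageText, String label) {
        // (?m) makes ^ and $ match at line boundaries; the label is quoted
        // so that characters such as '.' in it are matched literally.
        Pattern pattern = Pattern.compile(
                "(?m)^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
        Matcher matcher = pattern.matcher(pageText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from a PDF page.
        String pageText = "Invoice\nInvoice No: 12345\nTotal: 99.50 EUR\n";
        System.out.println(valueAfterLabel(pageText, "Invoice No").orElse("(not found)"));
        System.out.println(valueAfterLabel(pageText, "Total").orElse("(not found)"));
    }
}
```

    This only works when the extracted text keeps the label and its value on one line, which depends on how the PDF was produced. If it does not, a position-based approach is likely the better fit.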

  4. #4
    markee174 is offline Member

    If you run JPedal Viewer (java -jar jpedal.jar) or double-click on it, you can use it to get the screen co-ordinates (they are displayed bottom left if you move the cursor). You can then extract the text from that zone.
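
    As a rough sketch of what extracting "from that zone" amounts to, the code below filters a list of words with bounding boxes down to those inside a rectangle. It does not use the JPedal API: the Word type, the textInZone method and the sample coordinates are all hypothetical, and it assumes coordinates with the origin at the bottom left and y increasing upwards. Check how your library reports word positions before relying on that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a stand-in for a word list with locations,
// not the data structure any particular PDF library returns.
final class ZoneFilter {

    /** A word and its bounding box (left, bottom, right, top). */
    static final class Word {
        final String text;
        final double x1, y1, x2, y2;

        Word(String text, double x1, double y1, double x2, double y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Returns the words whose boxes lie entirely inside the zone,
     * joined by single spaces in the order they appear in the list.
     */
    static String textInZone(List<Word> words,
                             double left, double bottom, double right, double top) {
        List<String> hits = new ArrayList<>();
        for (Word w : words) {
            boolean inside = w.x1 >= left && w.x2 <= right
                    && w.y1 >= bottom && w.y2 <= top;
            if (inside) {
                hits.add(w.text);
            }
        }
        return String.join(" ", hits);
    }

    public static void main(String[] args) {
        // Made-up words and coordinates for illustration only.
        List<Word> words = Arrays.asList(
                new Word("Total:", 50, 100, 90, 112),
                new Word("99.50", 95, 100, 130, 112),
                new Word("Footer", 50, 20, 90, 32));
        System.out.println(textInZone(words, 40, 90, 140, 120));
    }
}
```

    Requiring the whole box to be inside the zone is a design choice. Testing for overlap instead would also pick up words that straddle the edge of the rectangle.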
