
Thread: read pdf

  1. #1
    j2me64 is offline Senior Member
    Join Date
    Sep 2009
    Location
    Zurich, Switzerland
    Posts
    962
    Rep Power
    5

    Default read pdf

    I want to read a PDF file and print only its text to the console. How can I accomplish this? Perhaps somebody can give a hint, even with pdfbox-1.1.0.jar?

  2. #2
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,352
    Blog Entries
    7
    Rep Power
    20

    Default

    Quote Originally Posted by j2me64 View Post
    i want to read a pdf file and output only the text in a console. how to accomplish this? perhaps somebody can give a hint, even with pdfbox-1.1.0.jar?
    There are some open source libraries available that can read .pdf files and extract the text portions from them; Google is your friend here: searching for "Java pdf read" shows the results.

    kind regards,

    Jos

  3. #3
    j2me64 is offline Senior Member
    Join Date
    Sep 2009
    Location
    Zurich, Switzerland
    Posts
    962
    Rep Power
    5

    Default

    Quote Originally Posted by JosAH View Post
    There are some open source libraries available that can read .pdf files and extract the text portions from it; google is your friend here: "Java pdf read" shows the results.

    I couldn't find anything helpful with Google. What I'm looking for is an example written in Java.

  4. #4
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    11,817
    Rep Power
    19

    Default

    According to the PDFBox documentation (well, the first page), it can extract text and has a command-line tool for that.
    Since the source code is available, why not look at the source of ExtractText?
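
    For reference, a minimal sketch of what such an extraction can look like in your own code, assuming the PDFBox 1.x API (`PDDocument.load` and `org.apache.pdfbox.util.PDFTextStripper`; in PDFBox 2.x the stripper lives in `org.apache.pdfbox.text` instead). The class name `PdfToConsole` is made up for illustration, and this is not the actual ExtractText source. PDFBox and its own dependencies (commons-logging among them) have to be on the classpath, and encrypted documents are not handled here.

    ```java
    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    /** Prints the text content of a PDF to the console (sketch, PDFBox 1.x API assumed). */
    final class PdfToConsole {

        /** Loads the PDF, extracts its text, and always closes the document. */
        static String extractText(File pdf) throws IOException {
            // Throws an IOException if the file is missing or cannot be parsed
            PDDocument document = PDDocument.load(pdf);
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(document);
            } finally {
                // The document holds file resources, so release them explicitly
                document.close();
            }
        }

        public static void main(String[] args) throws IOException {
            if (args.length != 1) {
                System.err.println("usage: java PdfToConsole <file.pdf>");
                return;
            }
            System.out.print(extractText(new File(args[0])));
        }
    }
    ```

    Whether the extracted text is readable depends on how the fonts in the PDF are encoded, so the output is worth checking on a real document.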

  5. #5
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,352
    Blog Entries
    7
    Rep Power
    20

    Default

    Quote Originally Posted by j2me64 View Post
    I couldn't find anything helpful with Google. What I'm looking for is an example written in Java.
    How strange because when I google for "Java pdf read" and follow the second link I get this page. It has a fine manual stuffed with easy to read Java examples ...

    kind regards,

    Jos

  6. #6
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    11,817
    Rep Power
    19

    Default

    To be fair, that's not exactly a free thing (outside of the trial).

  7. #7
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,352
    Blog Entries
    7
    Rep Power
    20

    Default

    Quote Originally Posted by Tolls View Post
    To be fair, that's not exactly a free thing (outside of the trial).
    Duh, I didn't even notice that, but I'm sure there are a lot more resources available (for free) because the .pdf file format is soooo common.

    kind regards,

    Jos

  8. #8
    Tolls is offline Moderator
    Join Date
    Apr 2009
    Posts
    11,817
    Rep Power
    19

    Default

    As I say, the Apache thing has a tool for extracting text, so I would argue it's just a case of opening up that source code and seeing what they do.

  9. #9
    j2me64 is offline Senior Member
    Join Date
    Sep 2009
    Location
    Zurich, Switzerland
    Posts
    962
    Rep Power
    5

    Default

    Quote Originally Posted by JosAH View Post
    How strange because when I google for "Java pdf read" and follow the second link I get this page. It has a fine manual stuffed with easy to read Java examples ...

    I've tried out some jars Google suggested and I could extract the text from a PDF, but it looked like this:

    I1enti1iers 1nd Ke1word1
    Al1 the J1va co1pone1ts we 1ust t1lked 1bout1clas1es, v1riab1es, a1d met1odsó1
    nee1 name1. In J1va th1se na1es ar1 call1d ide1tifi1rs, a1d, as 1ou mi1ht ex1ect,1
    the1e are 1ules 1or wh1t con1titu1es a l1gal J1va id1ntif1er. B1yond 1hat'1 lega1,

    Unusable. I was not in the mood for trying all the jars Google suggested. But now AspriseJavaPDF.jar works great and the output is very clean! Thank you again.


    the only drawback is this:

    ****** There are more text found, however this evaluation version only allows you to extract max. 1000 chars per page. Simply purchase a license to remove this restriction. Asprise Java PDF Library ******

    and the price of the Java PDF Reader for a single developer license is USD 998.00
    Last edited by j2me64; 06-16-2010 at 10:15 PM.

  10. #10
    Webuser is offline Senior Member
    Join Date
    Dec 2008
    Posts
    526
    Rep Power
    0

    Lightbulb

    Quote Originally Posted by j2me64 View Post
    i want to read a pdf file and output only the text in a console. how to accomplish this? perhaps somebody can give a hint, even with pdfbox-1.1.0.jar?
    hello
    I recommend this lib Apache PDFBox - Apache PDFBox - Java PDF Library
    If my answer helped you. Please click my "REP" button and add a comment
    Have a Good Java Coding :)

  11. #11
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,352
    Blog Entries
    7
    Rep Power
    20

    Default

    Quote Originally Posted by Webuser View Post
    PDFBox was already recommended in this thread (reply #4) so your reply is of no use and redundant; no rep points for you no matter how much you beg for it.

    Jos
    Last edited by JosAH; 06-17-2010 at 07:40 AM.

  12. #12
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,352
    Blog Entries
    7
    Rep Power
    20

    Default

    Quote Originally Posted by j2me64 View Post
    I've tried out some jars Google suggested and I could extract the text from a PDF, but it looked like this:
    I1enti1iers 1nd Ke1word1
    Al1 the J1va co1pone1ts we 1ust t1lked 1bout1clas1es, v1riab1es, a1d met1odsó1
    nee1 name1. In J1va th1se na1es ar1 call1d ide1tifi1rs, a1d, as 1ou mi1ht ex1ect,1
    the1e are 1ules 1or wh1t con1titu1es a l1gal J1va id1ntif1er. B1yond 1hat'1 lega1,
    Unusable. I was not in the mood for trying all the jars Google suggested. But now AspriseJavaPDF.jar works great and the output is very clean! Thank you again.
    How strange; I don't know anything about the PDF format, but I thought that text was simply represented by text (sprinkled with format information). It seems that if a character isn't recognized, a '1' is printed ...

    kind regards,

    Jos

  13. #13
    j2me64 is offline Senior Member
    Join Date
    Sep 2009
    Location
    Zurich, Switzerland
    Posts
    962
    Rep Power
    5

    Default

    Quote Originally Posted by Webuser View Post

    When I run PDFBox with java org.apache.pdfbox.ExtractText, everything goes fine and I get the usage text on my console. As soon as I give it a PDF document to process, I get this:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
        at org.apache.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:58)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:865)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:831)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:756)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:179)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at sun.misc.Launcher$ExtClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 5 more

    1) commons-logging-api-1.1.1.jar is in the classpath, and there is an org.apache.commons.logging.LogFactory class in it.
    2) I don't understand what "AccessController.doPrivileged" means.

    Can somebody help?

  14. #14
    r035198x is offline Senior Member
    Join Date
    Aug 2009
    Posts
    2,388
    Rep Power
    7

    Default

    The logging jar with the LogFactory class is not on the classpath.
    What makes you sure that it is there?
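
    One way to check this rather than assume it is to ask the JVM directly. The following is a minimal sketch using only the standard library; the class name `ClassLocator` is made up for illustration. It prints the effective classpath and, for each class name, either the location it was loaded from or the error that prevented loading. Run it with exactly the same classpath settings as the failing ExtractText command.

    ```java
    import java.security.CodeSource;

    /** Reports whether a class can be loaded and, if so, where it came from. */
    final class ClassLocator {

        /** Returns a one-line description of where the named class was found. */
        static String describe(String className) {
            try {
                // initialize=false: locate the class without running its static initializers
                Class<?> c = Class.forName(className, false, ClassLocator.class.getClassLoader());
                CodeSource src = c.getProtectionDomain().getCodeSource();
                // Classes that belong to the JDK itself usually report no code source
                String location = (src == null || src.getLocation() == null)
                        ? "the JDK runtime"
                        : src.getLocation().toString();
                return className + " loaded from " + location;
            } catch (ClassNotFoundException | LinkageError e) {
                return className + " NOT found: " + e;
            }
        }

        public static void main(String[] args) {
            // Default to the two classes involved in the stack trace above
            String[] names = args.length > 0
                    ? args
                    : new String[] {
                        "org.apache.commons.logging.LogFactory",
                        "org.apache.pdfbox.pdmodel.PDDocument"
                    };
            System.out.println("java.class.path = " + System.getProperty("java.class.path"));
            for (String name : names) {
                System.out.println(describe(name));
            }
        }
    }
    ```

    A hedged reading of the trace: the failing lookup goes through `sun.misc.Launcher$ExtClassLoader`, which may mean the PDFBox jar was placed in the JRE's `lib/ext` directory. Classes loaded from there cannot see jars on the ordinary classpath, so the logging jar would have to sit next to it (or, better, both jars would go on the normal classpath). The `AccessController.doPrivileged` frame is just part of how the class loader searches for the class and is not the cause of the error.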
