  1. #1
    chamnan is offline Member

    Question: How to convert an HTML file to text

    Hello everyone! I am a third-year computer science student and I am new to Java programming. My teacher gave me this exercise:

    Write a Java program to convert an HTML file to a text file:
    - Extract all the text, and only the text, from the HTML file.
    - Provide at least two different methods to do the conversion, and tell me which one you prefer and why.

    I tried to find a solution to this problem but found very little. So, please help me!

    Thank you,
    Chamnan

  2. #2
    RamyaSivakanth is offline Senior Member

    Create a method for each kind of tag section, extract the content between the tags, and append it to a StringBuffer. Finally, write the result to a file.

    For any problem, first write pseudocode describing exactly what you are going to do. Then start writing the program with explanatory comments, so that the code is easier both to write and to debug.
    Ramya

  3. #3
    chamnan is offline Member

    Thank you!
    Could you give me some code, please?

  4. #4
    RamyaSivakanth is offline Senior Member

    Start writing the code yourself; we will help you if you run into an error or a logic problem.
    Ramya

  5. #5
    Tolls is online now Moderator

    You're a third-year computer science student... you should be able to write this yourself, new to Java or not.

  6. #6
    neilcoffey is offline Senior Member

    I don't know how much Java programming you've done so far, but here are some pointers to some possible ways to do it:

    (1) Use the DOM (Document Object Model) framework. Essentially, you have some code that starts like this:

    Java Code:
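    import java.io.File;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    // "input.html" is a placeholder path; substitute your own file.
    File htmlFile = new File("input.html");
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    // parse() builds the whole document tree in memory
    Document doc = db.parse(htmlFile);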
    This builds an in-memory representation of the HTML document as a hierarchy of objects, which you can then access via "doc" and its children. Note that DocumentBuilder is an XML parser, so this only works when the HTML is also well-formed XML (i.e. XHTML). A web search for "Java DocumentBuilder" or similar should turn up some pointers.
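
    As a rough sketch of where approach (1) could go from there, assuming the input is well-formed XHTML: walk the tree depth-first, collect the text nodes, and skip script and style elements. The class name DomTextExtractor and the method extractText are made up for illustration, and the input is taken as a String rather than a file to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class DomTextExtractor {

    /** Parses well-formed XHTML and returns the content of its text nodes. */
    static String extractText(String xhtml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            // Collapse runs of whitespace into single spaces.
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Depth-first walk: appends text nodes, skips script and style elements. */
    private static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // A space after each text node keeps neighbouring blocks apart;
            // the price is that a word split by an inline tag gets split too.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        String name = node.getNodeName();
        if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p>Plain <b>text</b> only.</p></body></html>";
        System.out.println(extractText(page)); // prints: Demo Hello Plain text only.
    }
}
```

    Typical real-world HTML (unclosed tags, unquoted attributes) will make the parser throw, which is the main weakness of this method.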

    (2) Use regular expressions. If you've not come across regular expressions, they're a way to say "pick out everything in this text (= your HTML) that matches this pattern (= one that describes the tags you're interested in)". For some purposes, e.g. picking out all the links in a document, this is a nice "quick and nasty" way of doing it when you need something that's quick to code and you don't mind it not catering for every single possibility. Disadvantage: it's hard to make it foolproof (e.g. to tell the difference between a real link and, say, some text in embedded JavaScript that looks like a link).

    As part of a tutorial that I wrote a while ago, I actually included an example of scraping HTML with regular expressions. However, notice that the idea here is to pick out certain items from the document only: that may or may not be what you had in mind.
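
    For the exercise in this thread (strip everything except the text), a regular-expression version might look like the sketch below. It is not the tutorial example mentioned above; the class name RegexTextExtractor and its patterns are invented for illustration, and only a handful of entities are decoded.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexTextExtractor {

    // <script>...</script> and <style>...</style>, including their content.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // HTML comments, which may span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Any remaining tag. Breaks if an attribute value contains '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String extractText(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Decode a few common entities; &amp; goes last so that
        // "&amp;lt;" ends up as "&lt;" rather than "<".
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&amp;", "&");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<p>Tom &amp; Jerry</p><script>if (a < b) {}</script>"
                + "<!-- note --><a href=\"x\">link</a>";
        System.out.println(extractText(page)); // prints: Tom & Jerry link
    }
}
```

    Replacing each tag with a space keeps words in neighbouring elements apart, but it also inserts a space where an inline tag sits in the middle of a word.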

    (3) You could "hand crank" some code that goes through the document character by character and does whatever is needed. This is the most flexible approach, but probably the one that will take longest to code and debug.
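
    A minimal sketch of such a hand-cranked scanner, assuming a simple two-state machine (inside a tag or outside one). The class name, the BLOCK_TAGS set, and the choice to emit a line break after block-level tags are all illustrative; Set.of needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class HandRolledTextExtractor {

    // Tags that produce a line break in the output; an illustrative subset.
    private static final Set<String> BLOCK_TAGS =
            Set.of("p", "div", "br", "li", "tr", "h1", "h2", "h3", "title");

    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        StringBuilder tag = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    inTag = true;       // start collecting the tag's content
                    tag.setLength(0);
                } else {
                    out.append(c);      // ordinary text: copy it through
                }
            } else if (c == '>') {
                inTag = false;
                if (BLOCK_TAGS.contains(tagName(tag))) {
                    out.append('\n');
                }
            } else {
                tag.append(c);
            }
        }
        // Collapse spaces and tabs, then collapse blank lines.
        return out.toString()
                .replaceAll("[ \\t]+", " ")
                .replaceAll("\\s*\\n\\s*", "\n")
                .trim();
    }

    /** Returns the lower-cased element name from e.g. "/P" or "a href=..." */
    private static String tagName(CharSequence tag) {
        int start = (tag.length() > 0 && tag.charAt(0) == '/') ? 1 : 0;
        int end = start;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.subSequence(start, end).toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints "Title" and "Some text." on separate lines.
        System.out.println(extractText("<h1>Title</h1><p>Some <i>text</i>.</p>"));
    }
}
```

    This sketch deliberately ignores comments, entities, script and style content, and '>' inside attribute values; handling each of those means adding more states, which is where the coding and debugging time goes.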

    (4) Within the bowels of Swing, there is actually an HTML parser. Do some web searching...
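
    One way to reach that parser is through javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback that overrides handleText. The sketch below assumes that route; the class name SwingTextExtractor is made up, and which pieces of the document the parser reports as text (title or style content, for instance) is worth checking against your own input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class SwingTextExtractor {

    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        // The parser calls handleText for each run of text it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        try {
            // The third argument (true) tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><h1>Hello</h1><p>Plain <b>text</b> only.</p></body></html>"));
    }
}
```

    Unlike the XML-based DOM approach, this parser is built for HTML and tolerates markup that is not well-formed.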
    Last edited by neilcoffey; 09-08-2010 at 01:06 PM. Reason: Changed "xml" to "html"
