  1. #1
    Daedalus is offline Member
    Join Date
    Sep 2008
    Posts
    14
    Rep Power
    0

    RegExp to remove tag from html file with exceptions

    I have an HTML file and I need to remove almost all of the tags from it, leaving only plain text. That means stripping HTML markup, CSS, JavaScript, etc.

    There are many easy ways to strip ALL the tags but I need to leave some behind.

    The ones I need to keep are <p> <th> <td> <tr> <li> <h2> <h3> <h4> and their end tags </p> , </h2> etc.

    I'm new to Regular Expressions and this is what I have so far, but it is not working:

    text.replaceAll("\\<(?!p|th|tr|td|h2|h3|h4|li).*?\\>","");

    Any help would be most welcome

    Thanks

  2. #2
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,458
    Rep Power
    25


    Write a simple test program that you can post here for us to work on. Use a string with one or two tags to be stripped, and println it before and after the replacement.

  3. #3
    Daedalus is offline Member

    Java Code:
    public class Boot {

        public static void main(String[] args) {
            String str = "<span>blah</span><br><h2>title</h2><p>sometext</p><a href='ww'></a><!--meta --><tr><th>title</th></tr><tr><td>acell</td><td>acell</td></tr><li>item1</li><li>item2></li><script type='dd'></script>";
            System.out.println(str.replaceAll("\\</*?(p|th|tr|td|h2|h3|h4|li).+?\\>", ""));
        }
    }
    Prints> "<span>blah</span><br><a href='ww'></a><!--meta -->titleacellacell</script>"

    The RegExp above now does exactly the opposite of what I want: it removes the tags I want to keep and leaves the rest.

    I've hit a roadblock, as I need a NOT sign in front of the (p|th|tr|td|h2|h3|h4|li), but as far as I can tell there is no such thing.

    I can't list every single tag I don't want instead, because there are hundreds of them and pages can contain custom tags anyway, with XML for example.

    Sigh*

  4. #4
    Norm is offline Moderator

    I'd suggest that you start with a small case, get it to work, and expand it as you figure out how it works. You do NOT have to list all the possible tags you want to remove. If you can figure out how to preserve one tag and then two tags, you'll be able to preserve as many as you want.
    Inside a character class, the negation sign is ^

    Here's a start:
    Java Code:
          String data = "<P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL><LI>first item</LI></P>";
    
          System.out.println("after1=" + data.replaceAll("<.*?>", ""));  // removes ALL tags
          //after1=This isa sample.Headernew parafirst item
          System.out.println("after2=" + data.replaceAll("<[^P].*?>", "")); // leaves the <P>
          //after2=<P>This isa sample.Header<P>new parafirst item
    So now figure out how to leave more than the single P.
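
    One possible way to leave more than the single P, as a minimal sketch rather than a definitive answer: `[^P]` only negates a single character, so it would also keep a tag such as <PRE> and it strips </P>. A negative lookahead `(?!...)` can reject a whole tag name instead. The class name `KeepSomeTags`, the `KEEP` constant and the `strip` method below are made up for illustration, and the pattern assumes that no tag or comment contains a `>` before its real end.

```java
import java.util.regex.Pattern;

public class KeepSomeTags {

    // Hypothetical whitelist, taken from the tags named in the first post.
    private static final String KEEP = "p|th|td|tr|li|h2|h3|h4";

    // "<", then a negative lookahead that rejects an optional "/" followed by
    // a whitelisted name ending at a word boundary, then everything up to ">".
    // The word boundary stops "p" from also protecting "pre" or "param".
    private static final Pattern STRIP = Pattern.compile(
            "<(?!/?(?:" + KEEP + ")\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        return STRIP.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<span>blah</span><br><h2>title</h2><p>sometext</p>"
                + "<a href='ww'></a><tr><th>title</th></tr><pre>x</pre>";
        System.out.println("before=" + str);
        System.out.println("after=" + strip(str));
        // after=blah<h2>title</h2><p>sometext</p><tr><th>title</th></tr>x
    }
}
```

    This only removes the tags themselves. Any text between removed tags, such as the body of a script element, is left behind, and a regex is not a full HTML parser, so badly formed markup may still slip through.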
    Last edited by Norm; 09-27-2008 at 02:45 PM. Reason: Changed replace to leave
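
    The first post also asks for CSS and JavaScript to go. Stripping tags alone would leave the text inside script and style elements in the output, so a separate pass may be needed before the tags are removed. The following is a rough sketch under that assumption; the class name `DropScriptAndStyle` and the method `dropBlocks` are invented for this example, and it assumes that script and style blocks are properly closed and that comments do not nest.

```java
import java.util.regex.Pattern;

public class DropScriptAndStyle {

    // Matches a whole <script>...</script> or <style>...</style> element,
    // including its contents, or an HTML comment. DOTALL lets "." cross
    // line breaks, and \1 requires the closing name to match the opening one.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>|<!--.*?-->",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String dropBlocks(String html) {
        return BLOCKS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<p>a</p><script type='dd'>var x = 1 < 2;</script>"
                + "<style>p{}</style><!--meta -->b";
        System.out.println("before=" + str);
        System.out.println("after=" + dropBlocks(str));
        // after=<p>a</p>b
    }
}
```

    Running this pass first and the tag-stripping replaceAll second keeps the two concerns apart, which makes each pattern easier to test on small strings.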
