  1. #1
    masterrs.mind is offline Member
    Join Date
    Feb 2010
    Posts
    20
    Rep Power
    0

    Help with Regex to get only <td></td> and the text within it in a <table> tag

    Hi All,

    I am trying to make some changes to HTML content as part of a migration.
    I was told to use regular expressions to achieve it. I have a requirement where I need to get the <td></td> within the <table>.
    Here is the sample text:

    Java Code:
    <table summary="This table defines a format the inner content section of page" cellpadding="0" cellspacing="0" border="0" width="602">
    <tbody>
    <tr>
    <td width="534">
    <table summary="This table defines a format of page header" cellpadding="0" cellspacing="0" border="0" width="602">
    <tbody>
    <tr>
    <td>WELCOME TO MAIN PAGE</td>
    </tr>
    <tr>
    <td background><img border="0" height="1" width="200"
    src=""/></td>
    </tr>
    </tbody>
    </table>
    I am looking for a regex that matches only the <td></td> elements that contain plain text,
    i.e. <td>WELCOME TO MAIN PAGE</td>,
    and I need to replace that <td></td> with <h1>WELCOME TO MAIN PAGE</h1>,
    so my output must look like:
    <h1>WELCOME TO MAIN PAGE</h1>
    I have come up with a regex like this:


    String contents = stringBuffer.toString();
    java.util.regex.Pattern pattern = Pattern.compile("<table(.|\n)*?/table>");
    Matcher m = pattern.matcher(contents);
    StringBuffer sb = new StringBuffer();
    boolean result = m.find();
    while(result) {
    m.appendReplacement(sb, ip);
    result = m.find();
    }
    m.appendTail(sb);
    String requiredOutput = sb.toString();
    but this is stripping everything within the <table> tag.
    Can someone give me the regex?

    Thanks,
    Ramya.
    Last edited by Fubarable; 02-27-2010 at 07:39 PM. Reason: code tags added

  2. #2
    travishein's Avatar
    travishein is offline Senior Member
    Join Date
    Sep 2009
    Location
    Canada
    Posts
    684
    Rep Power
    6


    I found that this regular expression seems to work:

    ".*<td>([^<]*)</td>.*"

    It matches the <td> tag and its content up to the next closing </td> tag.

    For example,
    Java Code:
      public  String replaceTd(String content) {
        String result = null;
        String regex = ".*<td>([^<]*)</td>.*";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        if (m.matches()) {
          String tdValue = m.group(1);
          System.out.println("content:" + tdValue);
          String searchString = "<td>" + tdValue + "</td>";
          String replaceString = "<h1>" + tdValue + "</h1>";
          result = content.replaceAll(searchString, replaceString);
        }
        return result;
      }

  3. #3
    CodesAway's Avatar
    CodesAway is offline Senior Member
    Join Date
    Sep 2009
    Location
    Texas
    Posts
    238
    Rep Power
    6


    Quote Originally Posted by travishein View Post
    I found this regular expression seems to work.

    ".*<td>([^<]*)</td>.*"

    where it matches the td tag and content up to the next closing td tag.

    For example,
    Java Code:
      public  String replaceTd(String content) {
        String result = null;
        String regex = ".*<td>([^<]*)</td>.*";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        if (m.matches()) {
          String tdValue = m.group(1);
          System.out.println("content:" + tdValue);
          String searchString = "<td>" + tdValue + "</td>";
          String replaceString = "<h1>" + tdValue + "</h1>";
          result = content.replaceAll(searchString, replaceString);
        }
        return result;
      }
    That won't work if you have, for example, "<i>Italic text</i>, <b>Bold text</b>, and normal text" as your content, since your regex will only capture characters other than "<".

    Also, your regex will match EVERYTHING first and then backtrack, which is not good at all!

    A better regex would be <td>(.*?)</td>. Simple, yet effective. This works unless you have a table within a table. If you do have nested tables, you would need a slightly more complex regex.

    This regex will work as expected, matching the <td> and </td> parts and storing the text in the middle in group 1.
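
    A minimal sketch of that replacement using <td>(.*?)</td>. The class and method names (TdToH1, replaceTd) are made up for illustration, and Pattern.DOTALL is an addition not mentioned above: by default "." does not match line terminators, so a cell whose text wraps onto a second line would otherwise be skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TdToH1 {
    // Lazy match from a bare <td> to the nearest </td>.
    // DOTALL lets '.' cross line breaks inside the cell.
    private static final Pattern TD =
            Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    static String replaceTd(String html) {
        Matcher m = TD.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps any '$' or '\' in the cell text literal
            String h1 = "<h1>" + m.group(1) + "</h1>";
            m.appendReplacement(sb, Matcher.quoteReplacement(h1));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>WELCOME TO MAIN PAGE</td>\n</tr>\n"
                + "<tr>\n<td width=\"5\"><img src=\"\"/></td>\n</tr>";
        System.out.println(replaceTd(html));
    }
}
```

    Note that the pattern requires a bare <td>, so cells with attributes are left alone, and an empty <td></td> would become an empty <h1></h1>.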
    Last edited by CodesAway; 02-28-2010 at 09:42 PM.
    CodesAway - codesaway.info
    writing tools that make writing code a little easier

  4. #4
    travishein is offline Senior Member

    that IS much better. *likes*.

  5. #5
    masterrs.mind is offline Member

    Thanks travishein and CodesAway,

    Actually, the content is more like a nested table, so I need a regex that also looks for the child table within the parent.
    Also,
    the regex you provided, ".*<td>([^<]*)</td>.*", looks only for the <td></td> tag, not for one within a <table></table> tag. (I think the regex must look something like
    "*<table>...<td>([^<]*)</td>.*", i.e. a <td> within a table. Please correct me if I am wrong.)

    So I am looking for a more complex regex that also works for nested tables.

    Thanks,
    Ramya.

  6. #6
    Fubarable's Avatar
    Fubarable is offline Moderator
    Join Date
    Jun 2008
    Posts
    19,316
    Blog Entries
    1
    Rep Power
    26


    Are you absolutely sure that you don't want to use a dedicated HTML or XHTML parser for this?

  7. #7
    masterrs.mind is offline Member

    Fubarable,

    I am not supposed to use any HTML parsers as such. :(
    So I am looking for a regular expression instead.

    Thanks,
    Ramya.

  8. #8
    masterrs.mind is offline Member

    Thanks,

    Assuming the table is a nested table, I need to get the
    <td>WELCOME TO MAIN PAGE</td>
    within the child table and replace it with <h1>WELCOME TO MAIN PAGE</h1>.
    So if I give my input as something like this:

    <p><br />
    <table summary="This table defines/formats the inner content section of the page." cellpadding="0" cellspacing="0" border="0" width="602">
    <tbody>
    <tr>
    <td width="534">
    <table summary="This table defines/formats the page header." border="0" cellspacing="0" cellpadding="0" width="100%">
    <tbody>
    <tr>
    <td>WELCOME TO MAIN PAGE</td>
    </tr>
    <tr>
    <td src="" /></td>
    </tr>
    </tbody>
    </table>

    I need to get <h1>WELCOME TO MAIN PAGE</h1>.
    Can you give me a solution?
    Can I achieve it using an HTML parser? If yes, how?
    Please let me know.

    Thanks,
    Ramya.

  9. #9
    travishein is offline Senior Member

    In another thread I replied to recently, I recommended having a look at HTML Parser,
    which is a DOM-like Java interface for parsing an HTML document.

    It has a Node and NodeList type of structure, and you can iterate over each node,

    so it's likely possible to create a method that looks for the first table, then gets the table inside that table, then the first td inside the nested table.

    The trick, I suspect, is to have a kind of stream-based parse mode, where we go along writing out the same content that was read in for every node, except for the node we are interested in, where we rewrite the <td> as an <h1>.

    But this would likely be the best general-purpose way to do it using a parser.
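
    To make the stream-style idea concrete without tying it to a particular parser library, here is a rough sketch that uses only java.util.regex as a tag tokenizer. This is not the HTML Parser API: the class NestedTdRewriter, its rewrite method and the depth counter are all assumptions made for illustration. It copies everything through unchanged and only rewrites a text-only <td> that sits inside a nested table (table depth 2 or more).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedTdRewriter {
    // One token per match: an opening <table ...> tag, a closing </table> tag,
    // or a bare <td>text</td> cell whose text (group 1) contains no markup.
    private static final Pattern TOKEN = Pattern.compile(
            "<table\\b[^>]*>|</table\\s*>|<td>([^<]*)</td>",
            Pattern.CASE_INSENSITIVE);

    static String rewrite(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        int depth = 0; // number of <table> elements currently open
        int last = 0;  // end of the previous token in the input
        while (m.find()) {
            out.append(html, last, m.start()); // copy untouched text through
            String cellText = m.group(1);
            if (cellText != null) {
                // A text-only cell: rewrite it only inside a nested table.
                boolean nested = depth >= 2;
                boolean hasText = !cellText.trim().isEmpty();
                if (nested && hasText) {
                    out.append("<h1>").append(cellText).append("</h1>");
                } else {
                    out.append(m.group());
                }
            } else {
                // A table tag: update the nesting depth and copy it through.
                if (m.group().startsWith("</")) {
                    depth = Math.max(0, depth - 1);
                } else {
                    depth++;
                }
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table width=\"602\"><tr><td width=\"534\">"
                + "<table border=\"0\"><tr><td>WELCOME TO MAIN PAGE</td></tr></table>"
                + "</td></tr></table>";
        System.out.println(rewrite(html));
    }
}
```

    The sketch ignores comments, scripts and cells that carry attributes; a real parser would handle those cases properly.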


    lol, is it 'cross-posting' if I use the same reply for different threads?

  10. #10
    masterrs.mind is offline Member

    Hi,

    Thanks travishein.
    I am new to HTML parsing. Is there any sample or rough draft of code that I could adapt for my case?
    Also, I found "jericho" somewhere on the web, but I am not sure whether it will be useful in this case. Do you have any idea about it?

    Thanks,
    Ramya.

  11. #11
    travishein is offline Senior Member

    Ah, neat. I didn't know about the Jericho parser before. I'll have to take a look at it sometime. From quickly reading over their javadocs, it appears their API might be more useful for reading and transforming HTML, i.e. turning just the right <td> tag into an <h1> tag. I remember htmlparser being very verbose, like working with the DOM API, and it kind of feels like it was meant for reading only specific parts of a given HTML document, so the stream-like transform you would want here might be more work to do... or not. I really don't know; this is already more than I have used it for.

  12. #12
    masterrs.mind is offline Member

    Hi travishein,

    I used the Jericho parser to replace the <table> with <h1></h1>, and I am using the following regex for it:
    out = (out.replaceFirst ("<table(?://s[^>]*)?>(?:(?>[^<]+)|<(?!table(?://s[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>")).trim();

    It works fine and replaces the table with <h1>td content</h1>
    if I give input such as:

    <table>
    <tbody>
    <tr>
    <td>COLLEGE ZONE OUTREACH CENTERS</td>
    </tr>
    <tr>
    <td></td>
    </tr>
    </tbody>
    </table>

    But if the <table> tag contains any attributes, the regex does not work, meaning it does not replace the table.

    Something like this:
    <table summary="This table defines/formats the page header." border="0" cellspacing="0" cellpadding="0" width="100%">
    <tbody>
    <tr>
    <td>Welcome to main page</td>
    </tr>
    <tr>
    <td background="http:///images/title_long_dots.gif"><img border="0" height="1" width="200" src="http:///images/spacer.gif" /></td>
    </tr>
    </tbody>
    </table>

    Can you please look into my regex and let me know what is wrong with it?

    Thanks,
    Ramya.

    P.S.: Please bear with me if I am asking too many questions. Being new to regex, it is taking me a long time to figure this out.

  13. #13
    travishein is offline Senior Member

    Not sure; this is a kind of funky regex compared to what I'm used to.

    How about making the //s optional?

    That is, instead of
    Java Code:
    <table(?://s[^>]*)?>
    you do

    Java Code:
    <table(?://s*[^>]*)?>
    so,

    Java Code:
    "<table(?://s*[^>]*)?>(?:(?>[^<]+)|<(?!table(?://s*[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>"
    Edit: doh, no, //s is whitespace; I thought that was a string at first.

    OK,

    well, wouldn't it work to just allow anything except ">" after table?
    Java Code:
    "<table(?:[^>]*)?>(?:(?>[^<]+)|<(?!table(?:[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>"
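
    One more thing that may be worth checking, offered as an assumption rather than a diagnosis: inside a Java string literal the whitespace class is written \\s, whereas //s matches two literal slashes followed by an "s". If the pattern was really compiled with //s, the optional attribute group could never match, which would explain why only a bare <table> was replaced. The sketch below uses \\s; the class and method names are made up for illustration, and tdContent stands in for the value obtained elsewhere in the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InnermostTableToH1 {
    // <table>, optionally with attributes, then any content that does not
    // open another <table>, up to the nearest </table>.
    private static final Pattern TABLE_WITHOUT_NESTED_TABLE = Pattern.compile(
            "<table(?:\\s[^>]*)?>"
            + "(?:(?>[^<]+)|<(?!table(?:\\s[^>]*)?>))*?"
            + "</table>");

    static String replaceFirstTable(String html, String tdContent) {
        // quoteReplacement keeps '$' and '\' in tdContent literal
        String h1 = Matcher.quoteReplacement("<h1>" + tdContent + "</h1>");
        return TABLE_WITHOUT_NESTED_TABLE.matcher(html).replaceFirst(h1).trim();
    }

    public static void main(String[] args) {
        String html = "<table summary=\"page header\" border=\"0\" width=\"100%\">\n"
                + "<tbody>\n<tr>\n<td>Welcome to main page</td>\n</tr>\n"
                + "</tbody>\n</table>";
        System.out.println(replaceFirstTable(html, "Welcome to main page"));
    }
}
```

    Because of the negative lookahead, the pattern only matches a table that has no <table> opening tag inside it, so on nested input replaceFirst skips the outer table and replaces the first table that contains no nested table.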
    Last edited by travishein; 03-02-2010 at 05:57 PM. Reason: doh

  14. #14
    masterrs.mind is offline Member

    Thanks a lot, travishein!!

    It is working to some extent. There might be some problem while traversing the attributes within <table>: if I remove the quotes " " around the attribute values (summary, cellspacing, cellpadding, width within <table>, and background, image border, height, width within <td>), the regex works. It works for something like this:

    <table summary=This table defines/formats the page header. border=0 cellspacing=0 cellpadding=0 width=100%>
    <tbody>
    <tr>
    <td>COLLEGE ZONE OUTREACH CENTERS</td>
    </tr>
    <tr>
    <td background=http://images/title_long_dots.gif><img border=0 height=1 width=200 src=http://images/spacer.gif/></td>
    </tr>
    </tbody>
    </table>

    I appreciate your help, you are really so helpful. :)

    Regards,
    Ramya.

  15. #15
    masterrs.mind is offline Member

    One more thing: I have multiple regexes in my code to deal with HTML tag replacement and HTML tag stripping (removing images, removing tables, stripping non-breaking spaces, etc.). I am looking for a sample configuration file to use instead of hard-coding everything. Can you provide an example config file for my program?

  16. #16
    travishein is offline Senior Member

    I would try to create a kind of Transform interface, with a method signature that makes sense, maybe one that takes in the string, manipulates the content, and returns the modified string.

    Then create an implementation class of this Transform interface for each operation (such as RemoveImageTransform, RemoveTablesTransform, etc.).

    Then, with these in individual Java files, have a top-level TransformRunner class that reads its configuration from something like an XML file containing the sequence of transform classes to be loaded, each expected to implement the Transform interface.

    If you use the Spring Framework, this is natural to do in its applicationContext.xml with something like
    Java Code:
    <bean id="transformRunner"  class="mypackage.TransformRunner">
      <property name="transform">
      <list>
        <bean class="mypackage.RemoveImageTransform"/>
        <bean class="mypackage.RemoveTablesTransform"/>
        <!-- .. and so on -->
      </list>
      </property>
    </bean>
    That is not to say you have to use the Spring Framework. If you are not using it, it should be possible to create a stand-alone transform runner class that reads its own XML or .properties file and invokes as many specific transform implementation classes as needed.

    The idea is that with each transform in its own implementation class, it is possibly not as efficient as a monolithic class that does it all in one pass, but it is more modular: a plug-in style design where you extend or change behaviour by changing the configuration file.
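
    A rough sketch of the stand-alone variant, assuming a .properties file with a single "transforms" key that lists class names in the order they should run. The key name, the method names (apply, fromProperties, run) and StripNbspTransform are made up for illustration. In a real file the class names would be written out by hand as fully qualified names; the demo builds the text from Class.getName() only so that it runs regardless of package.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

interface Transform {
    String apply(String content);
}

class RemoveImageTransform implements Transform {
    public String apply(String content) {
        return content.replaceAll("(?i)<img\\b[^>]*>", ""); // drop <img ...> tags
    }
}

class StripNbspTransform implements Transform {
    public String apply(String content) {
        return content.replace("&nbsp;", " "); // literal replace, no regex
    }
}

final class TransformRunner {
    private final List<Transform> transforms = new ArrayList<>();

    // Instantiates each class named in the comma-separated "transforms" key.
    static TransformRunner fromProperties(Properties config)
            throws ReflectiveOperationException {
        TransformRunner runner = new TransformRunner();
        for (String name : config.getProperty("transforms", "").split(",")) {
            name = name.trim();
            if (name.isEmpty()) {
                continue;
            }
            Object instance = Class.forName(name).getDeclaredConstructor().newInstance();
            runner.transforms.add((Transform) instance);
        }
        return runner;
    }

    // Applies the transforms in configuration order.
    String run(String content) {
        for (Transform t : transforms) {
            content = t.apply(content);
        }
        return content;
    }

    public static void main(String[] args) throws Exception {
        String configText = "transforms = " + RemoveImageTransform.class.getName()
                + ", " + StripNbspTransform.class.getName() + "\n";
        Properties config = new Properties();
        config.load(new StringReader(configText)); // stands in for reading a file
        TransformRunner runner = fromProperties(config);
        System.out.println(runner.run("<p>a&nbsp;b<img src=\"x.gif\" /></p>"));
    }
}
```

    Adding, removing or reordering operations then only means editing the "transforms" line, not the runner.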
