
02-27-2010, 06:25 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
Help with Regex to get only <td></td> and the text within it in a <table> tag
Hi All,
I am trying to make some changes to HTML content as part of a migration.
I was told to use regular expressions to achieve it. I have a requirement where I need to get the <td></td> within a <table>.
Here is the sample text:
|
Code:
|
<table summary="This table defines a format the inner content section of page" cellpadding="0" cellspacing="0" border="0" width="602">
<tbody>
<tr>
<td width="534">
<table summary="This table defines a format of page header" cellpadding="0" cellspacing="0" border="0" width="602">
<tbody>
<tr>
<td>WELCOME TO MAIN PAGE</td>
</tr>
<tr>
<td background><img border="0" height="1" width="200"
src=""/></td>
</tr>
</tbody>
</table> |
I am looking for a regex that gets only the <td></td> elements that contain plain text,
i.e., <td>WELCOME TO MAIN PAGE</td>,
and I need to replace that <td></td> with <h1>WELCOME TO MAIN PAGE</h1>.
So my output must look like:
<h1>WELCOME TO MAIN PAGE</h1>
I have come up with a regex like this:
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String contents = stringBuffer.toString();
// Lazily matches from "<table" to the first "/table>", i.e. the whole table
Pattern pattern = Pattern.compile("<table(.|\n)*?/table>");
Matcher m = pattern.matcher(contents);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // ip (not shown in this snippet) replaces the entire match
    m.appendReplacement(sb, ip);
}
m.appendTail(sb);
String requiredOutput = sb.toString();
but this is stripping everything within the <table> tag.
Can someone give me the regex?
Thanks,
Ramya.
Last edited by Fubarable; 02-27-2010 at 06:39 PM.
Reason: code tags added
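A minimal sketch of why the loop above strips the whole table: appendReplacement swaps out the entire match, so a pattern that spans the table replaces the table. Capturing the cell text in a group and referring to it as $1 keeps the text. The class and method names here are made up for illustration, and the second pattern is just one simple choice that only handles text-only cells.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppendReplacementDemo {
    // Generic find/appendReplacement/appendTail loop.
    // "$1" in the template refers to capture group 1 of the current match.
    static String rewrite(String html, String regex, String replacementTemplate) {
        Matcher m = Pattern.compile(regex).matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Everything matched by the regex is replaced, not just part of it
            m.appendReplacement(sb, replacementTemplate);
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<table>\n<tr><td>WELCOME TO MAIN PAGE</td></tr>\n</table>";
        // The match covers the whole table, so the whole table disappears
        System.out.println("[" + rewrite(html, "<table(.|\n)*?/table>", "") + "]");
        // The match covers one text-only cell; its text is kept via $1
        System.out.println(rewrite(html, "<td>([^<]*)</td>", "<h1>$1</h1>"));
    }
}
```

If the replacement text itself may contain "$" or "\", wrapping it in Matcher.quoteReplacement avoids it being read as a group reference.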
|
|

02-27-2010, 07:28 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
I found that this regular expression seems to work:
".*<td>([^<]*)</td>.*"
It matches the td tag and its content up to the next closing td tag.
For example:
|
Code:
|
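// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public String replaceTd(String content) {
    String result = null;
    String regex = ".*<td>([^<]*)</td>.*";
    // DOTALL lets '.' match line breaks, so matches() can succeed on multi-line HTML
    Pattern p = Pattern.compile(regex, Pattern.DOTALL);
    Matcher m = p.matcher(content);
    if (m.matches()) {
        String tdValue = m.group(1);
        System.out.println("content:" + tdValue);
        String searchString = "<td>" + tdValue + "</td>";
        String replaceString = "<h1>" + tdValue + "</h1>";
        // replace() treats both strings literally; replaceAll() would parse them as regex
        result = content.replace(searchString, replaceString);
    }
    // null when no text-only <td> was found
    return result;
} |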
|
|

02-28-2010, 08:38 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Texas
Posts: 238
Rep Power: 1
|
|
Originally Posted by travishein
|
I found this regular expression seems to work.
".*<td>([^<]*)</td>.*"
where it matches the td tag and content up to the next closing td tag.
|
That won't work if you have, for example, "<i>Italic text</i>, <b>Bold text</b>, and normal text" as your content, since your regex will only capture characters other than "<".
Also, your regex will consume EVERYTHING first and then backtrack, which is not good at all!
A better regex would be <td>(.*?)</td>. Simple, yet effective. This works unless you have a table within a table. If you do have nested tables, you would need a slightly more complex regex.
This regex will work as expected, matching the <td> and </td> parts and storing the text in the middle in group 1.
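To make the difference concrete, here is a small sketch comparing the two patterns on a plain cell and on a cell with inline markup. The class name is made up, and Pattern.DOTALL is my addition: in Java, "." does not match line breaks by default, so the lazy pattern would otherwise miss cells whose content spans several lines.

```java
import java.util.regex.Pattern;

final class TdRegexComparison {
    // Captures only cells whose content has no "<" at all
    static final Pattern NO_MARKUP = Pattern.compile("<td>([^<]*)</td>");
    // Lazy match up to the nearest </td>; DOTALL lets '.' cross line breaks
    static final Pattern LAZY = Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    // Rewrites every matching cell as a heading, reusing group 1 as its text
    static String toH1(Pattern p, String html) {
        return p.matcher(html).replaceAll("<h1>$1</h1>");
    }

    public static void main(String[] args) {
        String plain = "<td>WELCOME TO MAIN PAGE</td>";
        String markup = "<td><b>Bold</b> text</td>";
        System.out.println(toH1(NO_MARKUP, plain));
        System.out.println(toH1(LAZY, plain));
        System.out.println(toH1(NO_MARKUP, markup)); // left unchanged
        System.out.println(toH1(LAZY, markup));
    }
}
```

Neither pattern matches a <td> that carries attributes, and the lazy one will also turn an empty <td></td> into an empty heading, so real input may need an extra check.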
__________________
CodesAway - codesaway.info
writing tools that make writing code a little easier
Last edited by CodesAway; 02-28-2010 at 08:42 PM.
|
|

02-28-2010, 09:26 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
|
that IS much better. *likes*.
|
|

03-01-2010, 04:33 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
|
Thanks travishein and CodesAway,
Actually, the content is more like a nested table, so I need a regex that also looks for the child table within the parent.
Also,
the regex you have provided, ".*<td>([^<]*)</td>.*", looks only for the <td></td> tag, not for a <td> within a <table></table> tag. (I think the regex must look something like
"*<table>...<td>([^<]*)</td>.*", i.e. a <td> within a table. Please correct me if I am wrong.)
So I am looking for a more complex regex that also works for nested tables.
Thanks,
Ramya.
|
|

03-01-2010, 04:42 PM
|
 |
Moderator
|
|
Join Date: Jun 2008
Posts: 8,388
Rep Power: 11
|
|
|
Are you absolutely sure that you don't want to use a dedicated HTML or XHTML parser for this?
|
|

03-01-2010, 04:48 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
Fubarable,
I am not supposed to use any HTML parsers as such.
So I am looking for a regular expression instead.
Thanks,
Ramya.
|
|

03-01-2010, 07:59 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
|
Thanks,
Assuming the table is a nested table, I need to get the
<td>WELCOME TO MAIN PAGE</td>
within the child table and replace it with <h1>WELCOME TO MAIN PAGE</h1>.
So if I give my input something like this:
<p><br />
<table summary="This table defines/formats the inner content section of the page." cellpadding="0" cellspacing="0" border="0" width="602">
<tbody>
<tr>
<td width="534">
<table summary="This table defines/formats the page header." border="0" cellspacing="0" cellpadding="0" width="100%">
<tbody>
<tr>
<td>WELCOME TO MAIN PAGE</td>>
</tr>
<tr>
<td src="" /></td>
</tr>
</tbody>
</table>
I need to get <h1>WELCOME TO MAIN PAGE</h1>.
Can you give me a solution?
Can I achieve it using an HTML parser? If yes, how?
Please let me know.
Thanks,
Ramya.
|
|

03-01-2010, 10:18 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
In another thread I replied to recently, I recommended having a look at HTML Parser,
which is a kind of DOM-like Java interface for parsing an HTML document.
It has a Node and NodeList type of structure, and you can iterate over each node,
so it is likely possible here to create a method that looks for the first table, then gets the table inside this table, then the first td inside the nested table.
The trick, though, I suspect, is to have a kind of stream-based parse mode, where we go along and write out the same content that was read in for all nodes of the parser, except for the node we are interested in, where we rewrite the <td> as an <h1>.
But using a parser would likely be the best general-purpose way to do it.
lol, is it 'cross posting' if I use the same reply for different threads?
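The walk described above can be roughly sketched without any parser library. This is not HTML Parser's API. It is a stand-in that scans tags with a small regex tokenizer, tracks how deep we are in <table> elements, copies everything through unchanged, and rewrites only the first text-only <td> found inside a nested table. All names are made up, and the tokenizer assumes attribute values never contain ">".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedTableHeading {
    // Any tag: group 1 is "/" for a closing tag, group 2 is the tag name
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    static String rewrite(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int depth = 0;            // current <table> nesting depth
        int pos = 0;              // index up to which input has been handled
        boolean replaced = false; // only the first matching cell is rewritten
        while (m.find(pos)) {
            out.append(html, pos, m.start()); // text before the tag, unchanged
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            if (name.equals("table")) {
                depth += closing ? -1 : 1;
            }
            if (!replaced && !closing && depth >= 2 && name.equals("td")) {
                // Text-only cell: the next tag after <td> must be its own </td>
                int lt = html.indexOf('<', m.end());
                if (lt >= 0 && html.regionMatches(true, lt, "</td>", 0, 5)) {
                    String text = html.substring(m.end(), lt).trim();
                    if (!text.isEmpty()) {
                        out.append("<h1>").append(text).append("</h1>");
                        pos = lt + 5; // skip past the closing </td>
                        replaced = true;
                        continue;
                    }
                }
            }
            out.append(m.group()); // every other tag is written out as read
            pos = m.end();
        }
        out.append(html, pos, html.length()); // trailing text after the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table width=\"602\"><tr><td width=\"534\">"
                + "<table width=\"100%\"><tr><td>WELCOME TO MAIN PAGE</td></tr></table>"
                + "</td></tr></table>";
        System.out.println(rewrite(html));
    }
}
```

A real parser would also cope with comments, scripts, and malformed markup, which this sketch ignores.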
|
|

03-01-2010, 11:27 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
|
Hi,
Thanks travishein.
I am new to HTML parsing. Is there any sample or rough draft of code that I could adapt for my case?
Also, I found "jericho" somewhere on the web, but I'm not sure whether it will be useful in this case. Do you have any idea about it?
Thanks,
Ramya.
|
|

03-02-2010, 02:47 AM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
|
Ah neat. I didn't know about the Jericho parser before. I'll have to take a look at it sometime. But from quickly reading over their javadocs, it appears their API might be more useful for reading and transforming HTML, i.e. turning just the right <td> tag into an <h1> tag. I remember HTML Parser being very verbose, like working with the DOM API, and it kind of feels like it was meant for reading only specific parts of a given HTML document, so the stream-like transform you would want here might be more work to do. Or not. I really don't know. This is more than I have ever used it for.
|
|

03-02-2010, 03:36 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
Hi travishein,
I used the Jericho parser to replace the <table> with <h1></h1>, and I am using the following regex for it:
out = (out.replaceFirst ("<table(?://s[^>]*)?>(?:(?>[^<]+)|<(?!table(?://s[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>")).trim();
It works fine and was able to replace the table with <h1>td content</h1>
if I give input such as:
<table>
<tbody>
<tr>
<td>COLLEGE ZONE OUTREACH CENTERS</td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
But if the <table> tag has any attributes, the regex does not work, meaning it does not replace the table.
Something like this:
<table summary="This table defines/formats the page header." border="0" cellspacing="0" cellpadding="0" width="100%">
<tbody>
<tr>
<td>Welcome to main page</td>
</tr>
<tr>
<td background="http:///images/title_long_dots.gif"><img border="0" height="1" width="200" src="http:///images/spacer.gif" /></td>
</tr>
</tbody>
</table>
Can you please look into my regex and let me know what is wrong with it?
Thanks,
Ramya.
P.S.: Please bear with me if I am asking too many questions. Being new to regex, it is taking me a long time to figure this out.
|
|

03-02-2010, 04:54 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
Not sure; this is a kind of funky regex compared to what I'm used to.
How about making the //s optional?
So instead of
|
Code:
|
<table(?://s[^>]*)?> |
you do
|
Code:
|
<table(?://s*[^>]*)?> |
so,
|
Code:
|
"<table(?://s*[^>]*)?>(?:(?>[^<]+)|<(?!table(?://s*[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>" |
Edit: doh, no, //s is whitespace; I thought that was a literal string at first.
OK,
well, wouldn't it work to just allow anything except ">" after table?
|
Code:
|
"<table(?:[^>]*)?>(?:(?>[^<]+)|<(?!table(?:[^>]*)?>))*?</table>","<h1>" + tdContent + "</h1>" |
Last edited by travishein; 03-02-2010 at 04:57 PM.
Reason: doh
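One possible explanation, assuming the pattern really contains //s exactly as posted: in a Java string, "//s" is two literal slashes followed by the letter s, while whitespace is written "\\s". With "//s" the optional attribute group can never match, so only a bare <table> is recognized. The sketch below checks that and then runs the full pattern with \\s. The class name, method names, and sample tdContent value are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableRegexDemo {
    // As posted: "//s" means the literal characters / / s
    static final String AS_POSTED = "<table(?://s[^>]*)?>";
    // "\\s" in Java source is the regex \s, i.e. one whitespace character
    static final String WITH_WHITESPACE = "<table(?:\\s[^>]*)?>";

    // Full pattern with \\s: matches a table that contains no other <table ...> tag
    static final Pattern INNERMOST_TABLE = Pattern.compile(
            "<table(?:\\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\\s[^>]*)?>))*?</table>");

    static boolean opensTable(String regex, String html) {
        return Pattern.compile(regex).matcher(html).find();
    }

    static String replaceTable(String html, String tdContent) {
        // quoteReplacement keeps "$" or "\" in the cell text from being misread
        String heading = Matcher.quoteReplacement("<h1>" + tdContent + "</h1>");
        return INNERMOST_TABLE.matcher(html).replaceFirst(heading).trim();
    }

    public static void main(String[] args) {
        String openTag = "<table border=\"0\" width=\"100%\">";
        System.out.println(opensTable(AS_POSTED, "<table>"));     // bare tag
        System.out.println(opensTable(AS_POSTED, openTag));       // tag with attributes
        System.out.println(opensTable(WITH_WHITESPACE, openTag)); // tag with attributes
        String html = openTag + "\n<tbody>\n<tr>\n<td>Welcome to main page</td>\n</tr>\n</tbody>\n</table>";
        System.out.println(replaceTable(html, "Welcome to main page"));
    }
}
```

Because the pattern refuses to cross another opening <table ...> tag, on nested input replaceFirst replaces the innermost table rather than the outer one.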
|
|

03-02-2010, 05:36 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
Thanks a lot travishein!!
It is working to some extent. There might be some problem while traversing the attributes within <table>: if I remove the quotes (" ") around the attribute values (summary, cellspacing, cellpadding, width within <table>, and background, image border, height, width within <td>), the regex works. It works for something like this:
<table summary=This table defines/formats the page header. border=0 cellspacing=0 cellpadding=0 width=100%>
<tbody>
<tr>
<td>COLLEGE ZONE OUTREACH CENTERS</td>
</tr>
<tr>
<td background=http://images/title_long_dots.gif><img border=0 height=1 width=200 src=http://images/spacer.gif/></td>
</tr>
</tbody>
</table>
I appreciate your help, you are really so helpful.
Regards,
Ramya.
|
|

03-02-2010, 05:40 PM
|
|
Member
|
|
Join Date: Feb 2010
Posts: 20
Rep Power: 0
|
|
|
One more thing: I have multiple regexes in my code to deal with HTML tag replacement and HTML tag stripping (like removing images, removing tables, stripping non-breaking spaces, etc.). I am looking for a sample configuration file to use instead of hard-coding everything. Can you provide an example config file for my program?
|
|

03-02-2010, 07:09 PM
|
 |
Senior Member
|
|
Join Date: Sep 2009
Location: Canada
Posts: 456
Rep Power: 1
|
|
I would try to create a kind of Transform interface, where a method signature that makes sense might take in the string, manipulate the content, and return the modified string.
Then create an implementation class of this Transform interface for each operation (such as RemoveImageTransform, RemoveTablesTransform, etc.),
and then, with these as individual Java files, have a top-level TransformRunner class that would read its configuration from something like an XML file containing the sequence of transform classes to be loaded, each expected to implement the Transform interface.
If you use the Spring Framework, this is natural to do in its applicationContext.xml with something like
|
Code:
|
<bean id="transformRunner" class="mypackage.TransformRunner">
  <property name="transform">
    <list>
      <bean class="mypackage.RemoveImageTransform"/>
      <bean class="mypackage.RemoveTablesTransform"/>
      <!-- .. and so on -->
    </list>
  </property>
</bean> |
That is not to say you have to use the Spring Framework. If you are not using it, it should be possible to create a stand-alone transform runner class that reads its own XML or .properties file and invokes as many specific transform implementation classes as needed.
The idea is that with each transform in its own implementation class, it may not be as efficient as a monolithic class that does everything in one pass, but it is more modular: a plug-in style design that you can extend or change by editing the configuration file.
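As a rough sketch of the stand-alone variant, assuming a .properties-style config with a single hypothetical key, transforms, that lists the steps in order: the runner below looks each name up in a small registry instead of loading classes by name, which is a simplification of the idea above. The transform names, config key, and regexes are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

final class TransformRunner {
    // One operation on the content: take a string, return the modified string
    interface Transform { String apply(String content); }

    static final class RemoveImageTransform implements Transform {
        public String apply(String content) {
            return content.replaceAll("(?i)<img\\b[^>]*>", "");
        }
    }

    static final class StripNbspTransform implements Transform {
        public String apply(String content) {
            return content.replace("&nbsp;", " ");
        }
    }

    // Maps config names to transform factories (hypothetical names)
    private static final Map<String, Supplier<Transform>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("removeImages", RemoveImageTransform::new);
        REGISTRY.put("stripNbsp", StripNbspTransform::new);
    }

    private final List<Transform> transforms = new ArrayList<>();

    // Builds a runner from .properties text such as "transforms = a, b"
    static TransformRunner fromConfig(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        TransformRunner runner = new TransformRunner();
        for (String raw : props.getProperty("transforms", "").split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) continue;
            Supplier<Transform> factory = REGISTRY.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown transform: " + name);
            }
            runner.transforms.add(factory.get());
        }
        return runner;
    }

    // Applies the configured transforms in order
    String run(String content) {
        String result = content;
        for (Transform t : transforms) {
            result = t.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        TransformRunner runner = fromConfig("transforms = removeImages, stripNbsp");
        System.out.println(runner.run("<p>Hi&nbsp;there<img src=\"a.gif\"/></p>"));
    }
}
```

Changing the order or the set of steps then only means editing the transforms line, not the code.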
|
|
All times are GMT +2.