Parse HTML, regex help
I am working on a project where I need to get specific content from an HTML page template on a newspaper site.
I need to get the heading and the body of the article.
Ignore everything until you meet this:
<h1 id="article_headline">Spoilt ballot papers spark controversy</h1>
Strip the h1 tag so that only the text remains: Spoilt ballot papers spark controversy
That is then passed to a String variable.
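A minimal sketch of the headline step, written in Python as an assumption since the post does not name a language. The names `HEADLINE_RE` and `extract_headline` are hypothetical, and the pattern assumes the `h1` tag always carries exactly `id="article_headline"` as in the sample above:

```python
import re

# Matches the headline tag from the sample and captures the text inside it.
# Assumes the id attribute is written with double quotes, as in the sample.
HEADLINE_RE = re.compile(
    r'<h1\s+id="article_headline"\s*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE,
)

def extract_headline(page):
    """Return the headline text, or None if the tag is not found."""
    match = HEADLINE_RE.search(page)
    if match is None:
        return None
    return match.group(1).strip()
```

Everything before and after the tag is skipped by `search`, so there is no need to trim the page first. If the templates vary more than the sample suggests, a real HTML parser is likely to be more robust than a regex.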
Then move on, ignoring everything until we meet this:
<span class="article_body">Whole article is here, but there is a catch here. Wait for it</span>
We remove the tags, take everything in the middle, and pass it to a variable.
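The body step could be sketched along these lines in Python (again an assumed language). `BODY_RE`, `TAG_RE` and `extract_body` are hypothetical names, and the pattern assumes the article span contains no nested `<span>` elements, because the non-greedy match stops at the first `</span>`:

```python
import re
from html import unescape

# Captures everything between the opening article span and the first </span>.
BODY_RE = re.compile(
    r'<span\s+class="article_body"\s*>(.*?)</span>',
    re.DOTALL | re.IGNORECASE,
)
# Matches any single tag so that inner markup can be dropped.
TAG_RE = re.compile(r'<[^>]+>')

def extract_body(page):
    """Return the article text with inner tags removed, or None if not found."""
    match = BODY_RE.search(page)
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))
    # Turn entities such as &amp; back into plain characters.
    return unescape(text).strip()
```

If the real pages do nest spans inside the article, this pattern would cut the body short, and an HTML parser would be the safer choice.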
The catch is that within the main article there is some text that appears with an advert, and I want to take it out (the markup is shown below).
Can someone please help? I know this is a lot to ask, but I really need the help.
<div class='articlecontinues'><img src='/images/icon_downarrow.gif' /> CONTINUES BELOW
<img src='/images/icon_downarrow.gif' /></div><center><div style='width:300px; height:250px;'>
<iframe marginwidth='0' marginheight='0' scrolling='no' frameborder='0' width='300' height='250'
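One possible way to drop the advert block, sketched in Python (an assumed language). The sample above is cut off, so where the advert ends is unknown; this sketch assumes it closes with `</center>`, which needs checking against a real page. `ADVERT_RE` and `strip_advert` are hypothetical names:

```python
import re

# Starts at the "articlecontinues" div and runs to the first </center>.
# ASSUMPTION: the advert ends with </center>; the sample markup is truncated.
ADVERT_RE = re.compile(
    r"<div\s+class=['\"]articlecontinues['\"]\s*>.*?</center>",
    re.DOTALL | re.IGNORECASE,
)

def strip_advert(body_html):
    """Return the article markup with every advert block removed."""
    return ADVERT_RE.sub("", body_html)
```

This would need to run on the raw article markup before any tags are stripped, otherwise the words CONTINUES BELOW would be left behind in the text.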