- 03-31-2011, 05:48 PM #1
Senior Member
- Join Date
- Feb 2011
- Posts
- 107
- Rep Power
- 0
String operations or regex please
I am trying to parse a page which has headlines of this sort on it:
Java Code:
<h1 class="headline1"><a href="article/2011-03-31-obama-saga-sucks-in-senior-democrats"> Obama in serious trouble </a></h1>
Now my objective is to isolate all instances of links that are found in the h1 tag of class headline:
Java Code:
article/2011-03-31-obama-saga-sucks-in-senior-democrats
and then concatenate each one with http://www.website.com/ so I end up with:
Java Code:
http://www.website.com/article/2011-03-31-obama-saga-sucks-in-senior-democrats
I will then use this URL to get the content of the article in question.
There are perhaps 15 links with the structure described above. How do I get these links and ignore all the other HTML on the page? Help please :(
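For the concatenation step described above, a minimal sketch using java.net.URI instead of plain string concatenation. The class name ArticleUrl and the method toAbsolute are made up for illustration; only the base address and the sample href come from the post.

```java
import java.net.URI;

class ArticleUrl {
    // Base address from the post. The trailing slash matters: resolve()
    // appends a relative path to the last "directory" of the base.
    static final URI BASE = URI.create("http://www.website.com/");

    // Hypothetical helper: turns an href taken from the page into an absolute URL.
    static String toAbsolute(String href) {
        return BASE.resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        System.out.println(
            toAbsolute("article/2011-03-31-obama-saga-sucks-in-senior-democrats"));
    }
}
```

Compared with "http://www.website.com/" + href, this also copes with an href that already starts with a slash.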
- 03-31-2011, 06:02 PM #2
Senior Member
- Join Date
- Oct 2010
- Location
- Germany
- Posts
- 780
- Rep Power
- 4
You could use the Scanner class.
The following is not tested (and works only if the links are exactly like you have said)
Try something like
Java Code:
Scanner sc = new Scanner(new URL("HERE YOUR URL").openStream());
while (sc.findWithinHorizon("<h1 class=\"headline1\"><a href=\"(.+?)\">", 0) != null) {
    System.out.println("http://www.website.com/" + sc.match().group(1));
    // do anything with the string :)
}
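To try the Scanner approach without a network connection, here is a self-contained sketch that runs the same pattern over an HTML string and collects the results in a list. The class HeadlineLinks, the method extract, and the sample markup in main are assumptions for illustration, and like the snippet above it only works if the tags look exactly as posted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HeadlineLinks {
    // Same pattern as in the post: group 1 captures the href value.
    static final String LINK_PATTERN =
        "<h1 class=\"headline1\"><a href=\"(.+?)\">";

    // Hypothetical helper: returns the absolute URL of every headline1 link.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        try (Scanner sc = new Scanner(html)) {
            // A horizon of 0 means "search the rest of the input".
            while (sc.findWithinHorizon(LINK_PATTERN, 0) != null) {
                urls.add("http://www.website.com/" + sc.match().group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample page: two headline links and one unrelated link.
        String html =
            "<h1 class=\"headline1\"><a href=\"article/one\"> First </a></h1>\n"
          + "<p><a href=\"about.html\">About</a></p>\n"
          + "<h1 class=\"headline1\"><a href=\"article/two\"> Second </a></h1>\n";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

For a real page, the String passed to the Scanner would be replaced by the stream from new URL(...).openStream(), as in the snippet above.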
- 03-31-2011, 06:12 PM #3
Senior Member
- Join Date
- Feb 2011
- Posts
- 107
- Rep Power
- 0
Let me test this, thanks for spending your time on this
-
eRaaaa, i'm not so good with regex, but is it possible to replace this part:
<h1 class=\"headline1\">
with this:
<h1*>
to eliminate the need for a particular CSS class?
- 03-31-2011, 06:23 PM #5
Senior Member
- Join Date
- Oct 2010
- Location
- Germany
- Posts
- 780
- Rep Power
- 4
You mean <h1 .*?>, right?
Yes, it's possible, but I thought he only wants to extract the links of class headline1:
"Now my objective is to isolate all instance of links that are found in the h1 tag of class headline"
Maybe he means <h1 class=\"headline.*?\"> too?! :)
-
no i think he only wanted the class headline like you originally posted, but i was just curious myself, thanks.
what i meant was, so that the program would recognize all forms of H1
<h1>
<h1 class="head1">
<h1 class="head2">
<h1 color=red class="otherHead">
i thought that * indicates any character or no character,
but i guess that's not correct
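On the * question: in a regex, * is a quantifier meaning "zero or more of the preceding element", so <h1*> matches <h>, <h1>, <h11> and so on, but not <h1 class="head1">. Note also that <h1 .*?> requires a space after h1, so it would skip a bare <h1>. A rough sketch of a pattern that accepts all the h1 forms listed above follows; the class AnyH1Links, the method extract, and the exact pattern are one possible choice, not the only one, and it still assumes the link follows the h1 tag directly with href as the first attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnyH1Links {
    // <h1, a word boundary (so <h10> is rejected), then any attributes
    // up to the closing '>', optional whitespace, and the link's href.
    static final Pattern H1_LINK = Pattern.compile(
        "<h1\\b[^>]*>\\s*<a\\s+href=\"([^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the href of every link that opens an h1.
    static List<String> extract(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = H1_LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up sample covering the h1 variants from the post plus an h2.
        String html =
            "<h1><a href=\"a\">A</a></h1>\n"
          + "<h1 class=\"head1\"><a href=\"b\">B</a></h1>\n"
          + "<h1 color=red class=\"otherHead\"><a href=\"c\">C</a></h1>\n"
          + "<h2><a href=\"d\">D</a></h2>\n";
        System.out.println(extract(html));
    }
}
```

For anything less regular than this, an HTML parser is usually a safer choice than a regex.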
- 03-31-2011, 06:31 PM #7
Senior Member
- Join Date
- Feb 2011
- Posts
- 107
- Rep Power
- 0
I want to do this operation on links within all such tags:
Java Code:
<h1 class="headline1">
- 03-31-2011, 07:14 PM #8
Senior Member
- Join Date
- Oct 2010
- Location
- Germany
- Posts
- 780
- Rep Power
- 4