I'm trying to make an XHTML file parser with regular expressions. I want it to find the tags and count how many of each there are. For example, given:
<html>
<head>
<title>example</title>
</head>
<body>
<p>para</p>
<hr />
<p>para2</p>
</body>
</html>
The application should return:
<html>:1
<head>:1
<title>:1
</title>:1
</head>:1
<body>:1
<p>:2
</p>:2
<hr />:1
</body>:1
</html>:1
Afterwards I'll make it pair up start and end tags (<head>, </head>),
make it recognize that something isn't a tag if a ! follows the <
(<!-- --> and <!DOCTYPE html...), and make it skip everything between <script> and </script>,
since there are no tags in there and the < and > signs in JavaScript/VBScript could be mistaken for tags. I'll also change it so that it shows only <tagname>, without attributes.
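To make those later steps concrete, here is a rough sketch of how I imagine them working. It is only an assumption about the approach, not tested against real files: the class `TagNameSketch`, the method `tagNames`, and both patterns are made up for illustration. It strips comments and script blocks first, then matches only tags whose name starts with a letter (so `<!DOCTYPE ...>` is ignored) and rebuilds each tag without its attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the planned later steps; all names are made up.
class TagNameSketch {
    // Comments and <script>...</script> blocks are removed before scanning.
    // (?s) lets "." match newlines, (?i) ignores case, and ".*?" is reluctant,
    // so each match stops at the first closing "-->" or "</script>".
    private static final Pattern SKIP =
            Pattern.compile("(?is)<!--.*?-->|<script\\b.*?</script\\s*>");

    // "<", an optional "/", then a name that must start with a letter, so
    // "<!DOCTYPE ...>" never matches. Attributes up to ">" are matched but not kept.
    // Limitation: an attribute value containing ">" would end the match early.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the tags in document order with attributes removed.
    static List<String> tagNames(String xhtml) {
        String cleaned = SKIP.matcher(xhtml).replaceAll("");
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(cleaned);
        while (m.find()) {
            // group(1) is "/" for an end tag and "" otherwise; group(2) is the name.
            names.add("<" + m.group(1) + m.group(2) + ">");
        }
        return names;
    }

    public static void main(String[] args) {
        String xhtml = "<!DOCTYPE html><!-- <b>not a tag</b> -->"
                + "<p class=\"x\">hi</p><script>if (a < b && c > d) {}</script><hr />";
        System.out.println(tagNames(xhtml));
    }
}
```

In this sketch a self-closing tag such as <hr /> comes out as <hr>, and group(1) is what pairing start and end tags could key on later.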
But for now, I just want to get it started. Here is my code so far. I need help where it says so in the comments:
/**
* This class parses xhtml files - 05/22/08
* and returns each unique tag
* type and the quantity of it
* using java.util.regex package.
*/
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class XCheck1 { // simple driver class, to be improved later
    public static void main(String[] args) { // the JVM only starts main(String[])
        XC_Model model = new XC_Model();
        model.run();
    }
}
class XC_Model {
    private String[] tags;   // array to store the tags
    private int[] tagCounts; // tagCounts[i] stores how many of tags[i] there are
    private String cmatch;   // cmatch = current match
    private boolean found;   // whether the current tag is already in the array
    private int top;         // index of the next free slot in tags
    public XC_Model() { // parameterless constructor initializes fields
        tags = new String[500];   // I'll deal with more than 500 tags later
        tagCounts = new int[500]; // same size as tags
        cmatch = "";
        top = 0;
    }
    public String getData() { // I'll change this later; for now it just returns a String of "XHTML"
        String data = "<p>This is a <em><strong>XHTML</strong></em> paragraph</p><hr />"; // tags and random text
        return data;
    }
    public void doScan() { // the method that actually does the parsing
        // "<", one or more characters that are not ">", then ">".
        // The greedy "<.*>" would match from the first "<" to the last ">" as one match.
        Pattern pattern = Pattern.compile("<[^>]+>");
        Matcher matcher = pattern.matcher(getData()); // getData later will get the file
        while (matcher.find()) { // ***** this area is where I need help: I believe find() moves to the next match *****
            cmatch = matcher.group(); // I'm not sure this is the right way to match 1 by 1; group() should return the current match
            found = false; // found starts off as false
            for (int i = 0; i < top; i++) { // only the first top slots are filled
                if (cmatch.equals(tags[i])) { // if it finds another tag that is the same
                    tagCounts[i]++; // add 1 to the quantity of that tag
                    found = true; // found a match
                    break; // stop looping
                }
            }
            if (!found) { // if this is the first encounter with this tag
                tags[top] = cmatch; // add it to the array
                tagCounts[top] = 1; // 1 of this unique tag so far
                top++; // shift up to the next item
            }
        }
    }
    public void run() { // later this will have a param for the XHTML file
        doScan();
        for (int j = 0; j < top; j++) { // only the first top entries hold tags; the rest are null
            System.out.println(tags[j] + ": " + tagCounts[j]); // print out the results
        }
    }
}
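For comparison, here is a minimal self-contained sketch of the matching loop on its own. It relies on `Matcher.find()` advancing to the next match and `Matcher.group()` returning its text, and it swaps my parallel arrays for a `LinkedHashMap`, which is a different technique from the one in my class. The names `TagCountSketch` and `countTags` are made up for illustration, and `Map.merge` needs Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the final parser: counts each distinct tag string.
class TagCountSketch {
    // "<", one or more characters that are not ">", then ">".
    // Unlike "<.*>", this cannot run past the first ">" and swallow several tags.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns tag text -> number of occurrences, in order of first appearance.
    static Map<String, Integer> countTags(String data) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher matcher = TAG.matcher(data);
        while (matcher.find()) {              // true while there is another match
            String tag = matcher.group();     // text of the current match
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String data = "<p>para</p><hr /><p>para2</p>";
        for (Map.Entry<String, Integer> e : countTags(data).entrySet()) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```

The map keeps insertion order, so the tags print in the order they first appear, and there is no fixed 500-tag limit to worry about.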
Please reply, I need help with this.