HTML table to two-dimensional array
Hello everyone!
The problem is to parse an HTML table into a 2D array.
The requirements are:
1. The parser should handle rowspans, colspans, nested tables and so on, that is, everything the HTML standard allows. The tables I need to parse are generated automatically by another program, so they are rather complex and verbose.
2. If the table contains spans, the value of a spanned cell should appear only once in the array, in the top-left corner of the area that cell covers; the other array cells covered by the spanned cell should be null (this may be hard to follow, so see the example below).
So far I have found solutions in PHP (JS_Extractor, from "JS_Extractor! And the death of Table Extractor" by Jack Sleight) and in Java ("Java HTML Table parser Simbiosis"), but neither seems to meet my requirements: JS_Extractor is written in PHP and doesn't handle nested tables, and "Java HTML Table parser Simbiosis" doesn't even handle spans.
Today I tried HTTPUnit, but the results are disappointing too. Simple tables are parsed correctly, but complex ones are not.
For example, here is the table code:
<html>
<body>
<table border="2" width="20%" height="20%">
<tr bgcolor="red">
<td colspan="2" rowspan="2">
<span>1</span>
</td>
<td>
<span>2</span>
</td>
<td>
<span>3</span></td>
<td>
<span>4.1</span>
</td>
<td>
<span>5.1</span>
</td>
<td>
<span>6 last</span>
</td>
</tr>
<tr bgcolor="green">
<td rowspan="2">
<span>1</span>
</td>
<td>
<span>2.4x</span>
</td>
<td>
<span>3.3x</span>
</td>
<td>
<span>4</span>
</td>
<td>
<span>5 last</span>
</td>
</tr>
<tr bgcolor="ffcc00">
<td>
<span>1x</span>
</td>
<td>
<span>2</span>
</td>
<td>
<span>3</span>
</td>
<td>
<span>4</span>
</td>
<td>
<span>5.8</span>
</td>
<td>
<span>6 last</span>
</td>
</tr>
<tr bgcolor="yellow">
<td><span>1</span></td>
<td><span>2</span></td>
<td><span>3</span></td>
<td><span>4</span></td>
<td><span>5</span></td>
<td><span>6</span></td>
<td><span>7 last</span></td>
</tr>
</table>
</body>
</html>
This is the table as rendered in Chrome (the colors mark the corresponding rows):
http://xmages.net/storage/10/1/0/1/a...d/ba6aad88.png
The array I want to get as a result:
http://xmages.net/storage/10/1/0/a/c...d/01419959.png
Here you can see what I meant in requirement 2. The "1" from the first spanned cell and the "1" from the second spanned cell are each placed in the top-left corner of their spanned area, while the remaining cells in that area are null.
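As a rough illustration of requirement 2, the placement rule can be sketched in Python as follows. This is only a sketch under assumptions: the function name `place_cells` and the `(text, rowspan, colspan)` tuple shape are made up for the example, the cells are assumed to have been extracted from the HTML already, and spans are not clipped or checked for overlap the way the HTML standard describes.

```python
from typing import List, Optional, Tuple

# Hypothetical input shape: one (text, rowspan, colspan) tuple per <td>/<th>.
Cell = Tuple[str, int, int]


def place_cells(rows: List[List[Cell]]) -> List[List[Optional[str]]]:
    """Put each cell's text in the top-left corner of its span, None elsewhere."""
    occupied = set()  # (row, col) slots already claimed by some cell
    values = {}       # (row, col) of each top-left corner -> cell text
    height = len(rows)
    width = 0
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots claimed by a rowspan coming from an earlier row.
            while (r, c) in occupied:
                c += 1
            values[(r, c)] = text
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            height = max(height, r + rowspan)
            width = max(width, c + colspan)
            c += colspan
    return [[values.get((r, c)) for c in range(width)] for r in range(height)]


# The table from this post, written out by hand as tuples.
example = [
    [("1", 2, 2), ("2", 1, 1), ("3", 1, 1), ("4.1", 1, 1), ("5.1", 1, 1), ("6 last", 1, 1)],
    [("1", 2, 1), ("2.4x", 1, 1), ("3.3x", 1, 1), ("4", 1, 1), ("5 last", 1, 1)],
    [("1x", 1, 1), ("2", 1, 1), ("3", 1, 1), ("4", 1, 1), ("5.8", 1, 1), ("6 last", 1, 1)],
    [("1", 1, 1), ("2", 1, 1), ("3", 1, 1), ("4", 1, 1), ("5", 1, 1), ("6", 1, 1), ("7 last", 1, 1)],
]
grid = place_cells(example)
```

With this input, `grid` has 4 rows of 7 columns, and the third row comes out as `["1x", "2", None, "3", "4", "5.8", "6 last"]`: the `None` is the slot still covered by the `rowspan="2"` cell from the second row.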
The result produced by HTTPUnit:
http://xmages.net/storage/10/1/0/0/9...d/999751f7.png
As you can see, even if we drop requirement 2, there is an error in the third row.
And this is a rather simple example without nested tables; with them the result is badly wrong.
What can you recommend in this case?
I'd be grateful for any help; I can't believe this problem has not been solved yet!