- 01-28-2011, 12:31 PM #1
Member
- Join Date
- Jan 2011
- Posts
- 2
- Rep Power
- 0
HTML table to two-dimensional array
Hello everyone!
The problem is to parse an HTML table into a 2D array.
The requirements are:
1. The parser should handle rowspans, colspans, nested tables and so on, that is, everything the HTML standard allows. The tables I need to parse are generated automatically by another program, so they are rather complex and verbose.
2. If the table contains spans, the value of a spanned cell should be put into the array only once, in the upper-left corner of the area it covers; the other array cells that correspond to the spanned cell in the HTML table should be null (this may be hard to follow, so see the example below).
So far I have found solutions in PHP (JS_Extractor, "JS_Extractor! And the death of Table Extractor - Jack Sleight") and in Java ("Java HTML Table parser Simbiosis"), but it seems they don't suit my requirements: JS_Extractor is written in PHP and doesn't handle nested tables, and "Java HTML Table parser Simbiosis" doesn't even handle spans.
Today I tried HTTPUnit, but the results are disappointing too. Simple tables are parsed correctly, but complex ones are not.
E.g.
Table code:
XML Code:
<html>
<body>
<table border="2" width="20%" height="20%">
  <tr bgcolor="red">
    <td colspan="2" rowspan="2"> <span>1</span> </td>
    <td> <span>2</span> </td>
    <td> <span>3</span></td>
    <td> <span>4.1</span> </td>
    <td> <span>5.1</span> </td>
    <td> <span>6 last</span> </td>
  </tr>
  <tr bgcolor="green">
    <td rowspan="2"> <span>1</span> </td>
    <td> <span>2.4x</span> </td>
    <td> <span>3.3x</span> </td>
    <td> <span>4</span> </td>
    <td> <span>5 last</span> </td>
  </tr>
  <tr bgcolor="ffcc00">
    <td> <span>1x</span> </td>
    <td> <span>2</span> </td>
    <td> <span>3</span> </td>
    <td> <span>4</span> </td>
    <td> <span>5.8</span> </td>
    <td> <span>6 last</span> </td>
  </tr>
  <tr bgcolor="yellow">
    <td><span>1</span></td>
    <td><span>2</span></td>
    <td><span>3</span></td>
    <td><span>4</span></td>
    <td><span>5</span></td>
    <td><span>6</span></td>
    <td><span>7 last</span></td>
  </tr>
</table>
</body>
</html>
This is the table as shown in Chrome (the colors mark the corresponding rows):
The array I want to get as a result:

Here you can see what I meant in requirement number 2: the "1" from the first span and the "1" from the second span are put in the upper-left corner of the spanned area, while the remaining cells in that area are null.
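A minimal sketch of the placement rule from requirement 2, assuming the cells of each row have already been extracted together with their rowspan and colspan values. The names SpanGrid, Cell, toGrid, plain and sampleRows are made up for illustration and do not come from HTTPUnit or any other library. The sketch also assumes a rowspan that runs past the last row is simply cut off, and it does not deal with nested tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative names only; nothing here is an existing library API.
class SpanGrid {
    /** One table cell: its text plus rowspan and colspan (1 when the attribute is absent). */
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }
    }

    /** Shorthand for a cell without spans. */
    static Cell plain(String text) {
        return new Cell(text, 1, 1);
    }

    /** Lays the rows out on a grid; a spanned cell's text lands only in its upper-left slot. */
    static String[][] toGrid(List<List<Cell>> rows) {
        // Slot -> text. Slots that are merely covered by a span are stored with a null value.
        Map<Long, String> occupied = new HashMap<>();
        int width = 0;
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already claimed by a rowspan from an earlier row.
                while (occupied.containsKey(key(r, c))) {
                    c++;
                }
                // Assumption: a rowspan reaching past the last row is cut off.
                int endRow = Math.min(rows.size(), r + cell.rowspan);
                for (int rr = r; rr < endRow; rr++) {
                    for (int cc = c; cc < c + cell.colspan; cc++) {
                        occupied.put(key(rr, cc), null);
                    }
                }
                occupied.put(key(r, c), cell.text);
                c += cell.colspan;
                width = Math.max(width, c);
            }
        }
        String[][] grid = new String[rows.size()][width];
        for (Map.Entry<Long, String> e : occupied.entrySet()) {
            long k = e.getKey();
            grid[(int) (k >> 32)][(int) k] = e.getValue();
        }
        return grid;
    }

    /** Packs a (row, column) pair into one map key. */
    static long key(int r, int c) {
        return ((long) r << 32) | c;
    }

    /** The first three rows of the table from this post, typed in by hand. */
    static List<List<Cell>> sampleRows() {
        List<List<Cell>> rows = new ArrayList<>();
        rows.add(Arrays.asList(new Cell("1", 2, 2), plain("2"), plain("3"),
                plain("4.1"), plain("5.1"), plain("6 last")));
        rows.add(Arrays.asList(new Cell("1", 2, 1), plain("2.4x"), plain("3.3x"),
                plain("4"), plain("5 last")));
        rows.add(Arrays.asList(plain("1x"), plain("2"), plain("3"),
                plain("4"), plain("5.8"), plain("6 last")));
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : toGrid(sampleRows())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Under these assumptions, the first three rows of the sample table come out as [1, null, 2, 3, 4.1, 5.1, 6 last], [null, null, 1, 2.4x, 3.3x, 4, 5 last] and [1x, 2, null, 3, 4, 5.8, 6 last], so every row ends in the same column.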
The result given by HTTPUnit:

As you can see, even if we drop requirement 2, there is an error in the third row.
And this is a rather simple example without nested tables; with them the result is badly wrong.
What would you recommend in this case?
I'd be grateful for any help, as I can't believe this problem hasn't been solved yet!
- 01-28-2011, 12:44 PM #2
Senior Member
- Join Date
- Jun 2008
- Posts
- 2,366
- Rep Power
- 8
Okay? So try parsing it yourself.
You can pull out the table and then use an XML parser to navigate it if you want to avoid most of the regex work (which I would suggest doing).
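The suggestion above (pull out the table, then walk it with an XML parser) might look roughly like this with the DOM parser that ships with the JDK. It is only a sketch: it assumes the markup is well-formed XML with lowercase tag names, as the sample in post #1 happens to be, which real-world HTML often is not. TableDomWalk, RawCell and readRows are hypothetical names. Only direct children are visited, so the rows of a nested table are not mixed into the outer one, although a nested table's text still ends up in the text of the cell that contains it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative names only; the parsing itself uses the standard javax.xml / org.w3c.dom API.
class TableDomWalk {
    /** A cell as found in the markup: trimmed text plus its span attributes. */
    static final class RawCell {
        final String text;
        final int rowspan;
        final int colspan;

        RawCell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }
    }

    /** Reads the rows of the first table in the document, ignoring rows of nested tables. */
    static List<List<RawCell>> readRows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        Element table = (Element) doc.getElementsByTagName("table").item(0);
        if (table == null) {
            throw new IllegalArgumentException("no <table> element found");
        }
        List<List<RawCell>> rows = new ArrayList<>();
        for (Element tr : rowsOf(table)) {
            List<RawCell> row = new ArrayList<>();
            for (Element cell : children(tr)) {
                String tag = cell.getTagName();
                if (tag.equals("td") || tag.equals("th")) {
                    row.add(new RawCell(cell.getTextContent().trim(),
                            span(cell, "rowspan"), span(cell, "colspan")));
                }
            }
            rows.add(row);
        }
        return rows;
    }

    /** Rows directly under the table, or under its thead/tbody/tfoot sections. */
    static List<Element> rowsOf(Element table) {
        List<Element> rows = new ArrayList<>();
        for (Element child : children(table)) {
            String tag = child.getTagName();
            if (tag.equals("tr")) {
                rows.add(child);
            } else if (tag.equals("thead") || tag.equals("tbody") || tag.equals("tfoot")) {
                for (Element tr : children(child)) {
                    if (tr.getTagName().equals("tr")) {
                        rows.add(tr);
                    }
                }
            }
        }
        return rows;
    }

    /** Direct child elements only, so nested tables are never descended into. */
    static List<Element> children(Element parent) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                out.add((Element) n);
            }
        }
        return out;
    }

    /** Span attribute value, defaulting to 1 when the attribute is missing. */
    static int span(Element cell, String name) {
        String value = cell.getAttribute(name).trim();
        return value.isEmpty() ? 1 : Math.max(1, Integer.parseInt(value));
    }
}
```

The rows returned this way still need to be laid out on a grid to account for the spans; the walk only recovers each cell's text and its rowspan and colspan values.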
- 01-28-2011, 01:16 PM #3