Parsing Real-World HTML with XPath Support
I am using TagSoup + XOM, as described in:
BadMagicNumber » Using XPath on real-world HTML documents
This seems to work well, except for the following namespace problem:
Dom4j + XPath + TagSoup – Namespaces = sweet! :: Kelvin Tan - Lucene Solr Nutch Consultant
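To make the namespace problem concrete, here is a minimal sketch. It deliberately uses only the JDK's built-in DOM and javax.xml.xpath classes rather than TagSoup or XOM, so it runs without extra jars. The class name NamespaceXPathDemo, the prefix "h", and the sample markup are made up for illustration. The sample is a well-formed stand-in for the kind of output TagSoup produces, where every element sits in the XHTML namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPathDemo {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: stands in for cleaned-up HTML whose elements
    // are all in the XHTML namespace.
    public static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body>"
        + "<a href=\"/one\">one</a><a href=\"/two\">two</a>"
        + "</body></html>";

    // Parse with namespace awareness switched on, so the default
    // namespace declaration is honored.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Count the nodes an XPath expression selects. When bindPrefix is true,
    // the (made-up) prefix "h" is bound to the XHTML namespace.
    public static int count(Document doc, String expression, boolean bindPrefix)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        if (bindPrefix) {
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyList().iterator();
                }
            });
        }
        NodeList nodes =
            (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Unprefixed names in XPath 1.0 mean "no namespace", so nothing matches.
        System.out.println(count(doc, "//a", false));                    // prints 0
        // Binding a prefix to the XHTML namespace makes the query match.
        System.out.println(count(doc, "//h:a", true));                   // prints 2
        // Matching on local-name() sidesteps namespaces entirely.
        System.out.println(count(doc, "//*[local-name()='a']", false));  // prints 2
    }
}
```

With XOM the same idea should apply: as far as I can tell, passing an XPathContext that binds a prefix to the XHTML namespace into query(...) plays the role of the NamespaceContext above. The alternative, which the linked Dom4j post appears to take, is to stop the parser from emitting namespaces in the first place.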
It seems other parsers are available:
Open Source HTML Parsers in Java
some of which support XPath.
Does anyone know which of these is fastest on real-world HTML?
Is XOM the best way to go, or would Dom4j or another library be a better choice?