- 02-21-2010, 05:42 PM #1
Member
- Join Date
- Feb 2010
- Posts
- 14
- Rep Power
- 0
- 02-22-2010, 03:26 AM #2
Senior Member
- Join Date
- Dec 2009
- Location
- Belgrade, Serbia
- Posts
- 364
- Rep Power
- 4
If you only need "TEXT", maybe you could try the ContentFilter class.
Examples are here (from 15.4...):
The Element Class
"...suppose your application only needs to concern itself with elements and text, but can completely skip all comments and processing instructions..."
First parse everything to check that the original document is valid,
then use the TreePrinter class from 'Example 15.2. Inspecting elements'
to see the whole tree, and after that use filters to extract what you need.
Is this useful?
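The "elements and text only" idea quoted above can be sketched without JDOM. The example below is a rough analog built on the DOM classes that ship with the JDK, not JDOM's ContentFilter itself; the class name TextOnlyWalker and the sample input are made up for illustration, and it assumes the input is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps text, descends into elements, skips everything else.
class TextOnlyWalker {

    /** Parses well-formed XML from a string and returns its text content. */
    static String extractText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    private static void collectText(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    out.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                    collectText(child, out); // recurse into nested elements
                    break;
                default:
                    break; // comments and processing instructions are skipped
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <!-- hidden --><b>world</b><?pi ignored?>!</p>";
        System.out.println(extractText(xml)); // prints "Hello world!"
    }
}
```

With JDOM, a filter would play the role of the switch statement here; the tree walk itself stays the same.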
- 02-23-2010, 03:54 PM #3
Member
- Join Date
- Feb 2010
- Posts
- 14
- Rep Power
- 0
No, I can't parse XHTML files with this; it gives an error.
For example, with Example 15.2. I guess it's an entity resolver problem, but I don't know how to solve it. Please help.
I also tried this sample program:
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

public class testerr {
    public static void main(String[] args) throws Exception {
        SAXBuilder parser = new SAXBuilder();
        // Turn on DTD validation: the document must declare a DOCTYPE the parser can resolve.
        parser.setValidation(true);
        // XhtmlEntityResolver is not defined anywhere in this thread.
        parser.setEntityResolver(new XhtmlEntityResolver());
        Document doc = parser.build("http://www.imsc.res.in/local/Html/example_homepage.html");
        // Write the parsed tree back out to the console.
        XMLOutputter outputter = new XMLOutputter();
        outputter.output(doc, System.out);
    }
}
Error for 15.2:
Caused by: org.xml.sax.SAXParseException: Document is invalid: no grammar found.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:131)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:384)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:318)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:250)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl$NSContentDriver.scanRootElementHook(XMLNSDocumentScannerImpl.java:626)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3103)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:922)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:986)
at testerr.main(testerr.java:14)
at __SHELL14.run(__SHELL14.java:7)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at bluej.runtime.ExecServer$3.run(ExecServer.java:792)
Last edited by nijil; 02-23-2010 at 03:56 PM.
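One plausible reading of the trace above is that DTD validation is switched on while the document being parsed declares no DOCTYPE, so the parser has no grammar to validate against. The sketch below tries to reproduce that situation with the JDK's own parser rather than JDOM; the class name ValidationCheck and the sample markup are assumptions for illustration, and the exact error wording may differ between parser versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: collects validation errors instead of failing on the first one.
class ValidationCheck {

    static List<String> validationErrors(String xml, boolean validate) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate); // DTD validation on or off
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String noDoctype = "<html><body><p>hi</p></body></html>";
        System.out.println("validating:     " + validationErrors(noDoctype, true));
        System.out.println("not validating: " + validationErrors(noDoctype, false));
    }
}
```

If the validating run reports errors and the non-validating run does not, the document is well-formed but has nothing to be validated against, which matches the situation in the trace.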
- 02-23-2010, 05:54 PM #4
Senior Member
- Join Date
- Dec 2009
- Location
- Belgrade, Serbia
- Posts
- 364
- Rep Power
- 4
Why are you using a plain HTML file?
Java Code:
Document doc = parser.build("http://www.imsc.res.in/local/Html/example_homepage.html");
Where is the XML or XHTML?
- 02-23-2010, 06:01 PM #5
Senior Member
- Join Date
- Dec 2009
- Location
- Belgrade, Serbia
- Posts
- 364
- Rep Power
- 4
Let's take a step back.
Do you know how to parse a simple XML file,
not with JDOM but with the ordinary classes that come with Java?
Are you completely familiar with the rules for creating an XHTML file?
If your answers are NO, don't rush into this. It is essential
to know those things before we continue, so please get familiar with them.
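For reference, parsing a simple XML document with only the classes bundled with the JDK can look roughly like the sketch below. The class name SimpleXmlParse and the sample document are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class: lists the direct child elements of the root.
class SimpleXmlParse {

    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    static List<String> childElementNames(String xml) {
        List<String> names = new ArrayList<>();
        Element root = parseRoot(xml);
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                names.add(((Element) child).getTagName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<library><book/><dvd/><book/></library>";
        System.out.println(parseRoot(xml).getTagName()); // prints "library"
        System.out.println(childElementNames(xml));      // prints "[book, dvd, book]"
    }
}
```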
- 02-23-2010, 07:29 PM #6
Member
- Join Date
- Feb 2010
- Posts
- 14
- Rep Power
- 0
Yes, it is working with XML/XHTML, but this is exactly what my problem is.
I want to get the whole content of a URL into an object or file, and from there extract the text and the URLs that the page links to, for a search engine. I tried using regular expressions, but I heard that JDOM offers a better solution through its tree structure and functions, so I tried to build a document from a URL, but it started showing errors. The solution you gave worked for XML and XHTML files, thanks. Please give a solution for other types; I only know the basics of XML.
e.g.: https://wiki.ubuntu.com/ServerKarmic...owerManagement
OutputStream (Java 2 Platform SE v1.4.2)
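For pages that really are well-formed XHTML, pulling out the linked URLs through the tree instead of regular expressions might look like the sketch below. It uses the JDK's DOM classes rather than JDOM, and the class name LinkExtractor and the sample markup are made up for illustration. It will not work on plain HTML that is not well-formed XML; that case needs a tolerant HTML parser, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the href values of all <a> elements.
class LinkExtractor {

    static List<String> extractLinks(String xhtml) {
        List<String> links = new ArrayList<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                String href = anchor.getAttribute("href"); // empty string if absent
                if (!href.isEmpty()) {
                    links.add(href);
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>See <a href=\"http://example.org/one\">one</a>"
                + " and <a href=\"/two\">two</a>.</p><a name=\"top\">no href</a></body></html>";
        System.out.println(extractLinks(page)); // prints "[http://example.org/one, /two]"
    }
}
```

Relative links such as "/two" would still have to be resolved against the page's own URL before a crawler could follow them.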
- 02-23-2010, 07:44 PM #7
Senior Member
- Join Date
- Dec 2009
- Location
- Belgrade, Serbia
- Posts
- 364
- Rep Power
- 4
Let's clear things up, please.
When you start a thread on a forum by asking a question,
it is natural to finish it by solving THAT question.
If new questions come up, please start a new thread.
The point is that anyone who reads the thread title can open it and see the solution to THAT question.
Now, if your primary question is solved, please mark this thread as SOLVED
and start a new one with a new title and a new question.
If not, let's solve that first, mark it, and create a new thread for all other questions.
Sorry for this, but those are the forum rules...
- 02-23-2010, 07:50 PM #8
Senior Member
- Join Date
- Dec 2009
- Location
- Belgrade, Serbia
- Posts
- 364
- Rep Power
- 4
Questions:
What are we doing? Creating input for your search engine?
What do we have to parse: HTML/XHTML/XML? This is related to
"...get the whole content...of a url".
Do we have to use JDOM?
Maybe you could start a new thread titled
"How to start gathering content from URLs for my search engine"
so we are all aware of what is needed.