  1. #1
    nijil is offline Member
    Join Date
    Feb 2010
    Posts
    14
    Rep Power
    0

    Extract text from an XHTML page using JDOM

    I have a web page (XHTML) in an InputStream object. How do I extract the text from it using JDOM?

  2. #2
    FON
    FON is offline Senior Member
    Join Date
    Dec 2009
    Location
    Belgrade, Serbia
    Posts
    366
    Rep Power
    6

    If you only need the text, you could try the ContentFilter class.

    Examples are here (from 15.4...):

    The Element Class


    "...suppose your application only needs to concern itself with elements and text, but can completely skip all comments and processing instructions..."

    First parse the whole document to see if the original is valid,
    then use the TreePrinter class from 'Example 15.2. Inspecting elements'
    to see the whole tree, and after that use filters to extract what you need.

    Is this useful?
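
The filtering idea can also be sketched without JDOM. The example below is only a minimal sketch: it swaps JDOM's ContentFilter for a hand-written walk over the JDK's built-in DOM so that it runs without extra libraries. The class name TextOnlyWalker and the sample markup are made up for illustration. It keeps text and CDATA, descends into elements, and skips comments and processing instructions.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of JDOM or the JDK.
final class TextOnlyWalker {

    // Parses well-formed XML from a stream and returns only its text content.
    static String extractText(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Keeps text and CDATA, recurses into elements, ignores every other node type
    // (comments, processing instructions, and so on).
    private static void collect(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    out.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                    collect(child, out);
                    break;
                default:
                    break;
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <!-- note --><b>JDOM</b><?pi data?> users</p>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractText(in)); // prints: Hello JDOM users
    }
}
```

Because the method takes an InputStream, it matches the situation in the original question, provided the stream really contains well-formed XHTML.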

  3. #3
    nijil is offline Member

    No, I can't parse XHTML files with this; it gives an error

    This is for Example 15.2. I guess it's an entity resolver problem, but I don't know how to solve it. Please help.

    I also tried this sample program:

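    import org.jdom.Document;
    import org.jdom.input.SAXBuilder;
    import org.jdom.output.XMLOutputter;

    public class testerr {

        public static void main(String[] args) throws Exception {
            SAXBuilder parser = new SAXBuilder();
            // Validation only works if the document declares a DTD (a DOCTYPE).
            parser.setValidation(true);
            // XhtmlEntityResolver is not part of JDOM; it has to be supplied separately.
            parser.setEntityResolver(new XhtmlEntityResolver());
            // This is the call that throws (testerr.java:14 in the trace below).
            Document doc = parser.build("http://www.imsc.res.in/local/Html/example_homepage.html");

            // Echo the parsed document to standard output.
            XMLOutputter outputter = new XMLOutputter();
            outputter.output(doc, System.out);
        }
    }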

    Error for Example 15.2:


    Caused by: org.xml.sax.SAXParseException: Document is invalid: no grammar found.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(ErrorHandlerWrapper.java:131)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:384)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:318)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:250)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl$NSContentDriver.scanRootElementHook(XMLNSDocumentScannerImpl.java:626)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3103)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:922)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:986)
    at testerr.main(testerr.java:14)
    at __SHELL14.run(__SHELL14.java:7)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at bluej.runtime.ExecServer$3.run(ExecServer.java:792)
    Last edited by nijil; 02-23-2010 at 04:56 PM.
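
"Document is invalid: no grammar found" is the error Xerces reports when validation is switched on but the document declares no DOCTYPE, so there is no DTD to validate against. That is most likely what happens with the page above. Here is a minimal sketch of the same behaviour using the JDK's own parser instead of JDOM; ValidationDemo and the sample markup are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of JDOM or the JDK.
final class ValidationDemo {

    // Parses the markup and returns the validity errors the parser reported.
    static List<String> validityErrors(String xml, boolean validate) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate); // DTD validation on or off
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                // Validity problems arrive here as recoverable errors; collect them.
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness problems are fatal; rethrow them.
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Well-formed, but without a DOCTYPE declaration.
        String noDoctype = "<html><body><p>Hi</p></body></html>";
        System.out.println("validating:     " + validityErrors(noDoctype, true));
        System.out.println("non-validating: " + validityErrors(noDoctype, false));
    }
}
```

In the JDOM program the equivalent change would presumably be to drop the parser.setValidation(true) call, since a SAXBuilder created with the no-argument constructor does not validate.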

  4. #4
    FON is offline Senior Member

    Why are you using a plain HTML file:

    Java Code:
    Document doc = parser.build("http://www.imsc.res.in/local/Html/example_homepage.html");
    ???

    Where is the XML or XHTML?

  5. #5
    FON is offline Senior Member

    Let's take a step back.

    Do you know how to parse a simple XML file,
    not with JDOM but with the ordinary classes that come with Java?

    Are you completely familiar with the rules for creating an XHTML file?

    If your answers are NO, don't rush into this. It is essential
    to know those things before we continue, so please get familiar with them.
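
For reference, parsing a simple XML document with the classes that ship with Java can look roughly like this. It is a minimal sketch; SimpleXmlParse and the sample document are invented for illustration, and the input is read from an InputStream as in the original question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class for illustration.
final class SimpleXmlParse {

    // Builds a DOM tree from the stream using the JDK's default (non-validating) parser.
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<greeting><to>reader</to><text>Hello</text></greeting>";
        Document doc = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());     // prints: greeting
        // getTextContent() concatenates the text of all descendant elements.
        System.out.println(root.getTextContent()); // prints: readerHello
    }
}
```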

  6. #6
    nijil is offline Member

    Yes, it is working with XML/XHTML, but this is exactly what my problem is

    I want to get the whole content of a URL into an object or file, and from there extract the text and the URLs that the page links to, for a search engine. I tried using regular expressions, but I heard that JDOM offers a better solution through its tree structure and functions, so I tried to build a document from a URL, but it started showing errors. The solution you gave worked for XML and XHTML files, thanks. Please give a solution for other types. I only know the basics of XML, not much more.

    eg:https://wiki.ubuntu.com/ServerKarmic...owerManagement
    OutputStream (Java 2 Platform SE v1.4.2)
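
For pages that really are well-formed XHTML, collecting the linked URLs from the parsed tree could look like the sketch below. It uses the JDK's DOM classes rather than JDOM, and LinkExtractor plus the sample page are hypothetical. Ordinary HTML that is not well-formed will not get through an XML parser at all; it would first have to be cleaned up by an HTML-tolerant parser such as TagSoup or JTidy, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
final class LinkExtractor {

    // Returns the href value of every <a> element, in document order.
    static List<String> extractLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                // Named anchors without an href are not links; skip them.
                if (a.hasAttribute("href")) {
                    links.add(a.getAttribute("href"));
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // The sample has no DOCTYPE, so the parser does not try to fetch a DTD.
        String page = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a name=\"top\">no link</a>"
                + "<p><a href=\"/b\">B</a></p>"
                + "</body></html>";
        System.out.println(extractLinks(page)); // prints: [http://example.com/a, /b]
    }
}
```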

  7. #7
    FON is offline Senior Member

    Let's clear things up, please.

    When you start a thread on the forum by asking a question,
    it is natural to finish it by solving THAT question.

    If new questions come up, please start a new thread.

    The point is that anyone who reads the thread title can open it and see the solution to THAT question.

    Now, if your primary question is solved, please mark this thread as SOLVED
    and start a new one with a new title and a new question.
    If not, let's solve that first, mark it, and create a new thread for all other questions.

    Sorry for this, but those are the forum rules...

  8. #8
    FON is offline Senior Member

    Questions:

    What are we doing? Creating input for your search engine?

    What do we have to parse: HTML, XHTML, or XML? This relates to
    "...get the whole content of a url..."
    Do we have to use JDOM?

    Maybe you could start a new thread titled
    "How to start gathering content from URLs for my search engine"
    so we are all aware of what is needed.
