Thread: Capturing Data From A Html File
- 05-19-2010, 02:49 PM #1
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Capturing Data From A Html File
Hi guys, I'm new to Java programming and my boss has asked me to do something that's puzzling me. I know the general approach from standard programming principles, but having never used Java I'm unsure how to go about it.
Basically, I need to read an HTML file from a Java program and extract the dynamichtml resource text references from it (I'm guessing by using a wildcard). Once I have extracted them, I want to put them in an array container and sort them alphabetically before outputting them.
If anyone can help me I'd be really grateful; it's giving me a hard time.
Regards, Nick :)
Last edited by nickrowe_2k; 05-19-2010 at 03:13 PM. Reason: update
- 05-19-2010, 03:25 PM #2
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16
What have you done so far?
And tell your boss that getting people to do work on Java who know nothing about Java without even giving them a training course is out and out silly.
- 05-19-2010, 03:52 PM #3
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
So far I have a reader which I found online and an array which I created using a tutorial.
I realise what I have isn't much to go on. With the programming knowledge I gained at uni I know what path I need to follow; I just don't have a clue how to go about it in Java. I wish I could just do it in JavaScript lol.
Basically this is driving me mad, and I agree it's silly. Within the HTML file I have, there are many references to resources such as dynamichtml ........name...
I need to capture the names of all these instances and store them in an array until the reader/buffer completes the document. When all the instances have been collected I need to sort them alphabetically and print them out.
I know I need some kind of autoarray, but so far this is all I've been able to find as a base. It's ridiculous, but when you're a junior and are getting thrown this stuff, what do you do? He's sort of a put-you-in-an-office-on-your-own-and-figure-it-out bloke lol.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class ReadHtml {
    public static void main(String[] args) throws IOException {
        // Backslashes in a Java string literal must be escaped as \\
        File myhtml = new File("C:\\Documents and Settings\\Kieren McDonald\\Desktop\\Nick\\Java\\my.html");
        // BufferedReader.readLine() replaces the deprecated DataInputStream.readLine()
        try (BufferedReader reader = new BufferedReader(new FileReader(myhtml))) {
            String line;
            // readLine() returns null once the end of the file is reached
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // try-with-resources closes the reader, and the underlying file, here
    }
}
class Array {
    public static void main(String[] args) {
        String[] anArray;         // declares an array of strings
        anArray = new String[3];  // allocates space for 3 string references
        // String values must be quoted literals
        anArray[0] = "dynamichtml TimelineManager_top_links";        // initialize first element
        anArray[1] = "dynamichtml TimelineManager_footer";           // initialize second element
        anArray[2] = "dynamichtml TimelineManager_quicksearch_form"; // etc.
        System.out.println("Resource Name 0: " + anArray[0]);
        System.out.println("Resource Name 1: " + anArray[1]);
        System.out.println("Resource Name 2: " + anArray[2]);
    }
}
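The "autoarray" mentioned above is most likely what Java calls an ArrayList: a list that grows as items are added, so the number of resources does not need to be known up front. A minimal sketch, assuming the names have already been found somehow (the class name GrowableNames and the helper sorted are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class GrowableNames {
    // Copies the given names into a growable list and sorts it alphabetically.
    static List<String> sorted(String... found) {
        List<String> names = new ArrayList<String>();
        for (String name : found) {
            names.add(name); // the list resizes itself as needed
        }
        Collections.sort(names); // natural String order (case-sensitive)
        return names;
    }

    public static void main(String[] args) {
        List<String> names = sorted(
                "TimelineManager_top_links",
                "TimelineManager_footer",
                "TimelineManager_quicksearch_form");
        for (String name : names) {
            System.out.println("Resource Name: " + name);
        }
    }
}
```

Unlike a fixed String[3], nothing here depends on there being exactly three resources.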
- 05-19-2010, 03:52 PM #4
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,601
- Blog Entries
- 7
- Rep Power
- 17
You have a lot to read: start reading the API documentation for every class that starts with HTML and read about the parser that does the job: the DocumentParser class. When such an object parses the HTML text it does all the nitty-gritty work for you and it calls an HTMLEditorKit.ParserCallback object for everything interesting it found. You have to write that callback class by (preferably?) extending from the HTMLEditorKit.ParserCallback class.
It may seem confusing at first but the entire scenario resembles the SAX parser approach (for XML).
kind regards,
Jos
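For readers who want to see the callback idea in code, here is a minimal sketch. It assumes ParserDelegator (which drives a DocumentParser internally) as the entry point, and the class name TagLister is made up for illustration. It only records the names of ordinary HTML start tags; whether this parser copes with non-HTML markers such as <@ ... @> is not something this sketch demonstrates.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagLister {
    // Returns the names of the start tags the parser reports, in document order.
    static List<String> startTags(String html) {
        final List<String> tags = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString()); // called once per opening tag
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(startTags("<html><body><p>hello</p></body></html>"));
    }
}
```

The shape is the same as SAX: the parser walks the document and calls back into your object, and your object decides what to keep.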
- 05-19-2010, 04:04 PM #5
Quote: "dynamic html resource text references"
Could you define what it is you are trying to extract from an HTML page, perhaps with some examples?
- 05-19-2010, 04:10 PM #6
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Explanation
Ok, basically I have an HTML file which contains several resource references to components used on a content management system.
Using my Java program I need to extract the names of these references, which all start with dynamichtml (followed by their name, e.g. quicksearch_form).
So my Java program needs to read in the HTML file, then (I'm guessing) use a wildcard that searches for whatever comes after dynamichtml and stores it in a slot in an array. Once an instance has been stored, the process needs to loop until the end of the document. Once the document has been read, the array needs to be sorted alphabetically and then output.
This is so that in component wizard the resources appear alphabetically instead of in the order they happen to be called within the HTML file.
- 05-19-2010, 04:16 PM #7
Quote: "references which all start with dynamichtml"
Sorry, my question is: what do you mean by dynamichtml? Is that JavaScript or ???
And what are the resources? URLs or ???
Or is "dynamichtml" the prefix for some variable or data or what?
- 05-19-2010, 04:23 PM #8
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Below is an example of a resource I have pasted in from the body of the HTML file.
:)
<@dynamichtml TimelineManager_quicksearch_form@>
<div>
<form name="QUICK_SEARCHFORM" method="GET" action="<$ssNodeLink(80013)$>" style="PADDING-RIGHT: 0px; DISPLAY: inline; PADDING-LEFT: 0px; PADDING-BOTTOM: 0px; MARGIN: 0px; PADDING-TOP: 0px">
<input type=hidden name="QueryText" value="">
<input type=hidden name="ResultCount" value="<$AdvancedSearch_ResultCount$>">
<input class="searchField" accesskey="f" size="16" name="searchStr">
<input class="searchButton" accesskey="g" type="button" value="Search" onclick="QuickSearch_BuildQueryTextAndSubmit()">
</form>
</div>
<@end@>
kind regards Nick :)
- 05-19-2010, 04:28 PM #9
Thanks.
Who/what processes the <@ ... @> tags? And the <$...$>
- 05-19-2010, 04:35 PM #10
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16
I suspect the HTML parser stuff Jos mentioned won't handle that terribly well.
I could be wrong, though.
If, as asked by Norm, you knew what did the initial processing of these you might be able to nick the code that searches for these things? That would be half the battle sorted out...
- 05-19-2010, 04:37 PM #11
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
I believe the tags are processed by the component wizard installed with the UCM, but my boss has explained that it can all be done via a Java program.
What I don't understand is that, one, if I create an array for each resource in JavaScript rather than Java it's quicker, and two, I can easily go into each resource and manually arrange them so that they output in the same way anyway.
The idea is that the Java program sees an instance of a resource, captures it, stores it and then outputs them all in ascending order.
Obviously this would be quicker once the program is completed, but having not worked with Java it's like a slap in the face. I figured that by using the reader to read the HTML file and convert it to a string I could simply look for an instance of dynamichtml as a string using some sort of wildcard.
hope that makes sense :)
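The "read the file and convert it to a string" step could look something like the sketch below. It is only an illustration: the class name HtmlSlurper is made up, the file path is taken from the command line rather than hard-coded, and the platform default character encoding is assumed.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class HtmlSlurper {
    // Reads everything from the given source into one String, one '\n' per line.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) { // null means end of input
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HtmlSlurper <path-to-html-file>");
            return;
        }
        try (Reader file = new FileReader(args[0])) {
            String text = readAll(file);
            // Plain text search: no wildcard needed just to see if the marker is present
            System.out.println("contains dynamichtml: " + text.contains("dynamichtml"));
        }
    }
}
```

Once the whole file is a String, finding the markers is ordinary string searching and has nothing to do with HTML parsing.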
- 05-19-2010, 04:52 PM #12
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Hi Tolls, Norm
Yeah, initially I thought the same. However, when I opened up the component wizard, looked at a resource file and then opened up the Java tab, there was nothing in there. I thought if I could read the code I could just modify it a little, but unfortunately no luck.
Good idea though
- 05-19-2010, 04:54 PM #13
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16
Bang goes my favourite trick...:)
- 05-19-2010, 04:58 PM #14
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Haha, yeah, mine too actually, sooooo much easier.
Having not worked with Java before, I knew I'd at least be able to modify some existing code and find some bits and pieces on the net to guide me the rest of the way. Unfortunately I can't find ANY tutorials or examples of what I am trying to do.
I simply can't believe that no one has attempted to take several occurrences of a string from an HTML file and bring them over to be sorted and output in a Java program. I would have thought that this type of thing would be common.
I mean, if a reader converts EVERYTHING into a string, then why can't I just search for a string of text referencing dynamichtml? Even if it doesn't bring back the variable resource, it should still bring back the resource name, shouldn't it? Or is that just an uneducated answer to my Java problem?
- 05-19-2010, 05:06 PM #15
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16
So you just want the <@dynamichtml blahblahblah@> bit?
Not the stuff after it to the <@end@> part?
- 05-19-2010, 05:10 PM #16
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Yep, I only need the name(s), and to sort them alphabetically once I have them.
I can't search for the names EXACTLY, as they are different for each HTML file; I'm simply working on the one at the moment. So I need to write some code that will extract the name AFTER dynamichtml and sort it. That's all :)
- 05-19-2010, 05:11 PM #17
Member
- Join Date
- May 2010
- Location
- Buckinghamshire
- Posts
- 77
- Rep Power
- 0
Dude, if you can help you're getting the biggest hi-5 of your life and definitely a gold star, possibly a chocolate biscuit and the title of absolute legend to go with it :)
- 05-19-2010, 05:13 PM #18
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16
OK, re-reading your OP...
Read the strings in.
You can use indexOf() (I think that's the method name) to see if the "<@dynamichtml" is in there, and store the index it returns.
(Someone might have a funkier regex for hunting these down, but I'm just doing a brute force thing)
So, you now know the index of the '<' character, so you can offset to the bit you're interested in, and run to the indexOf("@>"), assuming you don't have more than one of these things on a line.
Store that resulting string.
That is, use substring().
Here's the String api, which has all this stuff.
That'll allow you to populate your ArrayList.
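The brute-force steps above can be sketched roughly as follows. This is an illustration rather than a finished program: the class name DynamicHtmlNames is made up, the input is assumed to already be in a String, and the loop keeps searching from where the last match ended so it also copes with more than one marker on a line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DynamicHtmlNames {
    private static final String START = "<@dynamichtml";
    private static final String END = "@>";

    // Returns the names that follow "<@dynamichtml", sorted alphabetically.
    static List<String> extract(String text) {
        List<String> names = new ArrayList<String>();
        int from = 0;
        while (true) {
            int start = text.indexOf(START, from);   // index of the '<'
            if (start < 0) break;                     // no more markers
            int nameStart = start + START.length();   // offset past the marker
            int end = text.indexOf(END, nameStart);   // the closing "@>"
            if (end < 0) break;                       // unterminated marker: stop
            String name = text.substring(nameStart, end).trim();
            if (!name.isEmpty()) names.add(name);
            from = end + END.length();                // continue after this match
        }
        Collections.sort(names);                      // case-sensitive String order
        return names;
    }

    public static void main(String[] args) {
        String sample = "<@dynamichtml TimelineManager_top_links@>\n<div></div>\n<@end@>\n"
                      + "<@dynamichtml TimelineManager_footer@>\n<@end@>\n";
        for (String name : extract(sample)) {
            System.out.println(name);
        }
    }
}
```

Note that Collections.sort puts all uppercase letters before lowercase ones; pass String.CASE_INSENSITIVE_ORDER as a second argument if mixed-case names should sort together.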
- 05-19-2010, 05:22 PM #19
How does being in an HTML file concern the project? It looks like this is just a String search project.
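Treated purely as a string search, the "wildcard" idea maps onto a regular expression. A possible sketch, assuming resource names contain no whitespace (the class name DynamicHtmlRegex and the exact pattern are illustrative choices, not something from the thread):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DynamicHtmlRegex {
    // "<@dynamichtml", some whitespace, then the name captured up to the closing "@>"
    private static final Pattern TAG = Pattern.compile("<@dynamichtml\\s+(\\S+?)\\s*@>");

    // Returns every captured name, sorted alphabetically ignoring case.
    static List<String> extract(String text) {
        List<String> names = new ArrayList<String>();
        Matcher matcher = TAG.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group(1)); // group 1 is the name itself
        }
        Collections.sort(names, String.CASE_INSENSITIVE_ORDER);
        return names;
    }

    public static void main(String[] args) {
        String sample = "<@dynamichtml TimelineManager_quicksearch_form@>\n<div></div>\n<@end@>\n"
                      + "<@dynamichtml TimelineManager_footer@>\n<@end@>\n";
        for (String name : extract(sample)) {
            System.out.println(name);
        }
    }
}
```

The <@end@> markers are skipped automatically because they do not start with <@dynamichtml.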
- 05-19-2010, 05:27 PM #20
Moderator
- Join Date
- Apr 2009
- Posts
- 10,484
- Rep Power
- 16