Re: Lucene indexing problem
You can always fake a root element <root> ... </root> and let the XML parser do the rest. The root element can be faked with an InputStream or Reader that wraps the original InputStream/Reader.
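A minimal sketch of that wrapping idea, using only JDK classes. The class name FakeRoot and the sample content are made up for illustration, and the sketch assumes the file is well-formed XML apart from the missing root and does not start with an <?xml ...?> declaration (which would not be legal after the fake opening tag).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FakeRoot {
    /** Wraps a rootless stream so a parser sees <root> ... </root> around it. */
    static InputStream withFakeRoot(InputStream original) {
        InputStream open = new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8));
        // SequenceInputStream reads the three streams one after another
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, original, close)));
    }

    /** Parses the wrapped stream and counts the <DOC> elements it contains. */
    static int countDocs(InputStream original) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(withFakeRoot(original));
        return dom.getElementsByTagName("DOC").getLength();
    }

    public static void main(String[] args) throws Exception {
        // hypothetical sample: two top-level DOC elements and no root
        String sample = "<DOC><DOCNO>1</DOCNO></DOC><DOC><DOCNO>2</DOCNO></DOC>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(countDocs(in) + " DOC elements");
    }
}
```

For a real file, a FileInputStream would take the place of the in-memory sample stream.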
kind regards,
Jos
Re: Lucene indexing problem
Sounds good, could I also use an HTML parser? I tried jsoup, but I could only figure out how to parse strings.
Re: Lucene indexing problem
Quote:
Originally Posted by Blacky777
Sounds good, could I also use an HTML parser? I tried jsoup, but I could only figure out how to parse strings.
You can use an HTML parser as much as you can use, say, a C++ parser for parsing Java ...
kind regards,
Jos
Re: Lucene indexing problem
Okay, I've used jsoup to extract the information from inside the angle brackets, but I have another question: how do I get it to parse a FileReader instead of a String? My parsing code is below.
EDIT: I altered the code as shown below. It now reads one of my document files, but there are many DOC elements within one file, and when I print (the last statement) I only get the first DOC.
Code:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestClass2
{
    public static void main(String args[]) throws IOException
    {
        File input = new File("WSJ_0402");
        Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
        // extracts the data from the <>'s (first() keeps only the first match)
        Element DOC = doc.select("DOC").first();
        Element DOCNO = doc.select("DOCNO").first();
        Element DOCID = doc.select("DOCID").first();
        Element HEADLINE = doc.select("HL").first();
        Element DATE = doc.select("DATE").first();
        Element SOURCE = doc.select("SO").first();
        Element COMPANY = doc.select("CO").first();
        Element INDUSTRY = doc.select("IN").first();
        Element INTRODUCTION = doc.select("LP").first();
        Element ARTICLE = doc.select("TEXT").first();
        // just changes the data inside the <>'s to a string
        String linkText = DOC.text();
        String linkText2 = DOCNO.text();
        String linkText3 = DOCID.text();
        String linkText4 = HEADLINE.text();
        String linkText5 = DATE.text();
        String linkText6 = SOURCE.text();
        String linkText7 = COMPANY.text();
        String linkText8 = INDUSTRY.text();
        String linkText9 = INTRODUCTION.text();
        String linkText10 = ARTICLE.text();
        System.out.println(linkText);
    }
}
Re: Lucene indexing problem
Quote:
Originally Posted by Blacky777
Okay, I've used jsoup to extract the information from inside the angle brackets, but I have another question: how do I get it to parse a FileReader instead of a String? My parsing code is below.
<snip>
I don't know jsoup; as a matter of fact, I don't like any soup. Why not use an XMLReader? (It's in the Java core set of classes.)
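A rough sketch of the XMLReader route, using only the SAX classes that ship with the JDK. The class name DocnoReader and the DOCNO values are invented for illustration, and the input is assumed to be well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DocnoReader {
    /** Collects the text of every <DOCNO> element as its own list entry. */
    static List<String> readDocnos(String xml) throws Exception {
        final List<String> docnos = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <DOCNO>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("DOCNO".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("DOCNO".equals(qName)) {
                    docnos.add(text.toString().trim());
                    text = null;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return docnos;
    }

    public static void main(String[] args) throws Exception {
        // hypothetical sample data
        String xml = "<root><DOC><DOCNO> WSJ-1 </DOCNO></DOC>"
                   + "<DOC><DOCNO> WSJ-2 </DOCNO></DOC></root>";
        System.out.println(readDocnos(xml)); // prints [WSJ-1, WSJ-2]
    }
}
```

Because the handler sees one element at a time, each value arrives separately instead of as one joined string.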
kind regards,
Jos
Re: Lucene indexing problem
Haha, not the biggest fan of soup either.
I can finally read all instances of each tag.
Now I just need to get the two classes to communicate. WSJ_0402 is the name of one of the HTML files.
Code:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestClass2
{
    public static void main(String args[]) throws IOException
    {
        File input = new File("WSJ_0402");
        Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
        // extracts the data from the <>'s (all matches for each tag)
        Elements DOC = doc.select("DOC");
        Elements DOCNO = doc.select("DOCNO");
        Elements DOCID = doc.select("DOCID");
        Elements HEADLINE = doc.select("HL");
        Elements DATE = doc.select("DATE");
        Elements SOURCE = doc.select("SO");
        Elements COMPANY = doc.select("CO");
        Elements INDUSTRY = doc.select("IN");
        Elements INTRODUCTION = doc.select("LP");
        Elements ARTICLE = doc.select("TEXT");
        // just changes the data inside the <>'s to a string
        // (text() on Elements joins the text of all matches)
        String linkText = DOC.text();
        String linkText2 = DOCNO.text();
        String linkText3 = DOCID.text();
        String linkText4 = HEADLINE.text();
        String linkText5 = DATE.text();
        String linkText6 = SOURCE.text();
        String linkText7 = COMPANY.text();
        String linkText8 = INDUSTRY.text();
        String linkText9 = INTRODUCTION.text();
        String linkText10 = ARTICLE.text();
    }
}
EDIT: One problem I face is that when I try to add a string to a field, it adds all instances as one entry. For example, I try to add dates with:
Code:
ind.add(new StringField("DATE", linkText5, null));
Luke (a .jar that displays my index) shows that DATE has one entry: 04/02/90 04/02/90 05/02/90......... Is there a way to separate the terms? I know .first() gets the first term, but is there a way to 'pop' this term instead of 'peeking'?
Re: Lucene indexing problem
Sorry, I'm bailing out here: I don't know jsoup and I don't know Luke (is that also some kind of soup?)
kind regards,
Jos
Re: Lucene indexing problem
Thanks for your assistance so far, JosAH.
I've managed to index the terms individually; the code is shown below.
Code:
package org.apache.lucene;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.*;
import java.nio.CharBuffer;
import java.util.ArrayList;
/**
 * This terminal application creates an Apache Lucene index in a folder and adds files to this index
 * based on the user's input.
 */
public class TextFileIndexer {
    private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    private IndexWriter writer;
    private ArrayList<File> queue = new ArrayList<File>();

    public static void main(String[] args) throws IOException {
        System.out.println("Enter the path where the index will be created");
        String indexLocation = null;
        BufferedReader br = new BufferedReader(
                new InputStreamReader(System.in));
        String s = br.readLine();
        TextFileIndexer indexer = null;
        try {
            indexLocation = s;
            indexer = new TextFileIndexer(s);
        } catch (Exception ex) {
            System.out.println("Cannot create index..." + ex.getMessage());
            System.exit(-1);
        }
        //===================================================
        // read input from the user until they enter q to quit
        //===================================================
        while (!s.equalsIgnoreCase("q")) {
            try {
                System.out.println("Enter the full path to add into the index");
                System.out.println("[Acceptable file types: .xml, .html, .txt]");
                s = br.readLine();
                if (s.equalsIgnoreCase("q")) {
                    break;
                }
                // try to add file into the index
                indexer.indexFileOrDirectory(s);
            } catch (Exception e) {
                System.out.println("Error indexing " + s + " : " + e.getMessage());
            }
        }
        //===================================================
        // call closeIndex, otherwise the index is not created
        //===================================================
        indexer.closeIndex();
        //=========================================================
        // Now search
        //=========================================================
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexLocation)));
        IndexSearcher searcher = new IndexSearcher(reader);
        s = "";
        while (!s.equalsIgnoreCase("q")) {
            try {
                System.out.println("Enter the search query (q=quit):");
                s = br.readLine();
                if (s.equalsIgnoreCase("q")) {
                    break;
                }
                Query q = new QueryParser(Version.LUCENE_40, "contents", analyzer).parse(s);
                // use a fresh collector per query so hits from earlier searches do not accumulate
                TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                // display results
                System.out.println("Found " + hits.length + " hits.");
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    System.out.println((i + 1) + ". " + d.get("path") + " score=" + hits[i].score);
                }
            } catch (Exception e) {
                System.out.println("Error searching " + s + " : " + e.getMessage());
            }
        }
    }
    /**
     * Constructor
     * @param indexDir the name of the folder in which the index should be created
     * @throws java.io.IOException when exception creating index.
     */
    TextFileIndexer(String indexDir) throws IOException {
        // With the default IndexWriterConfig open mode (CREATE_OR_APPEND), an index
        // that already exists in indexDir is appended to rather than overwritten.
        FSDirectory dir = FSDirectory.open(new File(indexDir));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        writer = new IndexWriter(dir, config);
    }
    /**
     * Indexes a file or directory
     * @param fileName the name of a text file or a folder we wish to add to the index
     * @throws java.io.IOException when exception
     */
    public void indexFileOrDirectory(String fileName) throws IOException {
        //===================================================
        // gets the list of files in a folder (if the user has submitted
        // the name of a folder) or gets a single file name (if the user
        // has submitted only the file name)
        //===================================================
        addFiles(new File(fileName));
        int originalNumDocs = writer.numDocs();
        for (File f : queue) {
            FileReader fr = null;
            try {
                Document ind = new Document();
                //===================================================
                // add contents of file
                //===================================================
                fr = new FileReader(f);
                File input = new File("WSJ_0402"); //only 1 file right now
                org.jsoup.nodes.Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
                // extracts the data from the <>'s
                Elements DOC = doc.select("DOC");
                Elements DOCNO = doc.select("DOCNO");
                Elements DOCID = doc.select("DOCID");
                Elements HEADLINE = doc.select("HL");
                Elements DATE = doc.select("DATE");
                Elements SOURCE = doc.select("SO");
                Elements COMPANY = doc.select("CO");
                Elements INDUSTRY = doc.select("IN");
                Elements INTRODUCTION = doc.select("LP");
                Elements ARTICLE = doc.select("TEXT");
                // changes the data inside the <>'s to a string and adds the data to
                // the relevant field for each instance of DOC
                for (int i = 0; i < DOC.size(); i++) {
                    String linkText2 = DOCNO.get(i).toString();
                    ind.add(new StringField("DOCNO", linkText2, null));
                    String linkText3 = DOCID.get(i).toString();
                    ind.add(new StringField("DOCID", linkText3, null));
                    String linkText4 = HEADLINE.get(i).toString();
                    ind.add(new StringField("HEADLINE", linkText4, null));
                    String linkText5 = DATE.get(i).toString();
                    ind.add(new StringField("DATE", linkText5, null));
                    String linkText6 = SOURCE.get(i).toString();
                    ind.add(new StringField("SOURCE", linkText6, null));
                    String linkText7 = COMPANY.get(i).toString();
                    ind.add(new StringField("COMPANY", linkText7, null));
                    String linkText8 = INDUSTRY.get(i).toString();
                    ind.add(new StringField("INDUSTRY", linkText8, null));
                    String linkText9 = INTRODUCTION.get(i).toString();
                    ind.add(new StringField("INTRODUCTION", linkText9, null));
                    String linkText10 = ARTICLE.get(i).toString();
                    ind.add(new StringField("ARTICLE", linkText10, null));
                    writer.addDocument(ind);
                }
                System.out.println("Added: " + f);
            } catch (Exception e) {
                System.out.println("Could not add: " + f);
            } finally {
                // fr is still null if the FileReader could not be opened
                if (fr != null) {
                    fr.close();
                }
            }
        }
        int newNumDocs = writer.numDocs();
        System.out.println("");
        System.out.println("************************");
        // numDocs() counts documents, not terms
        System.out.println((newNumDocs - originalNumDocs) + " documents added.");
        System.out.println("************************");
        queue.clear();
    }
    private void addFiles(File file) {
        if (!file.exists()) {
            System.out.println(file + " does not exist.");
            return; // nothing to queue
        }
        if (file.isDirectory()) {
            for (File f : file.listFiles()) {
                addFiles(f);
            }
        } else {
            String filename = file.getName().toLowerCase();
            //===================================================
            // Queue files by extension. Note: endsWith("") is always true,
            // so every file is queued (WSJ_0402 has no extension) and the
            // "Skipped" branch is never reached.
            //===================================================
            if (filename.endsWith("") || filename.endsWith(".html") ||
                    filename.endsWith(".xml") || filename.endsWith(".txt")) {
                queue.add(file);
            } else {
                System.out.println("Skipped " + filename);
            }
        }
    }
    /**
     * Close the index.
     * @throws java.io.IOException when exception closing
     */
    public void closeIndex() throws IOException {
        writer.close();
    }
}
The problem, though, is that this adds nearly 20,000 documents (the same as the total number of field entries) when there should be 194. Some have missing fields, which adds to the problem.
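A few things in the posted loop could plausibly contribute to this. The hard-coded WSJ_0402 file is parsed once for every queued file, a single Document object is shared by every addDocument call so fields pile up from one DOC to the next, and the parallel DOCNO.get(i) / HL.get(i) lookups drift out of step as soon as one DOC lacks a tag. The sketch below is not the Lucene/jsoup code itself: it uses the JDK DOM parser in place of jsoup and a Map in place of a Lucene Document, with a made-up class name and sample data, only to show the shape of "one fresh record per DOC, with each tag looked up inside that DOC".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PerDocFields {
    // a subset of the tags from the thread, kept short for the example
    static final String[] TAGS = {"DOCNO", "HL", "DATE"};

    /** Builds one fresh record per <DOC>, looking each tag up inside that DOC only. */
    static List<Map<String, String>> extract(String xml) throws Exception {
        NodeList docs = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName("DOC");
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            Map<String, String> record = new LinkedHashMap<>(); // new record each time
            for (String tag : TAGS) {
                NodeList hits = doc.getElementsByTagName(tag);
                if (hits.getLength() > 0) { // a missing tag is simply skipped
                    record.put(tag, hits.item(0).getTextContent().trim());
                }
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        // hypothetical sample: the second DOC has no HL element
        String xml = "<root>"
                + "<DOC><DOCNO>1</DOCNO><HL>First</HL><DATE>04/02/90</DATE></DOC>"
                + "<DOC><DOCNO>2</DOCNO><DATE>05/02/90</DATE></DOC>"
                + "</root>";
        for (Map<String, String> record : extract(xml)) {
            System.out.println(record);
        }
    }
}
```

In the indexer, the equivalent change would be to create the Lucene Document inside the per-DOC loop and to parse the queued file f rather than a fixed file name.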
Re: Lucene indexing problem
Can somebody tell me what the null value refers to in this statement?
Code:
ind.add(new StringField("DOCID", linkText3, null))
The signature says Store stored, but I'm not sure what this means. Is this maybe why my stored fields are not connected to one another as an individual document?