Cassandra: Build the index

by , 02-23-2012 at 07:27 PM (779 Views)
Once the data is ready, the next step is to store it in a column family. All the tags created by the tokenizer can be processed in this step. The tokenizer gives us a list of tags together with their document IDs. With this information, we can do the following:
Check the tags for duplication.
Write data to column family in Cassandra.
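
The duplication check in the first step could be sketched as follows. This is only a minimal illustration, not the code used in this entry: the class TagDeduper and the method uniqueTags are hypothetical names, and the sketch assumes that punctuation is simply dropped and that a tag only needs to be stored once per document.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not part of the original post): collects each tag once per document.
final class TagDeduper {

    // Returns the distinct lower-cased terms of a document, in first-seen order.
    static Set<String> uniqueTags(String doc) {
        // Assumption: non-alphanumeric characters are dropped, whitespace is kept.
        String cleaned = doc.replaceAll("[^a-zA-Z0-9\\s]", "").toLowerCase();
        Set<String> tags = new LinkedHashSet<>();
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) {
                tags.add(word); // a Set silently ignores a word it already holds
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [the, cat, saw, other]
        System.out.println(uniqueTags("The cat saw the other cat."));
    }
}
```

With the duplicates removed up front, each tag of a document results in at most one write to the column family.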

Java Code: This is the code to explain index building
private void tokenize(String doc, String docID) {

        // remove all non-alphanumeric characters (whitespace is kept)
        doc = doc.replaceAll("[^a-zA-Z0-9\\s]", "");
        doc = doc.toLowerCase(); // ensure everything is lower case
        String[] lTerms = doc.split("\\s+"); // split on runs of whitespace
        for (String word : lTerms) {
            if (word.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            /*
             * Add the word as the row key, the docID as the column value and a random number as the column name.
             * This is an inefficient way of doing it because it makes a trip to the DB for every word.
             * A better way would be to get all the docs associated with a word (row key),
             * then create all the columns and do a single batch operation on the DB.
             */
            SimpleClient.cassandraClient.addTag(word, ("" + Math.random()).replace("0.", ""), docID);
        }
}
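
The "better way" mentioned in the code comment, collecting everything first and writing once, might look roughly like the sketch below. It is an illustration under assumptions, not the client code used in this entry: BatchIndexer and BatchSink are hypothetical names, and BatchSink merely stands in for whatever batch-write call the Cassandra client in use provides.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: accumulate word -> docIDs in memory, then write in one batch.
final class BatchIndexer {

    // Stand-in for the real batch-write call of the Cassandra client (assumption).
    interface BatchSink {
        void writeBatch(Map<String, Set<String>> docIdsByWord);
    }

    // Pending rows: the word is the row key, the set holds the docIDs to store under it.
    private final Map<String, Set<String>> pending = new LinkedHashMap<>();

    // Tokenizes one document and records its words without touching the database.
    void add(String doc, String docID) {
        String cleaned = doc.replaceAll("[^a-zA-Z0-9\\s]", "").toLowerCase();
        for (String word : cleaned.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            pending.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(docID);
        }
    }

    // Hands everything collected so far to the sink in a single call, then resets.
    void flush(BatchSink sink) {
        if (pending.isEmpty()) {
            return; // nothing to write, so no round trip
        }
        sink.writeBatch(new LinkedHashMap<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        BatchIndexer indexer = new BatchIndexer();
        indexer.add("Red apple, red car", "doc1");
        indexer.add("Green apple", "doc2");
        // A real sink would turn each entry into columns of one batch mutation.
        indexer.flush(rows -> System.out.println(rows));
    }
}
```

The trade-off is memory for round trips: the pending map grows with the number of distinct words, so for a large corpus it would make sense to flush after every few documents rather than only once at the end.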

