  1. #1
    arvin, Member
    Join Date: Oct 2012

    Creating a profanity filter for user generated content

    Hi guys!

    I'm working on a Java project where I need to create a filter for uploaded names.
    The names are uploaded to the db from all over the world, so, for example, I don't want to blacklist a name like "Alassandra" just because it has "ass" in it.

    I would appreciate some guidance, pointers on where to look, or algorithm tips for building this kind of filter.

    BTW this is my first post here =)


  2. #2
    Lionlev, Senior Member
    Join Date: May 2012

    Re: Creating a profanity filter for user generated content

    Well, what I would do is this trick:
    make sure the names are separated by spaces, commas, or some other delimiter, and then blacklist a name only if it matches the whole word, e.g. ",ass,". Otherwise, don't blacklist it.
    WARNING: I am Russian, so it's possible that I won't understand you correctly...

  3. #3
    JosAH, Moderator
    Join Date: Sep 2008
    Location: Voorschoten, the Netherlands

    Re: Creating a profanity filter for user generated content

    Use a Set that contains the dirty words and use the contains( ... ) method.
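
    A minimal sketch of this suggestion, assuming the name is first split into whole tokens and each token is looked up in the Set. The class name, the method name, and the two-word blacklist are placeholders for illustration, not anything from an existing library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ExactMatchFilter {
    // Placeholder blacklist; a real one would be loaded from a file or the db.
    private static final Set<String> BLACKLIST =
            new HashSet<>(Arrays.asList("ass", "smurf"));

    /** Returns true if any whole token of the name is on the blacklist. */
    static boolean isBlocked(String name) {
        // Split on every run of non-letters (spaces, commas, digits, ...).
        String[] tokens = name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (String token : tokens) {
            if (BLACKLIST.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("Alassandra")); // false: no whole token matches
        System.out.println(isBlocked("Ass"));        // true
        System.out.println(isBlocked("Mr. Smurf"));  // true
    }
}
```

    Because Set.contains() compares whole tokens, a name that merely embeds a blacklisted word is not flagged. The flip side is that anything glued onto the word gets through as well.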

    kind regards,


  4. #4
    Tolls, Moderator
    Join Date: Apr 2009

    Re: Creating a profanity filter for user generated content

    contains() is a bit simplistic, though, as you would hit the Scunthorpe issue.
    At least I would have thought you would.
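
    To make the Scunthorpe issue concrete, here is a small sketch that compares a naive String.contains() substring test with a whole-word test. The class and method names are made up for the example, and the word-boundary regex is just one possible way to do the whole-word check.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ScunthorpeDemo {
    // Naive substring test: flags any text that merely contains the word.
    static boolean substringMatch(String text, String word) {
        return text.toLowerCase(Locale.ROOT).contains(word.toLowerCase(Locale.ROOT));
    }

    // Whole-word test: the word must sit between word boundaries.
    static boolean wholeWordMatch(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(substringMatch("Alassandra", "ass")); // true: false positive
        System.out.println(wholeWordMatch("Alassandra", "ass")); // false
        System.out.println(wholeWordMatch("what an ass", "ass")); // true
    }
}
```

    The substring version flags an innocent name, while the whole-word version only fires when the word stands alone.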

  5. #5
    voidmain, Member
    Join Date: Oct 2012

    Re: Creating a profanity filter for user generated content

    The problem with writing your own profanity filter is that it is quite difficult and, depending on how comprehensive you want it to be, could require months or years of coding time.

    I built CleanSpeak, which uses natural language processing rules and progressive searching capabilities to filter profanity. I am still modifying the filtering algorithm almost constantly to make it better and faster. CleanSpeak has about 3 developer-years of total time invested in it. Buying a solution might not be an option for you, but you should definitely do a build-vs-buy analysis before diving into code.

    Here are some things to consider when building a profanity filter (we use smurf for all our examples at Inversoft):

    • You can't use spaces as delimiters. People get around them easily with tricks like 's m u r fing'.
    • You must handle the classic embedding problems, such as 'Scunthorpe' and 'assume'.
    • Users often substitute other characters for letters, as in '$murf'.
    • Regular expressions are too slow to be effective if your blacklist is large (1000+ words and phrases, including conjugated verbs and nouns). Our blacklist has over 100,000 words, phrases, variations, conjugations, etc.
    • People often append words to profanity, as in 'smurfhead' or 'smurfface'.
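
    As a rough illustration of the first, third, and last points above, the sketch below normalizes the input before matching: it lower-cases the text, maps a few look-alike characters to letters, and drops everything that is not a letter. This is not how CleanSpeak works; the class name, the method names, and the tiny look-alike table are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class Normalizer {
    // Hypothetical look-alike table; a real one would be much larger.
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('$', 's');
        LOOKALIKES.put('@', 'a');
        LOOKALIKES.put('0', 'o');
        LOOKALIKES.put('1', 'i');
        LOOKALIKES.put('3', 'e');
    }

    /** Lower-cases the input, maps look-alikes to letters, drops all other non-letters. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            Character mapped = LOOKALIKES.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else if (Character.isLetter(c)) {
                out.append(c);
            }
            // Spaces, punctuation, and unmapped digits are skipped.
        }
        return out.toString();
    }

    /** Substring test on the normalized text; word is expected in lower case. */
    static boolean containsAfterNormalizing(String input, String word) {
        return normalize(input).contains(word);
    }

    public static void main(String[] args) {
        System.out.println(normalize("$murf"));                               // smurf
        System.out.println(normalize("s m u r fing"));                        // smurfing
        System.out.println(containsAfterNormalizing("smurfhead", "smurf"));   // true
    }
}
```

    Note the trade-off: matching substrings on normalized text catches the spacing, substitution, and suffix tricks, but it brings back the embedding problem for short words, so a real filter would also need something like an allow-list of innocent words. It also does nothing about the performance point.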

    You should also check out StackOverflow. I've posted responses to a lot of the profanity filtering questions there with additional tips and tricks. If you are in the market for a profanity filter, check out CleanSpeak. Otherwise, you'll want to think about all of the edge cases and what type of performance you need.
