- 10-17-2012, 12:15 PM #1
Creating a profanity filter for user generated content
Hi guys!
I'm working on a Java project where I need to create a filter for uploaded names.
The names are uploaded to the DB from all over the world, so, for example, I don't want to blacklist a name like "Alassandra", which has "ass" in it.
I would appreciate some guidance on where to look, or some algorithm tips for creating this type of filter.
BTW this is my first post here =)
Regards
Arvin
- 10-19-2012, 12:55 AM #2
Well, what I would do is this trick:
",John,ass,Mark,"
Make sure there are spaces, commas, or some other delimiter between the names; then if the name is ",ass," blacklist it, otherwise don't. WARNING: I am Russian, so it's possible that I won't understand you correctly...
- 10-19-2012, 08:32 AM #3
Use a Set that contains the dirty words and use the contains( ... ) method.
kind regards,
Jos
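To make the Set suggestion concrete, here is a minimal sketch. It assumes each uploaded name is checked as a whole string, so Set.contains( ... ) only matches exact entries. The class name ExactNameFilter and the blacklist entries are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ExactNameFilter {
    // Hypothetical blacklist; a real one would be loaded from a file or the DB.
    private static final Set<String> BLACKLIST =
            new HashSet<>(Arrays.asList("ass", "smurf"));

    // True only when the whole name (trimmed, lower-cased) is a blacklist entry.
    static boolean isBlocked(String name) {
        return BLACKLIST.contains(name.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("Ass"));        // true
        System.out.println(isBlocked("Alassandra")); // false
    }
}
```

Because Set.contains( ... ) is an exact lookup, "Alassandra" passes, but so would any name that merely embeds or decorates a blacklisted word.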
- 10-19-2012, 09:38 AM #4
contains() is a bit simplistic, though, as you would hit the Scunthorpe issue (a harmless word being flagged because it happens to contain a blacklisted substring).
At least I would have thought you would.
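A rough sketch of the difference, assuming the check is done with String.contains( ... ) on the text: a substring test flags "Alassandra", while splitting on non-letter characters and looking up whole tokens does not. The class name TokenNameFilter and the blacklist are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TokenNameFilter {
    // Hypothetical blacklist for illustration only.
    private static final Set<String> BLACKLIST =
            new HashSet<>(Arrays.asList("ass", "smurf"));

    // Naive substring test: this is the one that hits the Scunthorpe issue.
    static boolean containsSubstring(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String word : BLACKLIST) {
            if (lower.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Whole-token test: split on runs of non-letters, then look up each token.
    static boolean containsToken(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (BLACKLIST.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("Alassandra")); // true (false positive)
        System.out.println(containsToken("Alassandra"));     // false
        System.out.println(containsToken("John,ass,Mark"));  // true
    }
}
```

The token version avoids the false positive, but it also misses anything glued onto a blacklisted word, so it is a trade-off rather than a fix.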
- 10-19-2012, 08:28 PM #5
The problem with writing your own profanity filter is that it is quite difficult and, depending on how comprehensive you want to be, could require months or years of coding time.
I built CleanSpeak, which uses natural language processing rules and progressive searching capabilities to filter profanity. I am still modifying the filtering algorithm almost constantly to make it better and faster. CleanSpeak has about 3 developer years of total time invested in it. Buying a solution might not be an option for you, but you should definitely do a build-vs-buy analysis before diving into code.
Here are some things to consider when building a profanity filter (we use smurf for all our examples at Inversoft):
- You can't use spaces as delimiters. People get around them easily using tricks like 's m u r fing'
- You must handle the classic embedding problems like 'Scunthorpe' and 'assume'
- Users often use other characters for letters like '$murf'
- Regular expressions are too slow to be effective if your blacklist is large (1,000+ words and phrases, including conjugated verbs and nouns). Our blacklist has over 100,000 words, phrases, variations, conjugations, etc.
- People often add words to profanity like 'smurfhead' or 'smurfface'
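As a rough illustration of the spacing, character-substitution, and suffix points above, here is a small normalization sketch. It is not how CleanSpeak works; the class name NormalizingFilter, the substitution table, and the one-word blacklist are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class NormalizingFilter {
    // Hypothetical one-entry blacklist, using the same placeholder word as above.
    private static final List<String> BLACKLIST = Arrays.asList("smurf");

    // Lower-case, map a few look-alike characters to letters, drop everything
    // that is not a letter (spaces, punctuation, remaining digits).
    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case '$': out.append('s'); break;
                case '@': out.append('a'); break;
                case '0': out.append('o'); break;
                case '1': out.append('i'); break;
                case '3': out.append('e'); break;
                default:
                    if (Character.isLetter(c)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Substring search on the normalized text, so suffixed forms are caught too.
    static boolean looksProfane(String input) {
        String normalized = normalize(input);
        for (String word : BLACKLIST) {
            if (normalized.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(normalize("s m u r fing")); // smurfing
        System.out.println(looksProfane("$murf"));     // true
        System.out.println(looksProfane("smurfhead")); // true
    }
}
```

Note that removing separators and searching for substrings brings the embedding problem straight back for short blacklist entries, so a sketch like this would still need an exception list or smarter matching, and a linear scan over the blacklist will not scale to a list of 100,000 entries.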
You should also check out StackOverflow. I've posted responses to a lot of the profanity filtering questions there with additional tips and tricks. If you are in the market for a profanity filter, check out CleanSpeak. Otherwise, you'll want to think about all of the edge cases and what type of performance you need.