Preg_replace whole word only – PHP

  php

Q(Question):

I'm trying to make a naughty word filter. It removes bad words fine, but words that merely contain a bad word, like "assist" and "asses", get caught in the filter as well. Strangely, though, if the sentence is "My asses to assist me." the clean version reads "My asses to ***ist me." It seems to let the first occurrence inside another word through, but then blocks the rest. Any ideas? My script is below. Thanks.


function cleanWords($value) {
    /* strip naughty words */
    $bad_word_file = 'standards/badwords.txt';
    $strtofile = fopen($bad_word_file, "r");
    $badwords = explode("\n", fread($strtofile, filesize($bad_word_file)));
    fclose($strtofile);
    for ($i = 0; $i < count($badwords); $i++) {
        $wordlist .= str_replace(chr(13),'',$badwords[$i]).'|';
    }
    $wordlist = substr($wordlist,0,-1);
    $value = preg_replace("/\b($wordlist)\b/ie", 'preg_replace("/./","*","\\1")', $value);
    return $value;
}

A(Answer):

Hey.

If you print the $wordlist, does it look right?
I tested this by just creating the $wordlist manually and it seemed to work fine.

A(Answer):

Yes, $wordlist is correct. If it helps, the word list is just over 1,000 words.

A(Answer):

Use the whitespace character with "or" conditions:

(\s|^)(badword1|badword2)(\s|$)

That checks for either whitespace before the word or the start of the text, then for either whitespace after the word or the end of the text.
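
One caveat with the whitespace approach is that punctuation next to a word prevents a match, and the matched whitespace is consumed, so a second bad word directly after the first can be missed. The sketch below compares it with the zero-width `\b` form; the word list and sample sentence are made up for illustration.

```php
<?php
// Hypothetical word list and text, for illustration only.
$words = 'darn|heck';
$text  = 'Heck, that darn heck thing.';

// Whitespace-delimited version: the surrounding whitespace is part of the
// match, so it has to be put back with $1 and $3 in the replacement.
$spaced = preg_replace("/(\s|^)($words)(\s|$)/i", '$1****$3', $text);

// Word-boundary version: \b is zero-width, so nothing around the word is
// consumed and punctuation counts as a boundary.
$bounded = preg_replace("/\b($words)\b/i", '****', $text);

echo $spaced, "\n";
echo $bounded, "\n";
```

Here the whitespace version only masks "darn": "Heck," is followed by a comma, and the space before the second "heck" was already consumed by the previous match.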

A(Answer):

I ended up finding that the word "a.s.s." was in my list. I think the dots were messing up the expression. For those interested, this is my new code. Thanks for all the suggestions that helped get it where it is.


$_SESSION[wordlist] = join("|", array_map('trim', file('standards/badwords.txt')));
function cleanWords($value) {
    global $_SESSION;
    $value = preg_replace("/\b($_SESSION[wordlist])\b/ie", 'str_repeat("*", strlen("\\1")) ', $value);
    return $value;
}
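
A note on the dots: in a regular expression an unescaped "." matches any character, so a list entry containing dots can match unrelated text. One way to keep such entries literal is to pass every word through preg_quote() before joining them. The sketch below also uses preg_replace_callback() rather than the /e modifier, which was removed in PHP 7.0. The function names and the sample list are hypothetical, and the lookarounds are a swapped-in alternative to `\b`, chosen because `\b` cannot match after an entry that ends in punctuation.

```php
<?php
// Build one pattern from a list of words, escaping regex metacharacters.
// Function names and the sample list are hypothetical.
function buildBadWordPattern(array $words): string
{
    $escaped = [];
    foreach ($words as $word) {
        $word = trim($word);
        if ($word !== '') {
            // The second argument also escapes the "/" delimiter.
            $escaped[] = preg_quote($word, '/');
        }
    }
    // (?<!\w) and (?!\w) behave like \b for ordinary words, but also work
    // for entries that start or end with punctuation, such as "d.a.r.n.".
    return '/(?<!\w)(' . implode('|', $escaped) . ')(?!\w)/i';
}

// Replace every match with asterisks of the same length.
function maskBadWords(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        return str_repeat('*', strlen($m[0]));
    }, $text);
}

$pattern = buildBadWordPattern(['darn', 'd.a.r.n.', '']);
echo maskBadWords('Darn it, d.a.r.n. it, but darning is fine.', $pattern), "\n";
```

Blank lines in the list are skipped so they do not turn into empty alternatives. The sketch assumes the list contains at least one non-empty word.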

A(Answer):

Hey.
Glad you got it working.

However, I would consider using a different method. Putting the whole thing into the session is very inefficient. The list is the same for every user and rarely changes (if ever), right? If so, compiling it for every user and storing a copy in each session does just two things: it eats up resources and clutters the sessions with duplicate data.

You would be far better off compiling the regular expression into a common file, shared between all users. This is how I would do it. (I wouldn’t usually post a ready-to-use code example, but since you already solved this on your own…)

<?php
define("BADWORDS_RAW_FILE", "/path/to/badwords.txt");
define("BADWORDS_EXP_FILE", "/path/to/badwords_expression.txt");
/**
 * Returns a regular expression that can be used to check
 * for "bad" words. Returns an expression in the format:
 * - /\b(list|of|bad|words)\b/i
 */
function getBadWordsRegexp()
{
    $regexp = "";
    // Try to fetch an existing expression.
    if(!file_exists(BADWORDS_EXP_FILE) ||
        filesize(BADWORDS_EXP_FILE) <= 0 ||
        ($regexp = file_get_contents(BADWORDS_EXP_FILE)) === false)
    {
        // Make sure the raw word list exists
        if(!file_exists(BADWORDS_RAW_FILE)) {
            trigger_error("The bad words file does not exist.", E_USER_ERROR);
            return false;
        }
        // Compile the regular expression
        $regexp = '/\b(' . join("|", array_map('trim', file(BADWORDS_RAW_FILE))) . ')\b/i';
        // Try to save it. is_writable() returns false for a file that does
        // not exist yet, so only check it when the file is already there.
        if((file_exists(BADWORDS_EXP_FILE) && !is_writable(BADWORDS_EXP_FILE)) ||
            file_put_contents(BADWORDS_EXP_FILE, $regexp) === false)
        {
            trigger_error("Could not save badwords expression. Check file permissions.", E_USER_WARNING);
        }
    }
    // Return it
    return $regexp;
}
?>

Then you could use it like:

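<?php
function cleanWords($value) {
    $regexp = getBadWordsRegexp();
    // The /e modifier was removed in PHP 7.0, so mask each match in a callback.
    return preg_replace_callback($regexp, function ($match) {
        return str_repeat("*", strlen($match[1]));
    }, $value);
}
?>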

P.S.
I have a couple of notes on your code, though. You don’t need to import $_SESSION into functions using the global keyword. $_SESSION is a "super-global", which makes it available to you wherever you are in your code.
All strings need to be quoted. That includes array keys, which means that:

// This
$_SESSION[wordlist];
// Should be
$_SESSION['wordlist'];

If you leave the quotes out, PHP assumes it is a constant. Failing to find a constant, older PHP versions print a notice or warning and use it as a string (which is why it works, even though it is technically an error); PHP 8 throws an Error instead. For future-compatibility and performance reasons (minor as they may be), it is best to just remember to quote the strings.
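
The one place an unquoted key is valid is inside a double-quoted string, which is why the pattern in the preg_replace() call above was not affected. A small sketch of the three forms, using a hypothetical array in place of $_SESSION:

```php
<?php
// Hypothetical array standing in for $_SESSION.
$data = ['wordlist' => 'darn|heck'];

// Outside a string, always quote the key.
$plain = $data['wordlist'];

// Inside a double-quoted string, the simple syntax takes an unquoted key...
$simple = "/\b($data[wordlist])\b/i";

// ...while the curly-brace syntax takes the usual quoted key.
$curly = "/\b({$data['wordlist']})\b/i";

var_dump($simple === $curly);
```

Both interpolated strings come out identical, so the choice between them is a matter of style; the curly-brace form keeps the key quoting consistent with code outside strings.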

A(Answer):

Thanks, Atli! Your suggestions are much appreciated.
