RegEx Is there a solution here?
-
I have a blacklist similar to this:
idiot(a|i|e)?
This blocks the words idiot, idiota, idioti, and idiote.
But some users bypass this check by replacing some letters with look-alike characters.
Ex: idi0t4
How can I write a regex list that blocks bypass attempts similar to this?
-
find what :
idiot.
-
@Pan-Jan ,
Fighting blacklisted words is a difficult task. You have to consider all alternates, including letters, numerals, and other unicode symbols. There is no “magic” regex which will be able to recognize it for you.
For a generic hint, I suggest using character classes in square brackets, [aie], not groups with alternation, (a|i|e). There is no reason to store the result in a group, and it is easier to type the class without all the | separators.
Ignoring unicode (for now), a case-insensitive “idiot” regex might look something like
(?i)[i|:!][d][i|:!][o0][t7][aie4]?
which says:
- case insensitive
- things that look vaguely i-shaped (| here is used as a literal character, because it looks like a capital i, not because it’s being used for alternation… another good reason to avoid the group, where | has special meaning)
- things that look vaguely d-shaped
- things that look vaguely i-shaped
- things that look vaguely o-shaped
- things that look vaguely t-shaped
- 0 or 1 things that look vaguely vowel-ending
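As a quick check of the pattern above, here is a minimal Python sketch. It assumes Python's re module treats this simple pattern the same way Notepad++ does (plain character classes and a leading (?i) should behave alike), and the sample inputs are made up for illustration only.

```python
import re

# The case-insensitive pattern from the post, unchanged.
# Inside [...] the | is a literal character, not alternation.
IDIOT = re.compile(r"(?i)[i|:!][d][i|:!][o0][t7][aie4]?")

# Hypothetical sample inputs, chosen only for illustration.
samples = ["idiot", "IDIOTA", "idi0t4", "!d|07", "hello"]

for text in samples:
    m = IDIOT.search(text)
    print(text, "->", m.group(0) if m else "no match")
```

Only the last sample should come back as "no match"; the others are all caught by one class or another.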
-
But people have ideas.
That should probably be enough.
(?-i)[iіìílӏIƖІ][dԁDƊ][iіìílӏIƖІ][oοᴏOΟ0][tΤТƬ]
-
@Pan-Jan said in RegEx Is there a solution here?:
But people have ideas.
Indeed. I even had ideas, which I shared with you.
(?ii)
To me, it seems strange to turn off case-insensitivity (with (?-i)), and then explicitly list both the lowercase and uppercase characters.
If I were trying to accomplish this, but for some reason couldn’t do the filtering through a command-line script instead of doing it manually for each word inside Notepad++, I would at least use a command-line script to help write each individual regex: I’d have the script ask for a word, like idiot, and then it would do an internal lookup from each letter to the list of characters that I thought were similar. (For example, that mapping would be i => "[iіìílӏIƖІ]", d => "[dԁDƊ]", ... according to your similarity rules.)
Good luck with this. Spam filters have been trying for years.
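A rough sketch of such a helper script, in Python, might look like the following. The names LOOKALIKES and build_pattern are made up for this example, the four classes are copied from the regex posted earlier in the thread, and matching unmapped letters literally is just one possible assumption.

```python
import re

# Look-alike classes for i, d, o, t, copied from the regex posted earlier
# in the thread. Extending this table to other letters is up to you.
LOOKALIKES = {
    "i": "[iіìílӏIƖІ]",
    "d": "[dԁDƊ]",
    "o": "[oοᴏOΟ0]",
    "t": "[tΤТƬ]",
}

def build_pattern(word):
    """Return a regex string for `word`, one look-alike class per letter."""
    parts = []
    for ch in word.lower():
        # Letters without a table entry are matched literally (assumption).
        parts.append(LOOKALIKES.get(ch, re.escape(ch)))
    return "".join(parts)

# Example word; a real script could read it from the user instead.
print(build_pattern("idiot"))
```

The printed regex can then be pasted into the Find what box. Adding more look-alikes is a matter of editing one table entry rather than retyping every blacklisted word.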
-
idiota ɪdıota iɗioti ƖDI0T4 idiơte iԁioti
There has to be a solution.
Can this be so?
[^ A-z1-9\n\r]|idiot
-
[^ A-z0-9\n\r]|idi[o0]t
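To see what this pattern flags, here is a small Python sketch. It assumes Python's re module handles the pattern the same way Notepad++ does; the test words come from earlier in the thread, plus two made-up controls.

```python
import re

# The pattern from the post above, unchanged.
# Note: the ASCII range A-z also covers [ \ ] ^ _ and `, which sit between Z and a.
PATTERN = re.compile(r"[^ A-z0-9\n\r]|idi[o0]t")

# Test words from earlier in the thread, plus two made-up controls.
for text in ["idiota", "idi0t4", "ɪdıota", "hello world", "a_b"]:
    m = PATTERN.search(text)
    print(repr(text), "->", repr(m.group(0)) if m else "no match")
```

The first alternative flags any character outside the listed ASCII ranges, so it catches unicode look-alikes regardless of the word, but it would also flag ordinary accented text.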
-
I like the test word chosen here.