RegEx Is there a solution here?

  • I have a blacklist similar to this:

    This blocks the words idiot, idiota, idioti, and idiote.
    But some users bypass this check by replacing some letters with other characters.
    Ex: idi0t4
    How can I make a regex list to block bypass attempts similar to this?

  • find what : idiot.

  • @Pan-Jan ,

    Fighting blacklisted words is a difficult task. You have to consider all alternates, including letters, numerals, and other unicode symbols. There is no “magic” regex which will be able to recognize it for you.

    For a generic hint, I suggest using character classes in square brackets [aie], not the groups-with-alternations (a|i|e). There is no reason to store the result in a group. It is easier to type the class without all the | separators.

    Ignoring unicode (for now), a case-insensitive “idiot” regex might look something like (?i)[i|:!][d][i|:!][o0][t7][aie4]?, which says

    • case insensitive
    • things that look vaguely i-shaped
      • | here is used as a literal character, because it looks like a capital i, not because it’s being used for alternation… another good reason to avoid the group where | has special meaning,
    • things that look vaguely d-shaped
    • things that look vaguely i-shaped
    • things that look vaguely o-shaped
    • things that look vaguely t-shaped
    • 0 or 1 things that look vaguely vowel-ending
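
To see how the character-class regex above behaves, here is a minimal sketch that runs it through Python's `re` module as a stand-in for the Notepad++ search engine (an assumption: the two engines agree on this simple syntax, but they are not identical in general). The helper name `find_hits` and the sample strings are made up for illustration.

```python
import re

# The regex from the post, unchanged. Inside [...] the | is a literal character.
PATTERN = re.compile(r"(?i)[i|:!][d][i|:!][o0][t7][aie4]?")

def find_hits(text):
    """Return every substring of `text` that the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(text)]

if __name__ == "__main__":
    for sample in ["you idi0t4!", "IDIOT", "!d|07e", "editor"]:
        print(repr(sample), find_hits(sample))
```

The first three samples are caught and "editor" is not. Any look-alike character that is missing from a class still slips through, which is the limitation described above.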

  • But people have ideas.
    That should probably be enough.


  • @Pan-Jan said in RegEx Is there a solution here?:

    But people have ideas.

    Indeed. I even had ideas, which I shared with you.


    To me, it seems strange to turn off case-insensitivity (with (?-i)) and then explicitly list both the lowercase and uppercase characters.

    If I were trying to accomplish this, but for some reason couldn’t do the filtering through a command-line script instead of doing it manually for each word inside Notepad++, I would at least use a command-line script to help write each individual regex: I’d have the script ask for a word, like idiot, and then it would do an internal lookup from each letter to the list of characters that I thought were similar. (For example, that mapping would be i => "[iіìílӏIƖІ]", d => "[dԁDƊ]", ... according to your similarity rules.)
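
The helper script described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `i` and `d` classes are the ones quoted in the post, while the `o` and `t` classes, the table name `LOOKALIKES`, and the function name `build_pattern` are hypothetical. The word is passed as an argument rather than asked for interactively, to keep the sketch short.

```python
import re

# Hypothetical similarity table. The "i" and "d" entries are quoted from the
# post; the "o" and "t" entries are assumptions added for illustration.
LOOKALIKES = {
    "i": "[iіìílӏIƖІ]",
    "d": "[dԁDƊ]",
    "o": "[oO0]",
    "t": "[tT7]",
}

def build_pattern(word, table=LOOKALIKES):
    """Return a regex string matching `word` and its look-alike spellings."""
    parts = []
    for ch in word.lower():
        # Letters without an entry fall back to the escaped literal character.
        parts.append(table.get(ch, re.escape(ch)))
    return "".join(parts)

if __name__ == "__main__":
    pattern = build_pattern("idiot")
    print(pattern)  # paste this into the "Find what" box
    for candidate in ["idiot", "IDI0T", "1diot"]:
        print(candidate, bool(re.fullmatch(pattern, candidate)))
```

With this table, "1diot" is not matched because "1" is not listed as an i look-alike; the quality of the result depends entirely on how complete the similarity table is.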

    Good luck with this. Spam filters have been trying for years.

  • idiota

    There has to be a solution.

    Can this be so?
    [^ A-z1-9\n\r]|idiot

  • [^ A-z0-9\n\r]|idi[o0]t
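
A quick way to check what the whitelist-plus-word pattern above actually flags is to try it on a few strings. The sketch below uses Python's `re` module as a stand-in for Notepad++ and assumes case-sensitive matching; `is_flagged` and the sample strings are hypothetical. One thing worth knowing is that the range A-z also covers the six ASCII characters between Z and a, namely [ \ ] ^ _ and the backtick.

```python
import re

# The pattern from the post above, unchanged.
FILTER = re.compile(r"[^ A-z0-9\n\r]|idi[o0]t")

def is_flagged(text):
    """True if the text contains a non-whitelisted character or the word."""
    return FILTER.search(text) is not None

if __name__ == "__main__":
    for sample in ["hello world", "idi0t", "ìdiot", "hello!", "[x]", "1d10t"]:
        print(repr(sample), is_flagged(sample))
```

In this sketch "ìdiot" is caught by the whitelist half and "idi0t" by the word half, but ordinary punctuation such as "!" is flagged too, and "1d10t" passes because it uses only whitelisted characters.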

  • I like the test word chosen here.

