RegEx Is there a solution here?



  • I have a blacklist similar to this:

    idiot(a|i|e)?
    This blocks the words idiot, idiota, idioti, and idiote.
    But some users bypass this check by replacing some letters with look-alike characters.
    Ex: idi0t4
    How can I write a regex list that blocks bypass attempts similar to this?



  • find what : idiot.



  • @Pan-Jan ,

    Fighting blacklisted words is a difficult task. You have to consider all the alternatives, including letters, numerals, and other Unicode symbols. There is no “magic” regex that will recognize them for you.

    For a generic hint, I suggest using character classes in square brackets [aie], not the groups-with-alternations (a|i|e). There is no reason to store the result in a group. It is easier to type the class without all the | separators.
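
    To illustrate the difference, here is a minimal sketch in Python (my choice for the demo; the thread itself is about Notepad++). Both patterns are the ones discussed above, and the variable names are only for illustration.

```python
import re

# Both patterns come from the thread; Python's re module is used only for illustration.
group_form = re.compile(r"idiot(a|i|e)?")
class_form = re.compile(r"idiot[aie]?")

words = ["idiot", "idiota", "idioti", "idiote"]

for word in words:
    # Both forms accept exactly the same words...
    assert group_form.fullmatch(word) and class_form.fullmatch(word)

# ...but only the group form stores the trailing vowel in a capture group.
print(group_form.fullmatch("idiota").groups())  # ('a',)
print(class_form.fullmatch("idiota").groups())  # ()
```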

    Ignoring Unicode (for now), a case-insensitive “idiot” regex might look something like (?i)[i|:!][d][i|:!][o0][t7][aie4]?, which says

    • case insensitive
    • things that look vaguely i-shaped
      • | here is used as a literal character, because it looks like a capital i, not because it’s being used for alternation… another good reason to avoid the group where | has special meaning,
    • things that look vaguely d-shaped
    • things that look vaguely i-shaped
    • things that look vaguely o-shaped
    • things that look vaguely t-shaped
    • 0 or 1 things that look vaguely vowel-ending
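
    As a quick check of that regex, here is a small sketch in Python (an assumption on my part; Notepad++ uses a different regex engine, though this particular pattern should behave the same way). The sample inputs are made up for the demo.

```python
import re

# Regex quoted from the post above; the sample inputs are invented for illustration.
pattern = re.compile(r"(?i)[i|:!][d][i|:!][o0][t7][aie4]?")

samples = ["idiot", "IDIOTA", "idi0t4", "!d|07", "iԁioti", "ideal"]

# Map each sample to True if the pattern is found anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for sample, hit in results.items():
    print(f"{sample!r}: {'blocked' if hit else 'passes'}")
```

    The ASCII substitutions are caught, but "iԁioti" (written with a Cyrillic look-alike for d) passes, which is why the Unicode alternatives have to be listed explicitly.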


  • But people have ideas.
    That should probably be enough.

    (?-i)[iіìílӏIƖІ][dԁDƊ][iіìílӏIƖІ][oοᴏOΟ0][tΤТƬ]
    


  • @Pan-Jan said in RegEx Is there a solution here?:

    But people have ideas.

    Indeed. I even had ideas, which I shared with you.

    (?ii)

    To me, it seems strange to turn off case-insensitive (with (?-i)), and then explicitly list both the lowercase and uppercase characters.

    If I were trying to accomplish this, but for some reason couldn’t do the filtering through a command-line script instead of doing it manually for each word inside Notepad++, I would at least use a command-line script to help write each individual regex: I’d have the script ask for a word, like idiot, and then it would do an internal lookup from each letter to the list of characters that I thought were similar. (For example, that mapping would be i => "[iіìílӏIƖІ]", d => "[dԁDƊ]", ... according to your similarity rules.)
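
    A minimal sketch of such a helper, in Python, might look like the following. The names LOOKALIKES and build_regex are hypothetical, the classes are copied from the regex posted earlier in the thread, and escaping any unmapped letter literally is my own assumption about what the fallback should be. The word is passed as a function argument rather than prompted for, to keep the sketch short.

```python
import re

# Look-alike classes copied from the regex posted earlier in the thread.
# The dictionary name and the fallback behaviour are illustrative assumptions.
LOOKALIKES = {
    "i": "[iіìílӏIƖІ]",
    "d": "[dԁDƊ]",
    "o": "[oοᴏOΟ0]",
    "t": "[tΤТƬ]",
}


def build_regex(word: str) -> str:
    """Replace each letter with its look-alike class; escape unmapped characters."""
    return "".join(LOOKALIKES.get(ch, re.escape(ch)) for ch in word.lower())


generated = build_regex("idiot")
print(generated)
```

    For idiot this produces the same five classes as the regex posted above, minus the (?-i) prefix. Extending the filter to another word then only requires adding the missing letters to the mapping.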

    Good luck with this. Spam filters have been trying for years.



  • idiota
    ɪdıota
    iɗioti
    ƖDI0T4
    idiơte
    iԁioti
    
    

    There has to be a solution.

    Could something like this work?
    [^ A-z1-9\n\r]|idiot



  • [^ A-z0-9\n\r]|idi[o0]t
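
    To see what this pattern actually flags, here is a small sketch in Python (an assumption; the thread is about Notepad++, whose search may treat case differently depending on the "Match case" setting). The first six words are the test words posted above, and the three controls are made up.

```python
import re

# Regex quoted from the post above. Note that the range A-z also covers
# the ASCII punctuation between 'Z' and 'a': [ \ ] ^ _ `
pattern = re.compile(r"[^ A-z0-9\n\r]|idi[o0]t")

# Test words posted earlier in the thread, plus three made-up controls.
thread_words = ["idiota", "ɪdıota", "iɗioti", "ƖDI0T4", "idiơte", "iԁioti"]
controls = ["hello world", "café", "IDI0T"]

# Map each word to True if the pattern is found anywhere in it.
flagged = {w: bool(pattern.search(w)) for w in thread_words + controls}

for word, hit in flagged.items():
    print(f"{word!r}: {'flagged' if hit else 'passes'}")
```

    All six test words are flagged, but so is "café", because the first alternative rejects every non-ASCII character. With Python's case-sensitive default, the all-ASCII "IDI0T" passes, so a case-insensitive flag would still be needed.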



  • I like the test word chosen here.

