REGEX Exceptions - Match everything on a line EXCEPT these words



  • Hello, I don’t know the formula for exceptions, so I don’t know how to match some words EXCEPT certain other words. For example:

    • A housewife is a woman whose occupation is running or managing her family’s home—caring … A housewife may also be called a stay-at-home mother and a male homemaker may also be called a stay-at-home father or househusband

    I need to match all words, EXCEPT these two: “homemaker” and “family’s”.

    I tried a minus sign, something like this: (^.*)-?(homemaker|family's)(.*)$ but it does not work well.



  • Hi Vasile and All,

    First of all, we must try to be as accurate as possible in our requests :

    For instance, let’s suppose the simple text, below :

    The man's hat isn't on the coat-stand that I saw in the corridor 
    

    Which words do you want, Vasile, to be considered as words ?

    • The individual words The, hat, on, the, that, saw, in, the and corridor, of course

    • The pronoun I, I presume

    • The two words man and s ( of the English possessive case ) OR the word man only, or neither of them ?

    • The two words isn and t ( English verbal form ) OR the string isn only, or neither of them ?

    • The two individual words coat and stand OR the group coat-stand ?

    In my regex, below, I consider :

    • The strings man and isn, as well as the pronoun I, as true words

    • The words separated by a hyphen, like coat-stand, as a single compound word
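
To make these questions concrete, here is a minimal Python sketch (Python's re module is an assumption of this example, it is not the engine used in the thread) showing how two common word patterns split the sample sentence. The variable names are purely illustrative.

```python
import re

sentence = "The man's hat isn't on the coat-stand that I saw in the corridor"

# Word characters only: apostrophes and hyphens split the text,
# so "man's" gives "man" + "s" and "coat-stand" gives "coat" + "stand".
plain = re.findall(r"\b\w+\b", sentence)

# Hyphen allowed inside a word: "coat-stand" stays in one piece,
# but the apostrophe still separates "man" from "s" and "isn" from "t".
hyphenated = re.findall(r"\b[\w-]+\b", sentence)

print(plain)
print(hyphenated)
```

Neither pattern, on its own, drops the stray s and t, which is why the definition of a word has to be settled first.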


    First of all, studying your example, I found out that some characters, such as the classic hyphen character ( \x2d ) and the apostrophe ( \x27 ), were changed by the NodeBB interface to Unicode equivalents.

    So I suppose that your exact example should be :

    A housewife is a woman whose occupation is running or managing her family's home-caring …
    A housewife may also be called a stay-at-home mother and a male homemaker may also be called a stay-at-home father or househusband
    

    I just added a line break after the horizontal ellipsis character ( \x{2026} ), to make the lines shorter !


    • As a reminder, the simple regex to match words is, basically, \b\w+\b

    • Starting with this simple regex, I then tried the regex \b[\w-]+\b to treat compound words joined by a hyphen as single words

    • And, to prevent the match of the two expressions family's and homemaker, I added a negative look-ahead : \b(?!homemaker|family's)[\w-]+\b

    Note :

    At each word boundary where the regex engine starts a match attempt, the negative look-ahead verifies that neither the expression homemaker NOR family's begins at that position. If one of these expressions does, the attempt fails there and the regex engine moves on, so that expression is skipped !

    So, nice ! As expected, the word homemaker is NOT taken into account. However, in the family's expression, it still matches the single letter s
    ( Hmm…, not what we want ! )

    • Thus, we must get rid of that single letter s, preceded by an apostrophe and followed by a non-word character. To do so, we just have to add the form (?<=')[\l\u]\b, inside the negative look-ahead, as another alternative to verify !

    Therefore, we get our final regex, below, in which the single-letter test [\l\u] accepts either case, by construction :-))

    \b(?!homemaker|family's|(?<=')[\l\u]\b)[\w-]+\b
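
The two steps above can be tried out with a short Python sketch. It is only an approximation under stated assumptions: Python's re module has no \l and \u classes, so [A-Za-z] stands in for [\l\u], and the variable names are illustrative.

```python
import re

text = (
    "A housewife is a woman whose occupation is running or managing "
    "her family's home-caring \u2026\n"
    "A housewife may also be called a stay-at-home mother and a male "
    "homemaker may also be called a stay-at-home father or househusband"
)

# Step 1: look-ahead only. "homemaker" and "family's" are rejected at the
# word start, but the lone "s" after the apostrophe still matches.
step1 = re.compile(r"\b(?!homemaker|family's)[\w-]+\b")

# Step 2: also reject a single letter that directly follows an apostrophe.
# [A-Za-z] is a stand-in (assumption) for the [\l\u] class of the thread.
final = re.compile(r"\b(?!homemaker|family's|(?<=')[A-Za-z]\b)[\w-]+\b")

words_step1 = step1.findall(text)
words_final = final.findall(text)

print("s" in words_step1)   # the stray letter is still matched in step 1
print("s" in words_final)   # and no longer matched with the final regex
```

Note that the look-behind sits inside the look-ahead: it is evaluated at the position where the word would start, so it can see the apostrophe just before the letter.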

    Best Regards,

    guy038



  • Hello again, guy038. Your regex seems fine, thank you.

    You really are a guru on regex !



  • One question, guy038… Your regex works for those 2 words, but if I try “Replace All”, it replaces the whole text in the document except those 2 words, even on lines that do not have those words.

    Your regex is good, but what if I want Replace All to affect only the lines that contain those 2 words, not the entire text ?

    I added ^.* at the beginning of your regex, and it is not good.



  • Hi, Vasile,

    As you said in your first post :

    I need to match all words, EXCEPT these two: “homemaker” and “family’s”.

    I thought that you just wanted to do a simple search, without any replacement.

    So, finally, from your last post, not only do you want to perform a replacement but also, if I fully understood you, you would like this replacement to occur, EXCLUSIVELY, in lines which contain either the word homemaker or the expression family's or both expressions. Am I right about it ?

    Secondly, Vasile, when a S/R should occur, what are you looking for, exactly ( single words, I presume, or something else ? ) and with which expression do you want to replace the different search matches ?

    See you later,

    guy038



  • Hello guy038. I want to match only the lines that contain “Word_1” and “Word_2”, and to Replace or Delete everything on those lines except those two words.

    Your regex from yesterday works for those 2 words, but on Replace it deletes everything in the entire file, not only the content of the lines with those 2 words.



  • Hi, Vasile,

    Starting with what you said, in your first post :

    I need to match all words, EXCEPT these two: “homemaker” and “family’s”.

    I imagined a test text with your previous example text ( Line 1 and 2 ), followed by lines 3 to 8, where :

    • Lines 3 and 6 do NOT contain the word homemaker, NOR the expression family's

    • Line 4 contains both special expressions homemaker and family's

    • The expression family's is located at the beginning of line 5

    • The word homemaker is located at the end of line 7

    • The last line 8 contains several occurrences of each of the two expressions homemaker and family's

    So, we’ll perform the following S/R on the original text below :

    A housewife is a woman whose occupation is running or managing her family's home-caring …
    A housewife may also be called a stay-at-home mother and a male homemaker may also be called a stay-at-home father or househusband
    Line 3, without the two words
    This is a homemaker test to see if family's all is OK
    Family's this is a test to see if homemaker all is OK
    Line 6, without the two words, too !
    This is a test to see if all is OK homemaker
    This homemaker is family's a homemaker test to family's see if family's all homemaker is OK
    

    Maybe a shorter regex is possible, but I ended up with the following S/R :

    SEARCH (?i)^((?!homemaker|family's).)+$|.*?(homemaker|family's)|.+

    REPLACE (?1$0)(?2\2|)

    NOTES :

    • I added the (?i) in-line modifier, as I presumed that the special words must be found whatever their case. Just replace it with (?-i), if you prefer a case-sensitive search !

    • Then the regex consists of the choice between 3 alternatives :

      • The regex ^((?!homemaker|family's).)+$, which matches all the standard characters of lines which do NOT contain ANY special word

      • The regex .*?(homemaker|family's), which matches ANY text up to and including the NEAREST special word

      • The regex .+, which matches all the remaining characters, after the last special word, of the current line

    • Additional explanation of the first alternative :

      • I wanted to match all the standard characters of lines which do NOT contain any special word. So, I started with the simple regex ^.+$ and then added a negative look-ahead (?!homemaker|family's), which verifies that, at ANY position, NO special word can be found, from the very beginning to the end of that line !

      • To force the regex engine to perform the negative look-ahead at each position reached, I enclosed this look-ahead, followed by the dot symbol, inside a parenthesized group : ((?!homemaker|family's).)

      • Then this verification is repeated till the end of the line, and all the characters of the line are taken into account, with the +$ form

    • When the first searched alternative is matched, the group 1 ((?!homemaker|family's).) exists. So, in replacement, the expression (?1$0) rewrites all the contents of the line, WITHOUT any modification

    • When the second searched alternative is matched, the group 2 (homemaker|family's) represents ONE of the special words. So the replacement expression (?2\2|) just writes this special word, followed by a | separator character

    • Finally, when the third searched alternative, without any group, is matched ( all the remaining characters after the last special word ), nothing is written in replacement, so this remaining text is simply deleted !
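
The behaviour described in these notes can be imitated in Python, as a rough sketch only. Python's re module does not support the (?1…) conditional replacement syntax, so a callback function plays that role here; the names SPECIAL, SEARCH and rewrite are illustrative assumptions, and the sample reuses lines 3, 4, 5 and 7 of the test text.

```python
import re

SPECIAL = r"homemaker|family's"

# Same three alternatives as the SEARCH regex; re.MULTILINE makes ^ and $
# work per line, and re.IGNORECASE plays the role of (?i).
SEARCH = re.compile(
    rf"^((?!{SPECIAL}).)+$|.*?({SPECIAL})|.+",
    re.IGNORECASE | re.MULTILINE,
)

def rewrite(m):
    if m.group(1) is not None:      # alternative 1: line without special word
        return m.group(0)           # like (?1$0): keep the line unchanged
    if m.group(2) is not None:      # alternative 2: text up to a special word
        return m.group(2) + "|"     # like (?2\2|): keep the word, add "|"
    return ""                       # alternative 3: trailing text is deleted

sample = "\n".join([
    "Line 3, without the two words",
    "This is a homemaker test to see if family's all is OK",
    "Family's this is a test to see if homemaker all is OK",
    "This is a test to see if all is OK homemaker",
])

result = SEARCH.sub(rewrite, sample)
print(result)
```

The callback makes the three cases explicit: group 1 set, group 2 set, or neither, which is exactly what the two conditional blocks of the REPLACE field test.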

    After performing that S/R, we get the final text, below :

    family's|
    homemaker|
    Line 3, without the two words
    homemaker|family's|
    Family's|homemaker|
    Line 6, without the two words, too !
    homemaker|
    homemaker|family's|homemaker|family's|family's|homemaker|
    

    I hope, Vasile, that it’s exactly what you need !

    Best Regards,

    guy038



  • Hi, guy038. No, this is not exactly what I need. It is MORE THAN I NEED !

    THANKS A LOT !!



  • Hi, Vasile,

    I forgot to tell you, in my previous post, that you may, of course, add as many special words as you want to ! Just add these words as further alternatives to the two special words of my previous regex.

    For instance, if, in addition to homemaker and family's, you decide that the word test and the verb see are special expressions, too, just replace the search regex of the S/R with the regex below :

    SEARCH (?i)^((?!homemaker|family's|test|see).)+$|.*?(homemaker|family's|test|see)|.+

    REPLACE (?1$0)(?2\2|) ( Unchanged )

    You would get, this time, the following text :

    family's|
    homemaker|
    Line 3, without the two words
    homemaker|test|see|family's|
    Family's|test|see|homemaker|
    Line 6, without the two words, too !
    test|see|homemaker|
    homemaker|family's|homemaker|test|family's|see|family's|homemaker|
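
Since the only thing that changes is the list of alternatives, the search regex can also be generated from a word list. The Python sketch below is an illustration under assumptions: build_search and keep_only are hypothetical helper names, and a callback replaces the (?1…)(?2…) conditional replacement, which Python's re module does not offer.

```python
import re

def build_search(words):
    # re.escape protects words that contain regex metacharacters.
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(
        rf"^((?!{alternatives}).)+$|.*?({alternatives})|.+",
        re.IGNORECASE | re.MULTILINE,
    )

def keep_only(words, text):
    def rewrite(m):
        if m.group(1) is not None:   # line without any special word: keep it
            return m.group(0)
        if m.group(2) is not None:   # special word found: keep it, add "|"
            return m.group(2) + "|"
        return ""                    # text after the last special word: drop it
    return build_search(words).sub(rewrite, text)

line = "This is a homemaker test to see if family's all is OK"
kept = keep_only(["homemaker", "family's", "test", "see"], line)
print(kept)
```

One point to keep in mind: the alternatives are not wrapped in \b, so a short word such as see would also be found inside a longer word like seed. Adding \b around the group is one possible refinement if that matters.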
    

    Cheers,

    guy038

