Finding sentences opened with a quotation mark and not closed



  • How do I find unclosed quotations? (i.e. "A sentence that starts with a quotation mark and does not end with the corresponding quotation mark, like this)



  • Hello Vittorio,

    Let’s imagine the simple English text below:

    He said "I'm glad to see you". Then, he invited me to come in, saying "Let's have a drink" and, also, "Make yourself at home" !
    
    • If you forget the closing quotation mark after the word you, you can still detect the unbalanced quotation mark by considering the first sentence, which ends at the dot.

    • But if, in the second sentence, you forget both the closing quotation mark after the word drink and the opening quotation mark before the word Make, the quotation marks are still balanced, and the mistake WON’T be detected!
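    The two cases above can be illustrated with a rough Python sketch, which is not part of the regex solution itself. It splits the text naively at each dot and reports the sentences that hold an odd number of " characters. The function name odd_quote_sentences and the two shortened sample strings are assumptions made for the illustration.

```python
def odd_quote_sentences(text):
    """Return the sentences (split naively on '.') holding an odd number of '"'."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if s.count('"') % 2 == 1]


# Closing quotation mark forgotten after "you": the first sentence is reported.
one_missing = 'He said "I\'m glad to see you. Then, he said "Let\'s have a drink" !'
print(odd_quote_sentences(one_missing))

# Closing mark after "drink" AND opening mark before "Make" both forgotten:
# the count stays even, so nothing is reported.
two_missing = 'He said "Let\'s have a drink and, also, Make yourself at home" !'
print(odd_quote_sentences(two_missing))
```

    The second call returns an empty list, which is exactly the blind spot described above: a simple parity check cannot see two missing quotation marks in the same sentence.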

    On the other hand, detecting an unbalanced SINGLE quotation mark is very difficult in English, because of contractions (such as I’m, we’re…), possessive forms (such as Mary’s hat), and so on.


    But, assuming that there should be ONLY ONE balanced pair of DOUBLE quotation marks per sentence or per line, a regex that searches for a LONE DOUBLE quotation mark can be built, and it could solve your problem!

    So, follow the few steps below:

    • Open the Find dialog ( CTRL + F )

    • Select the Regular expression search mode

    • In the Find what field, type the regex (^|\.)[^"]*\K"(?=[^"]*(\.|$))

    • If necessary, check the Wrap around option

    • Click once on the Find Next button

    • Hit the ESC key to close the Find dialog

    • Continue searching downwards with the F3 key, or upwards with the SHIFT + F3 shortcut

    This regex selects any LONE DOUBLE quotation mark found in a sentence of the text.


    NOTES:

    • A sentence is assumed to begin at the beginning of a line ( ^ ) OR after a DOT ( \. )

    • A sentence is assumed to end at a DOT ( \. ) OR at the end of a line ( $ )

    • The syntax [^"]* matches any run of characters, possibly empty, other than a double quotation mark.

    • The syntax (?=[^"]*(\.|$)) is a positive look-ahead: a condition which MUST be satisfied, but whose text is NOT part of the final match. It means that, after the first " is found, no other double quotation mark occurs before the end of the current sentence OR, if NO dot is found, before the end of the current line.

    • The \K syntax, just before the double quotation mark, means that everything matched before \K is dropped from the final match, so that the match is ONLY the double quotation mark located between \K and the look-ahead (?=[^"]*(\.|$)).
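    For readers who want to try the idea outside Notepad++, here is a minimal Python sketch of the same regex. Python’s standard re module has no \K, so this version captures the quotation mark in group 1 instead, which is a substitute technique and not the original syntax. The names LONE_QUOTE and lone_quote_offsets are hypothetical, and line-ending handling may differ slightly from Notepad++ (Python’s $ only recognises \n).

```python
import re

# Port of (^|\.)[^"]*\K"(?=[^"]*(\.|$)).
# Assumption: group 1 stands in for what \K would leave as the final match.
LONE_QUOTE = re.compile(r'(?:^|\.)[^"]*(")(?=[^"]*(?:\.|$))', re.MULTILINE)


def lone_quote_offsets(text):
    """Return the offsets of the quotation marks the pattern singles out."""
    return [m.start(1) for m in LONE_QUOTE.finditer(text)]


balanced = 'He said "I\'m glad to see you". Then he left.'
missing_end = 'He said "I\'m glad to see you. Then he left.'
print(lone_quote_offsets(balanced))     # nothing to report
print(lone_quote_offsets(missing_end))  # offset of the opening quotation mark
```

    The look-ahead is zero-width, so each match ends right after the captured quotation mark and the search can resume from there.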


    Of course, it matches either a LONE opening OR a LONE closing double quotation mark of a sentence. Afterwards, it’s up to you to decide where the missing double quotation mark should go!

    For instance, run the search on the two sentences below, which have unbalanced quotation marks:

    He said "I'm glad to see you.
    
    He said I'm glad to see you".
    

    Best Regards,

    guy038



  • Hi, Vittorio,

    Thinking again about this topic, I was able to improve my previous search regex a bit.

    With the regexes below, it’s possible to detect any ODD number of double quotation marks, ", in a sentence or, failing that, in a complete line of text :-). Admittedly, these new regexes look rather tricky, but they do work!

    The first regex, below, selects the last, unbalanced, double quotation mark of a sentence or, failing that, of a complete line:

    SEARCH (^|\.)(?:([^".\r\n]*)"(?2)")*(?2)\K"(?=(?2)(\.|$))

    NOTES:

    • Group 1, (^|\.), matches the beginning of a line or the dot which ends the previous sentence.

    • The group (?:([^".\r\n]*)"(?2)")* matches any number, possibly zero, of well-balanced sequences, of the form ....."..."..."......". Note that it is a non-capturing group, because of the ?: syntax at its beginning.

    • Therefore, group 2 is ([^".\r\n]*), inside the non-capturing group. It matches any run, possibly empty, of characters other than a double quotation mark, a dot and an EOL character.

    • The pattern of group 2 is re-used, further on in the regex, through the subroutine call (?2). So, writing (?2) is exactly like writing the regex [^".\r\n]* again!

    • And, as in my previous post, the final match is the double quotation mark only, located after the \K syntax and before the look-ahead (?=(?2)(\.|$)), which looks for a run of characters other than " and ., up to the end of the sentence or the line.
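    As a rough cross-check outside Notepad++, the same regex can be sketched in Python. The standard re module supports neither the (?2) subroutine call nor \K, so this version splices the text of group 2 in wherever (?2) appears and captures the quotation mark in group 1. The names RUN, LAST_UNBALANCED and last_unbalanced_offsets are assumptions made for the illustration.

```python
import re

RUN = r'[^".\r\n]*'  # same body as group 2 of the original regex

# Assumption: {RUN} replaces every (?2) call, group 1 replaces \K.
LAST_UNBALANCED = re.compile(
    rf'(?:^|\.)(?:{RUN}"{RUN}")*{RUN}(")(?={RUN}(?:\.|$))',
    re.MULTILINE,
)


def last_unbalanced_offsets(text):
    """Offsets of the last '"' in each sentence or line holding an odd count."""
    return [m.start(1) for m in LAST_UNBALANCED.finditer(text)]


print(last_unbalanced_offsets('He said "yes" and "no.'))   # third quotation mark
print(last_unbalanced_offsets('He said "yes" and "no".'))  # balanced: nothing
```

    Splicing the sub-pattern in as text gives the same matches here because group 2 contains no capture of its own; it only costs some readability compared with (?2).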


    The second regex stops at the beginning of any line or sentence which contains an ODD number of double quotation marks:

    SEARCH (^|\.)\K(?=(?:([^".\r\n]*)"(?2)")*(?2)"(?2)(\.|$))

    NOTES:

    • This time, the regex matches the empty string located between the beginning of a line (or the dot of the previous sentence) and a look-ahead which checks, FROM this position, whether there is an odd number of double quotation marks before the end of the sentence or the line!

    • So you’re immediately aware that there’s an unbalanced double quotation mark further on in the current line :-)
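    This second regex can also be sketched in Python, under the same caveats: no (?2) and no \K in the standard re module. Here the anchor is consumed by the match and the END of the match is reported, which gives the position that \K would leave the cursor at. The names RUN, ODD_SENTENCE_START and odd_sentence_starts are hypothetical.

```python
import re

RUN = r'[^".\r\n]*'  # same body as group 2 of the original regex

# Everything after the anchor sits inside the look-ahead, which has zero
# width, so m.end() is the position right after the line start or the dot.
ODD_SENTENCE_START = re.compile(
    rf'(?:^|\.)(?=(?:{RUN}"{RUN}")*{RUN}"{RUN}(?:\.|$))',
    re.MULTILINE,
)


def odd_sentence_starts(text):
    """Offsets where a sentence or line with an odd number of '"' begins."""
    return [m.end() for m in ODD_SENTENCE_START.finditer(text)]


text = 'All "good" here. But "this one is not.'
for start in odd_sentence_starts(text):
    print(start, repr(text[start:]))
```

    Only the second sentence is reported, starting just after the dot of the first one.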


    To see the behaviour of these two regexes, just test them on the simple sample text below:

    Line 1 "
    Line 2 ""
    Line 3 """
    Line 4 """"
    Line 5 """""
    Line 6 """""". "Second" "sentence
    
    • The first regex should select the last " character of lines 1, 3 and 5 only, plus the " just before the word sentence.

    • With the second regex, the cursor should stop at the beginning of lines 1, 3 and 5 only, and just after the dot on line 6.


    To end with:

    • You may, of course, replace the double quotation mark, in these regexes, with a single quotation mark, for instance. However, note that the regexes above are NOT suitable when the start and stop characters are different, as for the pair ( and ) or even the French quotation marks « and »! That’s another story…

    • If you don’t care about the notion of sentences, you can simplify these regexes by changing the anchor (^|\.) into ^ and the anchor (\.|$) into $. In that case, also remove the dot from the class [^".\r\n], which becomes [^"\r\n], so that a dot inside a line no longer blocks the match.
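    Both variations can be tried with a small Python helper that builds the "last unbalanced delimiter" pattern for a chosen character, with or without the notion of sentences. This is only a sketch: the function unbalanced_delimiter_regex and its per_sentence parameter are invented for the illustration, group 1 stands in for \K, and the sub-pattern is spliced in as text because the standard re module has no (?2).

```python
import re


def unbalanced_delimiter_regex(delim='"', per_sentence=True):
    """Build the 'last unbalanced delimiter' pattern for one delimiter character.

    With per_sentence=False the anchors become ^ and $, and the dot is no
    longer excluded from the runs between delimiters.
    """
    d = re.escape(delim)
    stop = r"." if per_sentence else ""
    run = rf"[^{d}{stop}\r\n]*"
    start, end = (r"(?:^|\.)", r"(?:\.|$)") if per_sentence else (r"^", r"$")
    return re.compile(
        rf"{start}(?:{run}{d}{run}{d})*{run}({d})(?={run}{end})", re.MULTILINE
    )


whole_line = unbalanced_delimiter_regex('"', per_sentence=False)
m = whole_line.search('He said "wait. Not yet')
print(m.start(1) if m else None)

single = unbalanced_delimiter_regex("'")
m = single.search("say 'hi' and 'bye")
print(m.start(1) if m else None)
```

    With a single quotation mark as delimiter, apostrophes in contractions and possessives count as delimiters too, so the caveat about English text still applies.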

    Cheers,

    guy038

