Community
    • Login

    Finding multiple words inside extra quotes

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    18 Posts 5 Posters 1.1k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • TexasKCFanT
      TexasKCFan
      last edited by

      Here is the results I get with the solution you gave me. The correct result along with incorrect results from the end of one line to the beginning of the next. I’m trying to figure out what to add so it won’t pick up the false results that include CRLF.

      fa27011d-1f83-44a7-9cef-7c81db2d9535-image.png

      1 Reply Last reply Reply Quote 0
      • Terry RT
        Terry R
        last edited by

        @TexasKCFan said in Finding multiple words inside extra quotes:

        Here is the results I get with the solution you gave me.

        Sorry it was not a solution. I was just pointing out that your requirement to exclude certain characters had been coded incorrectly, and how it might be coded. I didn’t do any tests.

        I haven’t been following this thread so don’t feel that I could provide one, others who have already presented some ideas might be the better ones to extend those solutions for your latest request.

        Terry

        1 Reply Last reply Reply Quote 0
        • TexasKCFanT
          TexasKCFan
          last edited by

          My apologies Terry I thought you were Peter. Thank you for the input!

          1 Reply Last reply Reply Quote 0
          • TexasKCFanT
            TexasKCFan
            last edited by

            (,"[^\r\n][^"]*)("[\w\s]*)(")([^"]*",)

            This is the solution that worked. I had to include the \r\n earlier in the line of code so it ignored the CRLF.
            This is now the desired result.

            5ad38658-3c7f-48a1-aac6-4f54d598e582-image.png

            Thank you @PeterJones and @Terry-R for your help!

            1 Reply Last reply Reply Quote 2
            • guy038G
              guy038
              last edited by guy038

              Hello @texaskcfan, @Peterjones, @terry-r and All,

              @texaskcfan, if I assume that :

              • Your list contains one record per line

              • The field separator is the comma, located right after the previous field and right before the next field

              • Any field is, either :

                • Any text, between two double quotes, beginning a line, possibly preceded with blanks characters and followed with a comma, as "xxxxxx",

                • Any text, between two double quotes, itself surrounded by two commas, as ,"xxxxxx",

                • Any text, between two double quotes, preceded by a comma and ending a line or the very end of current file, as ,"xxxxx"CRLF

              The following regex, containing a recursive pattern, will find all the zones between two commas, which contains an even number of " ( double-quote characters ) :

              (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z)

              For instance, given this sample, that you’ll copy in a new tab :

              "0","0","","","","","","","","","","","","",""
              
              "Field1","Field2","Field3","Field4","Field5","Field6","Field7","Field8","123 Main "St AptD"","","Dallas","TX","12345","","","","","","","","","","","","","","","","","","","","","","",""
              
              "Field1","Field2","Field3","Field4","Field5","Field6","Field7","Field8","This fie"ld is NOT correct","","Dallas","TX","12345","","","","","","","","","","","","","","","","","","","","","","",""
              
              
              ," abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"
              
              " abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"
              
              
              "","","",","","","",""
              
              ","","","","",""
              
              "   """    """   """    "
              "   " "    """   """    "
              
              1234567890","","","","",""
              
                    " abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"
              
              • Open the Mark dialog ( Ctrl + M )

              • Paste the above regex in the Find what: zone

              • Preferably, tick the Purge for each search option

              • Possibly, tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Mark All button

              => All the fields ,........, containing an even number of " ( thus any correct field ), are marked in red style. This means that any zone, still unmarked, contains an odd number of double quotes, probably indicating one " character, too many or too few ;-))

              Note that, for a correct scanning process of the text, the regex marks from the comma, before each zone "....." till the ", right before the next comma !

              You may, as well, use the Find dialog to visualize each correct field, one a a time !

              Here is a snapshot of the sample, after the mark operation :

              2c764ef9-1799-4107-ad48-dcdb09438eb4-image.png


              As you can see, all the fields non marked are incorrect in some ways and need examination.

              Note, also, that it’s impossible build up a regex in order to get the opposite logic, i.e. which would match the general case of any zone, between two commas, containing an odd number of double quotes inside !

              Best Regards

              guy038

              1 Reply Last reply Reply Quote 1
              • guy038G
                guy038
                last edited by guy038

                Hi, @texaskcfan, @Peterjones, @terry-r and All,

                @guy038 said :

                Note, also, that it’s impossible build up a regex in order to get the opposite logic, i.e. which would match the general case of any zone, between two commas, containing an odd number of double quotes inside !

                Well… I was wrong :-( As usual , just when I woke up, I understood the way to do ! We simply must add the two control verbs sequence (*SKIP)(*F) at the end of the previous regex and add an alternative which selects all standard chars sequence till the nearest comma , character ;-)), giving the final regex :

                MARK (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?-s) .+? (?=,|\R|\z)

                With the same sample as in my previous post, we get, this time :

                0fccd582-5f14-4310-909c-5b8ac0dfc754-image.png


                Now, it’s even more easy to focus our attention on the problem areas !

                BR

                guy038

                1 Reply Last reply Reply Quote 3
                • guy038G
                  guy038
                  last edited by guy038

                  This post is deleted!
                  1 Reply Last reply Reply Quote 0
                  • guy038G
                    guy038
                    last edited by guy038

                    @texaskcfan, @peterjones, @terry-r and All,

                    I improved my regex in order to satisfy all the possible cases and get a coherent behavior :

                    MARK (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?:^\h*|,) (?: [^"\r\n]+ (?=,|\R|\z) (*SKIP) (*F) | [^,\r\n]+? (?=,|\R|\z) )

                    So, to summarize, this regex will mark, from beginning of line, any possible comma and its next zone, till the nearest comma, excluded, or till the end of line / file, ONLY IF  this zone contains an odd number of double quotes "

                    Just use this sample, below, to test the regex against !

                    # With 0, 1, 2, 3, 4, 5 or 6 DOUBLE quote(s) and WITHOUT any char between DOUBLE-quotes
                    #  => Matches when 1, 3 or 5 DOUBLE quote(s) ( INCORRECT zones ) :
                    
                    ,,
                    ,",
                    ,"",
                    ,""",
                    ,"""",
                    ,""""",
                    ,"""""",
                    
                    # With 0, 1, 2, 3, 4, 5 or 6 DOUBLE quote(s) and with some WORD chars between DOUBLE-quotes
                    #  => Matches when 1, 3 or 5 DOUBLE quote(s) ( INCORRECT zones ) :
                     
                    ,,
                    ,",
                    ,"123456",
                    ,"123"456",
                    ,"12"34"56",
                    ,"12"34"56"78",
                    ,"12"34"56"78"90",
                    
                    ,ABCXYZ,
                    ,"ABCXYZ,
                    ,"ABCXYZ",
                    ,"ABC"XYZ",
                    ,"ABC""XYZ",
                    ,"ABC"""XYZ",
                    ,"ABC""""XYZ",
                    
                    ,ABCXYZ,
                    ,ABCXYZ",
                    ,"ABCXYZ",
                    ,"ABC"XYZ",
                    ,""ABCXYZ"",
                    ,""ABC"XYZ"",
                    ,"""ABCXYZ""",
                    
                    # With an EVEN number of DOUBLE quotes ( 4 ) for the STARTING, MIDDLE and ENDING field of a record
                    #  => CORRECT zones, NOT matched :
                    
                    "12"345678"90",
                    ,"123"4567"890",
                    ,"1234"56"7890"
                    
                    # With an ODD number of DOUBLE quotes ( 3 ) for the STARTING, MIDDLE and ENDING field of a record
                    #  => INCORRECT zones, matched :
                    
                    "12345"67890",
                    ,"12345678"90",
                    ,"12"34567890"
                    
                    # WITHOUT any DOUBLE-quote => All are CORRECT zones, so NOT matched ( indeed ZERO is an EVEN number ! )
                    
                    ,
                    ,,
                    ABCXYZ
                    ABCXYZ,
                    ,ABCXYZ,
                    ,ABCXYZ
                    

                    And we get the picture :

                    41b9daae-c76d-4a78-8e69-66e85aaa62a0-image.png

                    Best Regards,

                    guy038

                    TexasKCFanT 1 Reply Last reply Reply Quote 1
                    • TexasKCFanT
                      TexasKCFan @guy038
                      last edited by

                      @guy038 I can’t express how much I appreciate this additional detail! One search string to identify all inconsistencies would be a HUGE time saver. Since many of the files I work with have a couple of hundred thousand lines, my process is to identify the lines with issues, copy them and send them back to the client so they can correct them.

                      One note on the structure of my data…a successful line is 38 fields in this structure:

                      "Field 1", "Field 2","Field 3"CRLF
                      

                      Data wrapped in a set of double quotes is acceptable regardless of punctuation (a lot of foreign addresses have multiple periods and commas). The issues are caused by extra single or double quotes without the properly associated commas, incorrectly causing the fields to be misaligned. Either one word, or a group of words contains extra single or double quotes without the correctly placed comma creates extra fields. Single quotes can cause an also because the system’s looking for the non-existent matching quote.

                      "Field "1"",
                      "Fiel'd 1",
                      "Field 1 "but with additional spaces and characters" and extra set of quotes",
                      

                      When I run (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?-s) .+? (?=,|\R|\z) I see that I’m getting non-error results like:

                      ,"123 Main St, Apt Q"
                      

                      While this example is not really correct (Apt Q should be in the next field), it doesn’t cause an issue per se because the whole thing is correctly enclosed in one set of double quotes and there are no extra quotes.

                      I think the expected error output would be something like this:

                      "123 Main "St", Apt Q"
                      "123 "Main St, Apt Q""
                      "123 Main S't, Apt Q"
                      

                      If every line starts with an opening quote, ends with a closing quote, followed by CRLF and all the quotes in between have matching quote with a comma delimiter then I have the correct number of fields and my only concern is when they put the data in the wrong fields which is beyond the scope of this post. :)

                      Is it possible to make the start and end of the search string double quotes instead of commas so we can identify extra quotes (single or double) that don’t belong?

                      I’m brand new to the regex world and I truly appreciate the assistance!

                      1 Reply Last reply Reply Quote 0
                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @texaskcfan, @peterjones, @terry-r and All,

                        Well, of course, my previous search is rather academic and not adapted to your own data. It just showed a way to search for an odd number of double quotes in fields defined as below :

                        • A field begins after a double quote at beginning of line or after the string ","

                        • This same field ends before a next string "," or before an end of line or the very end of current file

                        And my goal was to demonstrate that this kind of search can be achieved, only with recursive regexes. Indeed, if you lack this kind of regexes, you need to find out one regex which will match zones with 1 double quote ONLY, an other regex, which will match zones with 3 double quotes ONLY, and so on…


                        But, seemingly, it’s not exactly why you’re posting here !. However, I’m rather confused in some ways, by your last post :

                        • When you say :

                        I think the expected error output would be something like this:

                        “123 Main “St”, Apt Q”
                        “123 “Main St, Apt Q””
                        “123 Main S’t, Apt Q”

                        Do you mean that, given a single record of 8 fields :

                        CRLFField_1","Field_2","Field_3 with ERROR 1","Field_4","Field_5","Field_6 with ERROR 2","Field_7","Field_8"CRLF
                        

                        You expect this output :

                        .... PREVIOUS record ....
                        "Field_1","Field_2"CRLF
                        "Field_3 with ERROR 1"CRLF
                        "Field_4","Field_5"CRLF
                        "Field_6 with ERROR 2"CRLF
                        "Field_7","Field_8"CRLF
                        .... NEXT record ....
                        

                        In other words do you plan to isolate erroneous fields ?

                        Note that, at this time of the discussion, I don’t care about the errors themselves ! We’ll track them, later, by appropriate regexes !

                        • Secondly, does your sample text :
                        "123 Main "St", Apt Q"
                        "123 "Main St, Apt Q""
                        "123 Main S't, Apt Q"
                        

                        Represent 3 fields, one per line, delimited with the outer " characters ?

                        Represent several fields per line. For instance, "123 "Main St, Apt Q"" would have two fields 123 "Main St and Apt Q", separated with the comma ?

                        • Thirdly, if we consider any field, with delimiter = beginning or end of line or the string ",", can you show us the different types of strings, containing single and/or double quotes and/or commas, in between, which may occur and which are the ones that you would like to detect as errors ?

                        • More generally, which further treatment do you expect, when meeting these errors ?

                        @texaskcfan, it’s important to note that regexes are extremely sensitive to real data used and, although I give some generic regexes for general purposes, in this forum, a regular expression is generally adapted to a specific user and for specific data ;-))


                        I probably miss a lot of things about your data and your work-flow :-( So, just enlighten me !

                        See you later,

                        Best regards,

                        guy038

                        P.S. : The best would be that you provides a part of your file, containing a fairly range of records, which identifies the zones to be detected as erroneous !

                        Finally, if your data is rather confidential, just send me an e-mail at :

                        1 Reply Last reply Reply Quote 1
                        • TexasKCFanT
                          TexasKCFan
                          last edited by

                          @guy038 My apologies for the confusion. Thank you for the general knowledge for the community!

                          1. Given a single record of 8 fields, I would expect to see it exactly as you posted but minus the CRLF at the beginning and with an opening quote.

                          "Field_1","Field_2","Field_3 with ERROR 1","Field_4","Field_5","Field_6 with ERROR 2","Field_7","Field_8"CRLF

                          1. Yes, I plan to isolate erroneous fields. What I meant to say is these would be the three different examples of errors I would be able to identify in my data. My sample text represented three versions of errors I would find in a single field. A full line would be 38 fields in the format just included above.
                          “123 Main “St”, Apt Q” - Example of a single word incorrectly encased in double quotes `"St"`
                          “123 “Main St, Apt Q”” - Example of a phrase incorrectly encased in double quotes `"Main St, Apt Q"`
                          “123 Main S’t, Apt Q” - Example of an incorrect single quote `S't`
                          
                          1. Yes, my goal is to identify the errors so I can copy the rows and bring them to the client’s attention so the can fix them. I am not correcting the errors, simply identifying them during my pre-processing QA. If I process the files with these errors, the records will not load so I’m attempting to provide additional value to the client. :-)

                          I will be able to use the solutions above to identify the issues. I will put together a range of records for you to review and post later.

                          Thank you so much for your help! :-))

                          1 Reply Last reply Reply Quote 0
                          • prahladmifourP
                            prahladmifour
                            last edited by

                            Hello,@TexasKCFan
                            Please try this code, To Finding multiple words inside extra quotes
                            Regex:

                            /^[^"]*("[^"]*"[^"]*)*"[^"]*search[^"]*"/i
                            

                            I hope this code will be useful to you.
                            Thank you.

                            1 Reply Last reply Reply Quote 0
                            • guy038G
                              guy038
                              last edited by guy038

                              Hello, @texaskcfan, @peterjones, @terry-r and All,

                              While searching a solution, I realized that the comma delimiter can also be found inside fields themselves ! This is really problematic :(

                              So, in order to get a neat field delimiter, we should, temporarily, change the , delimeter by, let’s say, the ¤ character if this char is not used in your file ! From now on, I suppose that it’s the case :

                              Thus, we first change the field delimiter of your file with the simple normal S/R :

                              SEARCH ","

                              REPLACE "¤"

                              Note that you may replace the ¤ char with any char not used yet !

                              And, at the very end of the process, you’ll just perform this normal S/R to get the right syntax, again :

                              SEARCH ¤

                              REPLACE ,


                              Now, given your 3 error cases that you provided, here is my first try towards a complete solution ! I’m using the free-spacing mode (?x) for a better readability !

                              • MARK (?x) (?:^"|"¤") \K ([^"'¤\r\n]*?) ("|') (?1) \2? (?1) (?="¤"|"\R|"\z)    without the outer " chars marked

                              OR

                              • MARK (?x) (?:^|"¤) \K " ([^"'¤\r\n]*?) ("|') (?1) \2? (?1) " (?=¤"|\R|\z)    with the outer " chars marked

                              If we try these two regexes, against this sample text, below :

                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St", Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "123 "Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, Apt Q"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St, Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "123 Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, A'pt Q"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              "123 "Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St", Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, Apt Q"
                              "Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
                              

                              we get :

                              54839a2e-04a5-4f1e-8635-68209f9cb2af-image.png


                              • In that sample, I simply consider lines of 10 fields

                              • In lines 4, 7 and 10, you’ll recognize your 3 error cases

                              • In lines 14, 17 and 20, I deleted one double quote and added one single quote, which are also correctly detected as a possible error by the regex. But, If you do not need the detection of these extra cases, I won’t have any problem to modify the regex

                              • In line 24, I added a record with the 3 error cases, together

                              • From lines 29 to 50, results are identical to those above, but the outer double-quotes, containing fields, are also marked !


                              In this text, it’s very easy to move from one error’s zone to an other one :

                              • Use the F2 and the Shift + F2 shortcuts to move from one erroneous record to the next or previous one

                              • Use the Ctrl + 0 and Ctrl + Shift + 0 shortcuts, on the main keyboard, to move from one erroneous field to the next or previous one


                              You’ll remark, that I haven’t spoken yet, about a method to isolate all the erroneous fields, one per line. Indeed, @texaskcfan, I would like, first, to get your feeling about keeping records as they are ( excepted for the temporary field delimiter change ! ) and simply navigating between all error’s zones, with the shortcuts mentioned above !

                              Best Regards,

                              guy038

                              1 Reply Last reply Reply Quote 3
                              • TexasKCFanT
                                TexasKCFan
                                last edited by

                                @guy038 This solution is perfect and returns all the errors I am looking for. Thank you!

                                1 Reply Last reply Reply Quote 1
                                • First post
                                  Last post
                                The Community of users of the Notepad++ text editor.
                                Powered by NodeBB | Contributors