    Replacing Duped Words across a block of text, respecting {}

      guy038

      Hi @marc-Lalonde

      Use this regex, below :

      SEARCH (?s-i)\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)

      Compared to my previous one, I’ve just replaced the (?-is) part with (?s-i), so that the dot also matches line-break characters


      So, taking your last example ( which needed some re-formatting, to make sure that names, with double quotes or not, are surrounded by space chars ! ) and which, BTW, contains 3 duplicates : 123, 321 and 852 !

      { 123 321 654 }
      { 123 951 753 "123" }
      { 456 852 "753 123" }
      { 123 "321" 852 }
      

      AFTER replacement, we get the new text below :

      { 654 }
      { 951 753 }
      { 456 "753 123" }
      { 123 "321" 852 }
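
For readers who want to try this search outside Notepad++, here is a minimal sketch in Python. It assumes Python's `re` module rather than the Boost engine used by N++: the leading `(?s-i)` is dropped, `re.DOTALL` plays the role of `(?s)`, and matching is case-sensitive by default. The names `PATTERN`, `remove_duplicates` and `sample` are made up for the illustration.

```python
import re

# Same search as above, minus the leading (?s-i):
# re.DOTALL stands in for (?s); Python's re is case-sensitive by default.
PATTERN = re.compile(
    r'\x20("?)(\w[\w ]*\w)\1(?=\x20(?:.+?\x20)?("?)\2\3\x20)',
    re.DOTALL,
)

def remove_duplicates(text: str) -> str:
    """Delete every name that occurs again, further on, anywhere in the text."""
    return PATTERN.sub("", text)

sample = (
    '{ 123 321 654 }\n'
    '{ 123 951 753 "123" }\n'
    '{ 456 852 "753 123" }\n'
    '{ 123 "321" 852 }\n'
)

result = remove_duplicates(sample)
print(result)
```

On this sample the sketch gives the same four lines as the result shown above. Note that the look-ahead is free to search past `}` and `{`, so a name is removed even when its twin sits in a different block.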
      

      I hope this last version is the right one ;-))

      Best Regards

      guy038

        Marc Lalonde

        That's very nearly got it, it just needs to detect the closing bracket

        EG
        {
        Line 1: 123 654
        line 2: 321 456
        }
        {
        Line 1: 123 654
        Line 2: 582 456 123
        }

        Would only hit on the 123 pair in the second set

        Ie
        {
        Line 1: 123 654
        line 2: 321 456
        }
        {
        Line 1: 654
        Line 2: 582 456 123
        }

          Marc Lalonde

          I do have to say thank you though. If it's not possible to get it to look inside each set of { } as a unique group in one mass sweep, then at the very least it's made the job that much more manageable. Some 300 entries were caught by the scripts above, which is 300 I don’t have to look for. So thanks again there, guy038

            guy038

            Hi, Marc,

            I don’t give up :-)) Just making numerous tests. It should be OK, very soon…

            guy038

              Marc Lalonde

              Would it be better if you could see exactly what I'm trying to work with to make the script? I would be willing to link via TeamViewer (https://www.teamviewer.com/en/) so you have a better idea of exactly what is needed. I've still got some 1100 duplicates to hunt down. FYI, I'm still stuck behind the 2 rep wall of only one post every 20 mins >.<

                guy038

                Hello, @marc-Lalonde and All,

                I think that my new version should be very close to your needs ! Just try it out :-)

                So, assuming the following hypotheses :

                • Names consist of a single word of, at least, two word characters, or of several words separated by, at least, one space character, and may be surrounded by double quotes

                • Names are preceded by, at least, one space char or are located at the beginning of a line

                • Names are followed by, at least, one space char or are located at the end of a line

                • Each block of names is embedded in a {.........} block, on a single line or split over several lines

                • If a name is preceded or followed by a brace, one space character, at least, must separate them

                Then, the correct version, which removes duplicate names within each block only, should be :

                SEARCH (?s-i)(?:^|\x20+)("?)(\w[\w ]*\w)\1(?=(?:\x20+|\R)(?:[^{}]*(?:\x20+|\R))?("?)\2\3(?:\x20|\R))


                Remark :

                Compared to my previous version, this regex is more complex. So, for a better understanding, here is the equivalent version in free-spacing regex mode, which allows non-significant space characters to be inserted in the regex !

                SEARCH (?xs-i) (?: ^|\x20+) ("?) (\w[\w ]*\w) \1 (?= (?: \x20+|\r?\n) (?: [^{}]* (?: \x20+|\r?\n) )? ("?) \2 \3 (?: \x20|\r?\n) )


                So given, for instance, the initial text below :

                {
                123 123 654
                 321 456
                 999 852 666 "852"
                 123
                }
                
                {
                 123 654 222 333 999
                 852 "999" 456 123
                 000 "123 654" "999"
                }
                
                {
                 "123 654" 555 "111"
                 852 111 "123" "000"
                 999 "333" "123 654" 000 333
                }
                
                {
                111 789
                789
                789 222
                333 789 444
                }
                {
                555 "789"
                "789"
                "789" 666
                777 "789" 888
                }
                
                {            3456      "3456"      3456       }
                
                { "6789" 6789 "6789" }
                
                {
                 "456" "12 34 56" 456 "123" "456" 123 "12 34 56" 789
                }
                
                

                We get the following text :

                {
                 654
                 321 456
                 999 666 "852"
                 123
                }
                
                {
                 222 333
                 852 456 123
                 000 "123 654" "999"
                }
                
                {
                 555
                 852 111 "123"
                 999 "123 654" 000 333
                }
                
                {
                111
                
                 222
                333 789 444
                }
                {
                555
                
                 666
                777 "789" 888
                }
                
                {      3456       }
                
                { "6789" }
                
                {
                 "456" 123 "12 34 56" 789
                }
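
As a rough illustration of how the block-aware search works, here is a commented free-spacing translation for Python's `re` module. This is only a sketch under a few assumptions: `re.MULTILINE` is added so that `^` matches at the start of every line, as it does in the N++ search; `\R` is written as `\r?\n`; and the space inside the character class is spelled `\x20`. The names `BLOCK_PATTERN` and `remove_block_duplicates` are hypothetical.

```python
import re

BLOCK_PATTERN = re.compile(
    r'''
    (?: ^ | \x20+ )        # a name starts a line or follows space(s)
    ("?)                   # group 1: optional opening double quote
    (\w[\w\x20]*\w)        # group 2: the name itself (2+ chars, may hold spaces)
    \1                     # the matching closing quote, if group 1 is a quote
    (?=                    # look ahead, without consuming, for a twin
        (?: \x20+ | \r?\n )
        (?: [^{}]* (?: \x20+ | \r?\n ) )?   # skip text, but never a brace
        ("?) \2 \3         # the same name, quoted or not
        (?: \x20 | \r?\n )
    )
    ''',
    re.VERBOSE | re.MULTILINE,
)

def remove_block_duplicates(text: str) -> str:
    """Delete a name only if it occurs again later inside the same {...} block."""
    return BLOCK_PATTERN.sub("", text)

sample = (
    '{ 123 321 654 }\n'
    '{ 123 951 753 "123" }\n'
    '{ "6789" 6789 "6789" }\n'
)

result = remove_block_duplicates(sample)
print(result)
```

Because `[^{}]*` cannot step over a brace, the `123` of the first block is kept even though `123` appears again in the second block; only twins inside the same block count.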
                

                Waooooooo ! This regex totally drained me ;-))

                Cheers,

                guy038

                  Marc Lalonde

                  That got almost all of my goal, one more small effort should finish this. Here is a screenshot of the full file structure. The only things I see it missing right now are at the very start of the line (see the right-hand side of the screen), which is 3-4 tabs in. I didn't think that would be an issue, but it seems to be. Other than that, it got all 19 in this section alone.

                  https://imgur.com/a/MO0dV

                  One hell of a job so far. This will likely let me finish this tonight myself (even with just this part of the script), and I very much appreciate the help.

                    Marc Lalonde

                    At the very least, as I just ran the extended script above on all my files: 1100 errors down to all of 115 in one click. I owe you a drink :D

                    The only things it's missing right now are names at the start of lines, 2-3 tabs in (see the screenshot above), and ones with punctuation mid-word, e.g. Abu’l-Ghazi

                    But if it's just 115 errors, I can handle that without much more work :D

                    Again, I can't thank you enough, and I'll be sharing this script with a fellow person having to deal with a very similar issue.

                      guy038

                      @marc-Lalonde and All,

                      Ah, I see ! So, I just replaced the \x20+ syntax with \h+, to accept tabs and no-break space characters as possible separators

                      Secondly, in order to treat the apostrophe ' and the hyphen - as possible word characters, I replaced the syntax (\w[\w ]*\w) with (\w[\w '-]*\w) !

                      So, the final version of the regex is, from now on :

                      SEARCH (?s-i)(?:^|\h+)("?)(\w[\w '-]*\w)\1(?=(?:\h+|\R)(?:[^{}]*(?:\h+|\R))?("?)\2\3(?:\h|\R))

                      and, with the free-spacing mode, which allows to identify the different parts of this regex, it gives :

                      SEARCH (?xs-i) (?: ^|\h+) ("?) (\w[\w '-]*\w) \1 (?= (?: \h+|\r?\n) (?: [^{}]* (?: \h+|\r?\n) )? ("?) \2 \3 (?: \h|\r?\n) )


                      Just tell me if characters other than the apostrophe, the hyphen and the space may occur in your list of names :-))

                      Cheers

                      guy038

                      P.S. :

                      Note that, with the free-spacing regex mode (?x), the \R syntax is forbidden ! So, I replaced it with the usual \r?\n syntax !
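
The two changes (wider separators, apostrophe and hyphen inside names) can be tried out with the small Python sketch below. It is an approximation, not the N++ search itself: Python's `re` has no `\h`, so a hand-written class of space, tab and no-break space is substituted for it through a `<H>` placeholder, and `\R` is written as `\r?\n`. The names `H`, `NAME_PATTERN` and `remove_block_duplicates` are hypothetical, and the sample uses a straight apostrophe.

```python
import re

# Assumption: stand-in for Boost's \h (space, tab, no-break space U+00A0).
H = r'[\x20\t\xa0]'

NAME_PATTERN = re.compile(
    (
        r'''(?:^|<H>+)("?)(\w[\w '-]*\w)\1'''
        r'''(?=(?:<H>+|\r?\n)(?:[^{}]*(?:<H>+|\r?\n))?("?)\2\3(?:<H>|\r?\n))'''
    ).replace('<H>', H),   # <H> is only a placeholder for the class above
    re.MULTILINE,          # ^ matches at the start of every line
)

def remove_block_duplicates(text: str) -> str:
    """Delete a name if it occurs again later inside the same {...} block."""
    return NAME_PATTERN.sub("", text)

# A tab-indented block whose first name contains an apostrophe and a hyphen.
sample = "{\n\tAbu'l-Ghazi 123\n\t456 Abu'l-Ghazi\n}\n"

result = remove_block_duplicates(sample)
print(repr(result))
```

The first `Abu'l-Ghazi` is removed together with the tab in front of it, which the earlier `\x20+` version would not have accepted as a separator.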

                        Marc Lalonde

                        Luckily, late last night I realized I had somehow blanked a file relating to this, so I had a fresh copy to hit with the most recent revision. It tagged 1600 dupes. Running my validator over it catches four different kinds of occurrence where it failed, totaling just 20 errors.

                        Instances it failed.
                        __
                        “Ko cheng”

                        location, start of line after tabs, its twin was second from end of same line
                        __
                        This one i imagine will be a bit tricky :/ (if even possible)

                        ZhanYong,

                        Location anywhere, its twin has the Y lowercase instead.
                        __
                        'Abd

                        Location anywhere, culprit probably punctuation at start
                        __
                        cont.

                        Location anywhere, Probably reverse reason as above
                        __
                        Cheers. And I've passed the one from last night on to a few people; they send their thanks to you for this. It saves so much time

                          guy038

                          Hello, @marc-lalonde,

                          OK ! So, I changed the part of the regex, (\w[\w '-]*\w), responsible for matching the name that is to be deleted. The new part is ['.,]?(\w[\w '.-]*\w)['.,]?, which means that a name :

                          • Begins with a word character, possibly preceded by an apostrophe ( ' ), a dot ( . ) or a comma ( , )

                          • Then contains a possibly empty sequence of word characters, apostrophes ( ' ), dots ( . ), hyphens ( - ) or spaces

                          • And ends with a word character, possibly followed by an apostrophe ( ' ), a dot ( . ) or a comma ( , )


                          Only the inner part, beginning and ending with a word character, is stored as group 2, which must occur a second time, further on. Note also that names with some leading or trailing symbols may, again, be surrounded by double quotes, thanks to the syntax : ("?)['.,]?(\w[\w '.-]*\w)['.,]?\1

                          On the other hand, it’s important to point out that the duplicate name, matched by the regex ("?)['.,]?\2['.,]?\3 :

                          • Can have leading or trailing symbols, different from the first occurrence, to be deleted

                          • Can be surrounded, or not, with double quotes, independently, too, from the first occurrence, to be deleted


                          To end with, names with a single double quote ( as "xxxx or yyyy" ) are considered invalid entities. Indeed, let’s suppose the initial text, below :

                          { 000 "123 555 456" 999 "123 456" 789 }
                          

                          If names must be either fully surrounded by double quotes or not quoted at all, we get the same text, as there is no duplicate :

                          { 000 "123 555 456" 999 "123 456" 789 }
                          

                          If names such as "123 or 456" were allowed, we would get the wrong text, below :

                          { 000 555 999 "123 456" 789 }
                          

                          So, Marc, the new regex below should miss very few names ;-)) and, thus, get rid of the great majority of the duplicates !

                          SEARCH (?si)(?:^|\h+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1(?=(?:\h+|\R)(?:[^{}]*(?:\h+|\R))?("?)['.,]?\2['.,]?\3(?:\h|\R))


                          And, with the free-spacing mode, which allows to identify the different parts of this regex, it gives :

                          SEARCH (?xsi) (?: ^|\h+) ("?) ['.,]? ( \w[\w '.-]*\w ) ['.,]? \1 (?= (?: \h+|\r?\n) (?: [^{}]* (?: \h+|\r?\n) )? ("?) ['.,]? \2 ['.,]? \3 (?: \h|\r?\n) )


                          Oh, I forgot to say that the search is, from now on, case-insensitive, due to the (?si) modifiers at the beginning of the regex. So, assuming the text :

                          { ZhanYong Zhanyong }
                          

                          The first word would, indeed, be a duplicate of the second one and, thus, deleted !
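
The three behaviours described in this post (case-insensitive twins, optional leading or trailing punctuation, and quotes that only count in pairs) can be checked with the Python sketch below. As before, it is an approximation for Python's `re` module and not the Boost search itself: `\h` is replaced by a hand-written class through a `<H>` placeholder, `\R` by `\r?\n`, and the `(?si)` prefix by the `re.IGNORECASE` and `re.MULTILINE` flags. The names `H`, `FINAL_PATTERN` and `dedupe` are hypothetical.

```python
import re

# Assumption: stand-in for Boost's \h (space, tab, no-break space U+00A0).
H = r'[\x20\t\xa0]'

FINAL_PATTERN = re.compile(
    (
        r'''(?:^|<H>+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1'''
        r'''(?=(?:<H>+|\r?\n)(?:[^{}]*(?:<H>+|\r?\n))?'''
        r'''("?)['.,]?\2['.,]?\3(?:<H>|\r?\n))'''
    ).replace('<H>', H),              # <H> is only a placeholder
    re.IGNORECASE | re.MULTILINE,     # back-reference \2 ignores case too
)

def dedupe(text: str) -> str:
    """Delete a name if it occurs again later inside the same {...} block."""
    return FINAL_PATTERN.sub("", text)

# 1. Case-insensitive twin: the first spelling is removed.
case_result = dedupe('{ ZhanYong Zhanyong }\n')

# 2. Leading apostrophe on the first name, trailing comma on its twin.
punct_result = dedupe("{ 'Abd 555 Abd, }\n")

# 3. Quotes only count in pairs, so nothing here is a duplicate.
quoted = '{ 000 "123 555 456" 999 "123 456" 789 }\n'
quoted_result = dedupe(quoted)

print(case_result, punct_result, quoted_result, sep="")
```

In the second case the whole ` 'Abd` is deleted, apostrophe included, because the optional punctuation sits outside group 2 and so does not have to match between the two occurrences.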

                          Best Regards,

                          guy038

                            Marc Lalonde

                            Since I had finished my files, I took the original main version and first ran it through the validator: 1876 errors. After running the script over it and validating again, it had got every single one. The only very minor issue, which doesn’t really have to be fixed, is that it strips the last two closing } brackets at the very end of the file. That takes all of 5 seconds to re-add, so I consider this a completed script. I very much appreciate the help over the last 24 hours, it probably saved me double that, if not more.

                              Marc Lalonde

                              @guy038 For some reason this script has stopped working for me… it's now just wiping the entire file.

                              Find what : (?si)(?:^|\h+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1(?=(?:\h+|\R)(?:[^{}]*(?:\h+|\R))?("?)['.,]?\2['.,]?\3(?:\h|\R))
                              replacing with nothing

                              wrap around
                              regular expression

                              and it's blanking the file… (replace all: 1 occurrence was replaced)


                              nvm ?..? It seems an extra space had worked its way into the code…

                                guy038

                                Hi, Marc and All,

                                IMHO, I suppose that your current file contains too much data ! I have, very often, seen complicated regexes fail totally when applied to huge amounts of text, with the result that only one wrong match, covering all the file contents, occurs :-((

                                Maybe, try to slice your file into smaller parts ! It could help ?!

                                Generally, this problem often occurs when using the recursion feature in regexes. But it’s quite difficult to fully understand the limitations of the Boost regex engine used in N++ !
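
One way to picture the "smaller parts" idea is to hand the search one `{...}` block at a time, so that no single match attempt ever sees the whole file. The Python sketch below does this under stated assumptions: it uses Python's `re` instead of the Boost engine (so it says nothing about where N++ itself gives up), it treats blocks as flat, and nested braces are not handled. `H`, `FINAL_PATTERN`, `BLOCK` and `dedupe_per_block` are hypothetical names; the pattern is the Python approximation of the final regex, with `\h` and `\R` replaced by equivalents.

```python
import re

# Assumption: stand-in for Boost's \h (space, tab, no-break space U+00A0).
H = r'[\x20\t\xa0]'

FINAL_PATTERN = re.compile(
    (
        r'''(?:^|<H>+)("?)['.,]?(\w[\w '.-]*\w)['.,]?\1'''
        r'''(?=(?:<H>+|\r?\n)(?:[^{}]*(?:<H>+|\r?\n))?'''
        r'''("?)['.,]?\2['.,]?\3(?:<H>|\r?\n))'''
    ).replace('<H>', H),              # <H> is only a placeholder
    re.IGNORECASE | re.MULTILINE,
)

# One flat {...} block: an opening brace, anything but braces, a closing brace.
BLOCK = re.compile(r'\{[^{}]*\}')

def dedupe_per_block(text: str) -> str:
    """Apply the duplicate search to each {...} block separately."""
    return BLOCK.sub(lambda m: FINAL_PATTERN.sub("", m.group(0)), text)

sample = '{ 123 123 }\n{ 123 456 }\n'
result = dedupe_per_block(sample)
print(result)
```

Since the search never looks past a brace anyway, running it block by block should give the same text as one pass over the whole input, at least for flat blocks like these.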

                                Cheers,

                                guy038

                                The Community of users of the Notepad++ text editor.