    Search for inconsistent line endings with a regex? (part 2)

    • Alan Kilborn @Meta Chuh

      @Meta-Chuh said:

      this can only be beaten

      True that!

      who the hell locked that thread back then, and why ?

      My guess is that it was locked by “Father Time”! (i.e., age + inactivity)

      • Alan Kilborn

        Maybe I’ll shoot my own holes in it, for practical reasons:

        editor.research(r'(?s).*?(?:(?=\r\n).*?(?:(?=\r[^\n])|(?=[^\r]\n)))|(?:(?=[^\r]\n).*?(?=\r))|(?:(?=\r[^\n]).*?(?=\n))'
        
        <type 'exceptions.RuntimeError'>:  The complexity of matching the regular expression exceeded predefined bounds.
        Try refactoring the regular expression to make each choice made by the state machine unambiguous.
        This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
        
        • Alan Kilborn @Alan Kilborn

          BTW…I ended up doing this instead of a huge complex one-shot regex:

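          import re

          inconsistent_line_endings = False
          user_visible_text = editor.getTextRange(start_pos, end_pos)  # start_pos/end_pos are set elsewhere in the script
          if '\r\n' in user_visible_text:
              # CRLF is present, so any lone LF or lone CR means the endings are mixed.
              # Lookarounds also catch a lone EOL at the very start or end of the text,
              # which [^\r]\n and \r[^\n] would miss.
              if re.search(r'(?<!\r)\n|\r(?!\n)', user_visible_text): inconsistent_line_endings = True
          elif '\n' in user_visible_text:
              # No CRLF at all: mixed only if both lone LF and lone CR occur.
              if '\r' in user_visible_text: inconsistent_line_endings = True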
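For anyone who wants to try the idea outside Notepad++, here is a minimal standalone sketch of the same check using only Python's re module. The function name has_mixed_line_endings is made up for illustration, and it takes a plain string instead of reading from the editor object.

```python
import re

# Matches a CR not followed by LF, or an LF not preceded by CR.
LONE_EOL = re.compile(r'\r(?!\n)|(?<!\r)\n')

def has_mixed_line_endings(text):
    """Return True if text uses more than one kind of line ending."""
    kinds = set()
    if '\r\n' in text:
        kinds.add('CRLF')
    for m in LONE_EOL.finditer(text):
        kinds.add('CR' if m.group() == '\r' else 'LF')
        if len(kinds) > 1:
            return True  # stop as soon as a second kind shows up
    return len(kinds) > 1
```

Because it collects the kinds of EOL seen rather than assuming a file type, it needs no knowledge of whether the file is meant to be Windows, Unix or Mac.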
          
          • guy038

            Hello @alan-kilborn and All,

            Really, Alan, I think you’re making it difficult for yourself !!

            So, given a file, if you just want to match every inconsistent end-of-line sequence, and you expect your file to be:

            • A Windows file, use the regex \r(?!\n)|(?<!\r)\n

            • A Unix file, use the regex \r\n|\r(?!\n)

            • A Mac file, use the regex \r\n|(?<!\r)\n


            If you just want to select all text from the current position up to the first inconsistent end of line, and you expect your file to be:

            • A Windows file, use the regex (?s).*?(\r(?!\n)|(?<!\r)\n)

            • A Unix file, use the regex (?s).*?(\r\n|\r(?!\n))

            • A Mac file, use the regex (?s).*?(\r\n|(?<!\r)\n)


            Finally, if you prefer to highlight (and bookmark) every line that ends with an inconsistent end of line, and you expect your file to be:

            • A Windows file, use the regex (?-s).*(\r(?!\n)|(?<!\r)\n)

            • A Unix file, use the regex (?-s).*(\r\n|\r(?!\n))

            • A Mac file, use the regex (?-s).*(\r\n|(?<!\r)\n)
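These per-platform patterns port directly to Python's re module, which supports the same lookarounds. A rough sketch follows; the names WRONG_EOL and find_wrong_eols are invented here and are not part of any Notepad++ or PythonScript API.

```python
import re

# For each expected convention, a pattern matching every EOL that does NOT belong.
WRONG_EOL = {
    'windows': re.compile(r'\r(?!\n)|(?<!\r)\n'),  # lone CR or lone LF
    'unix':    re.compile(r'\r\n|\r(?!\n)'),       # CRLF or lone CR
    'mac':     re.compile(r'\r\n|(?<!\r)\n'),      # CRLF or lone LF
}

def find_wrong_eols(text, expected):
    """Return (offset, eol) pairs for each line ending that differs from expected."""
    return [(m.start(), m.group()) for m in WRONG_EOL[expected].finditer(text)]
```

For example, find_wrong_eols(text, 'windows') returns an empty list when every line ends with CRLF.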

            Best Regards

            guy038

            P.S. :

            I’m surely missing something! Why not use the Edit > EOL Conversion option? However many kinds of EOL are mixed in your current file, first choose any non-grayed option, then choose your desired EOL. All the inconsistent EOLs should then be changed to the expected type ;-))

            • Alan Kilborn @guy038

              Hello @guy038 !

              Well, I didn’t really explain my goal, did I?

              Recently I’ve been manipulating some data where I can get mixed line-endings in a file (initially). So I want to be “made aware” of the situation without having to do anything. So I’ve created a script that turns visible line-endings ON, in code, if mixed line-endings occur in one area (between start_pos and end_pos, see script segment above). This way the current situation hits me in the face.

              Anyway, I don’t know the file type (Windows/Linux, Mac really isn’t happening) in advance so I can’t do stuff based upon that. And I can’t ask the “notepad” Python object what the current file format is anyway, because if I happen to have a Windows file open in editor1/view0 and a Linux file open in editor2/view1, bad things happen to the logic.

              Probably the above makes no real sense, but I’m just trying to say that I don’t think I’m making it hard on myself, I’m just doing what it takes for a solution. And I did arrive at that solution today; stuff is working fine now. :)

              • guy038

                Hi, @alan-kilborn and All,

                Yeeeeaaaah! I succeeded in building a search regex which grabs all characters from the first EOL found up to, but excluding, the nearest following EOL of a different kind ;-))

                I’m using free-spacing mode to make it easier to read!

                (?sx)
                (
                  \r\n     .*?  (?=  (?<!\r)\n  |  \r(?!\n)  )  |  #  From a TRUE \r\n...to....a TRUE (\n|\r)   EXCLUDED
                (?<!\r)\n  .*?  (?=    \r\n     |  \r(?!\n)  )  |  #  From a TRUE \n.....to....a TRUE (\r\n|\r) EXCLUDED
                 \r(?!\n)  .*?  (?=    \r\n     | (?<!\r)\n  )     #  From a TRUE \r.....to....a TRUE (\r\n|\n) EXCLUDED
                )
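The same free-spacing pattern appears to work unchanged with Python's re module, where (?sx) corresponds to re.DOTALL plus re.VERBOSE. The sketch below is only a way to experiment outside Notepad++; first_mixed_span is a hypothetical helper name, and Python's engine may not behave identically to the Boost engine in every corner case.

```python
import re

MIXED_SPAN = re.compile(r'''(?sx)
(
  \r\n      .*?  (?=  (?<!\r)\n  |  \r(?!\n)  )  |   # CRLF up to a lone LF or lone CR
  (?<!\r)\n .*?  (?=  \r\n       |  \r(?!\n)  )  |   # lone LF up to a CRLF or lone CR
  \r(?!\n)  .*?  (?=  \r\n       |  (?<!\r)\n )      # lone CR up to a CRLF or lone LF
)
''')

def first_mixed_span(text):
    """Return the first span running from one EOL to the next EOL of a different kind, or None."""
    m = MIXED_SPAN.search(text)
    return m.group(1) if m else None
```

A result of None means the text uses a single kind of line ending throughout.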
                

                To test my regex, just paste, in a new N++ tab, the sample text, below ( the v7.6 change.log, slightly modified ) :

                Notepad++ v7.6 new feature and bug-fixes:<CRLF>
                <CR>
                <CRLF>
                1.  Add Built-in Plugins Admins. Users can install, update and remove plugins by some clicks via Plugins Admin:<CRLF>
                    https://notepad-plus-plus.org/features/plugin-admin.html<LF>
                2.  Change plugin loading method: Remove the legacy plugin loading way and apply only the new plugin loading method.<CRLF>
                3.  Add new message NPPM_GETPLUGINHOMEPATH in Notepad++ API for plugin, so plugin can get its path easily.<CR>
                4.  Fix a regression of performance issue while word wrap option is enable.<CRLF>
                5.  Fix a performance issue for switching back to folded document.<LF>
                <CR>
                6.  Fix crash issue due to Unix style path input in Open file dialog.<CR>
                7.  Fix UTF-8 detection problem: 4 byte characters UTF-8 character can be detected now.<CR>
                8.  Enhance/Fix encoding detection/problem.<CRLF>
                <LF>
                <CRLF>
                <CR>
                9.  Fix auto-indent issue by typing Enter on empty line.<LF>
                10. Fix "Close all but this" behaviour if multiple views are present and some files are dirty.<CR>
                11. Fix tool tip in document switcher showing the old name issue (after being renamed).<CRLF>
                <LF>
                <CR>
                <CRLF>
                <LF>
                <LF>
                <CRLF>
                <CRLF>
                <CR>
                12. Add autoit and lua autoCompletion<CRLF>
                <CRLF>
                <CR>
                <CRLF>
                Included plugins:<CRLF>
                <CRLF>
                <LF>
                1.  NppExport v0.2.8 (32-bit x86 only)<LF>
                2.  Converter 4.2.1<LF>
                3.  Mime Tool 2.1<CRLF>
                4.  DSpellCheck 1.4.6<LF>
                <LF>
                <LF>
                <CR>
                Updater (Installer only):<CRLF>
                <CRLF>
                * WinGup (for Notepad++) v5.0.4<CRLF>
                

                Then, with the regex S/R below, we first change the EOL characters in this new tab, to get a final text containing all forms of line break:

                SEARCH (?-i)(?:(<CRLF>)|(<LF>)|(<CR>))\R

                REPLACE (?1\r\n)(?2\n)(?3\r)
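For building the same test text in a script, here is a hedged Python equivalent of that search/replace. Python's re module has no \R and no conditional replacement syntax, so the line break is spelled out and a small function picks the replacement; tags_to_real_eols is an illustrative name only.

```python
import re

# One of the three visible tags, followed by whatever real line break the paste produced.
TAG_PLUS_BREAK = re.compile(r'(?:(<CRLF>)|(<LF>)|(<CR>))(?:\r\n|\r|\n)')

def tags_to_real_eols(text):
    """Replace each tag and its trailing line break with the EOL the tag names."""
    def repl(m):
        if m.group(1):
            return '\r\n'
        if m.group(2):
            return '\n'
        return '\r'
    return TAG_PLUS_BREAK.sub(repl, text)
```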

                Now, you can play around, with my free-spacing regex above ;-))

                Cheers,

                guy038

                P.S. :

                If a file does not contain inconsistent EOLs (i.e. if all the line breaks of the current file have the same form), NO match occurs, as expected!!

                • Alan Kilborn @guy038

                  @guy038

                  Yeeeeaaaah

                  Glad you had some fun with it!

                  • Ekopalypse @Alan Kilborn

                    @Alan-Kilborn

                    slightly different approach

                    regex_dict = {0:'\r[^\n]|[^\r]\n',
                                  1:'\n',
                                  2:'\r',}
                    
                    def check_eol(match):
                        notepad.messageBox('Different EOLS detected','EOL Missmatch', 0)
                    
                    editor.research(regex_dict[editor.getEOLMode()],    # regex
                                    check_eol,                          # function to call
                                    0,                                  # re flags
                                    0,                                  # start
                                    editor.getTextLength(),             # end
                                    1)                                  # count
                    
                    
                    • Alan Kilborn @Ekopalypse

                      @Ekopalypse

                      I either forgot about or never knew about editor.getEOLMode(). Perhaps I could have used that knowledge YESTERDAY before I finished my design the way I did. :(

                      but Thank you.

                      • Alan Kilborn @Ekopalypse

                        @Ekopalypse

                        So it appears the reason (maybe) that I never noticed editor.getEOLMode() before is that I “grew up” here on Community sample scripts where it seems that notepad.getFormatType() was used much more frequently for a very similar purpose. On an integer basis the functions even return the same number values!

                        I suppose the notepad.getFormatType() function is for what Notepad++ thinks the setting is for a file upon loading, and after that it follows the current user setting for “EOL conversion”…and the editor.getEOLMode() function usually follows the notepad.getFormatType() setting, but could be set independently (via PS code call to editor.setEOLMode()).

                        I did verify this, the editor.getEOLMode() value follows the notepad.getFormatType() value, and if you editor.setEOLMode() to something different than the Notepad++ EOL setting, and then switch the active tab and then come back to the original tab, editor.getEOLMode() will again be back at the notepad.getFormatType() setting. [A fair number of settings work this way: You can change them via PS editor functions, but a switch of tabs and a return will find them reset to original Notepad++ controlling values.]

                        For my purposes, however, the editor function is valuable to know the setting for editor1 and editor2, without, say, having to make editor2 the active editor–when it isn’t currently–and then calling notepad.getFormatType().

                        …if that all makes any kind of sense to you. :)

                        • Ekopalypse @Alan Kilborn

                          @Alan-Kilborn

                          You are absolutely right and this is something one needs to keep in mind.
                          Whenever possible, a notepad object method should be used to stay in sync with npp. Npp itself does, as far as I understand the code, use SCI_SETEOLMODE and SCI_GETEOLMODE to set/get the current eol and as far as I have understood,
                          scintilla only checks the first line to determine the eol mode.

                          I would say, to get a value it is safe to use editor object methods but, as said, if one wants to change something then notepad object methods should be preferred.

                          • Meta Chuh (moderator) @Alan Kilborn

                            @Alan-Kilborn

                            My guess is that it was locked by “Father Time”! (i.e., age + inactivity)

                            very funny … nooot

                            • no: topics don’t get locked automatically, when marked as solved.
                            • yes: topics have to be locked manually.
                            • no: this topic was not locked, to prevent follow up posts, in order to preserve its extraordinary state for eternity, like keeping an ancient ming vase empty.
                            • no: there was no content reason to lock this topic.
                            • no: the community place does not need a clean up, as the separate information exchange does not interfere with the issue tracker readability for developers.
                            • maybe: this was one of those topics, that got spammed back then, and was locked to contain it a bit.
                            • Alan Kilborn @Ekopalypse

                              @Ekopalypse said:

                              Whenever possible, a notepad object method should be used to stay in sync with npp

                              if one wants to change something then notepad object methods should be preferred

                              Very much agree. Usually the notepad object provides only get access, e.g. in this case notepad.getFormatType() has no corresponding notepad.setFormatType(). In order to do the set, one must call, for example, notepad.menuCommand(MENUCOMMAND.FORMAT_TOUNIX). This is nice because it keeps the Notepad++ user interface consistent.

                              Note a very similar discussion involving View -> Show Symbol -> … menu items via script control is found here: https://notepad-plus-plus.org/community/topic/14585/turn-on-off-the-line-ending-symbols-via-script

                              • Alan Kilborn

                                A stretch for staying on-topic, but I found a great way to set up a scenario for inconsistent line-endings, from the hand of @donho himself:

                                • Open a file with Unix (Linux) line-endings
                                • Select all (ctrl+a)
                                • Invoke Plugins -> Mime Tools -> Quoted-printable Encode

                                Boom. A very mixed line-endings file (Unix and Windows ends) now results.

                                (I discovered this when I was needing to mime a short file. I decided I don’t like how the Mime Tools plugin does its thing–not just this line-ending thing–and will resort to WinZip’s mime for my future miming needs.)

                                • Doug Hart @Ekopalypse

                                  @Ekopalypse said in Search for inconsistent line endings with a regex? (part 2):

                                  regex_dict = {0:'\r[^\n]|[^\r]\n',
                                                1:'\n',
                                                2:'\r',}

                                  def check_eol(match):
                                      notepad.messageBox('Different EOLS detected','EOL Missmatch', 0)

                                  editor.research(regex_dict[editor.getEOLMode()],    # regex
                                                  check_eol,                          # function to call
                                                  0,                                  # re flags
                                                  0,                                  # start
                                                  editor.getTextLength(),             # end
                                                  1)                                  # count

                                  I know this topic is ancient, but what exactly do I do with the sample code above? Is it supposed to be an external command, a configuration file, or something else?

                                  • Ekopalypse @Doug Hart

                                    @Doug-Hart

                                    there is a plugin called PythonScript that allows you to manipulate data in notepad++.

                                    Here are the steps on how to create and use it.

                                    The purpose of the script is to check whether the current document has different line endings (EOL), which can be problematic if you edit a file under Windows and then upload it to a Linux server, for example.

                                    • guy038

                                      Hello @ekopalypse, @alan-kilborn and All,

                                      @ekopalypse, I did not completely understand your script so I changed it and improved it as below :

                                      check = True
                                      
                                      false_EOL = {0:'$[^\r][^\n]',  # Miss the TWO chars \r\n at 'end of line' as editor.getEOLMode() = 0 ( Windows   EOL )
                                                   1:'\n',           # Should be \r                             as editor.getEOLMode() = 1 ( Macintosh EOL )
                                                   2:'\r',           # Should be \n                             as editor.getEOLMode() = 2 ( Unix      EOL )
                                                  }
                                      
                                      def check_eol(match):
                                          global check
                                          check = False
                                          notepad.messageBox('Different EOLS detected','EOL Mismatch', 0)
                                      
                                      editor.research(false_EOL[editor.getEOLMode()],     # regex to search for
                                                      check_eol,                          # function to call if regex match
                                                      0,                                  # re flags
                                                      0,                                  # START of file
                                                      editor.getLength(),                 # END   of file
                                                      1)                                  # count ( at FIRST match )
                                      
                                      if check == True:
                                          notepad.messageBox('All EOLS correct','EOL check', 0)
                                      

                                      Remarks :

                                      • I changed the word missmatch to mismatch, which seems to be the right spelling!

                                      • I changed the name of the Python dictionary from regex_dict to false_EOL, to emphasize that it holds the wrong EOLs to match in each case

                                      • I added a way to indicate when all the EOLs are correct

                                      • Finally, I modified the regex used to detect false EOLs when the file is supposed to be a Windows file

                                      So, I changed :

                                      false_EOL = {0:'\r[^\n]|[^\r]\n',   # Miss \n AFTER OR \r BEFORE as editor.getEOLMode() = 0 ( Windows   EOL )
                                      

                                      To :

                                      false_EOL = {0:'$[^\r][^\n]',  # Miss the TWO chars \r\n at 'end of line' as editor.getEOLMode() = 0 ( Windows   EOL )
                                      

                                      Because, with huge files, the former syntax would lead to a RuntimeError from the regex engine. With the latter one, everything seems to work better!
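If the regex engine's complexity limit is the worry on huge files, one alternative (not something proposed in this thread) is to skip regex entirely and count substrings, which is a simple linear pass. A minimal sketch on a plain Python string, with eol_counts and is_consistent as made-up helper names:

```python
def eol_counts(text):
    """Count CRLF pairs, lone CRs and lone LFs without using a regex."""
    crlf = text.count('\r\n')
    # Every CRLF pair contributes one CR and one LF to the raw counts.
    lone_cr = text.count('\r') - crlf
    lone_lf = text.count('\n') - crlf
    return {'CRLF': crlf, 'CR': lone_cr, 'LF': lone_lf}

def is_consistent(text):
    """True when at most one kind of line ending occurs."""
    return sum(1 for n in eol_counts(text).values() if n) <= 1
```

The counts also show which convention dominates, which can help decide what to convert to.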


                                      Now, to be sure that your file contains normalized EOLs only, simply run, consecutively, the two commands below :

                                      • For a Windows file :
                                      Edit > EOL conversion > Unix (LF)
                                      Edit > EOL conversion > Windows (CR LF)
                                      
                                      • For a Unix file :
                                      Edit > EOL conversion > Macintosh (CR)
                                      Edit > EOL conversion > Unix (LF)
                                      
                                      • For a Macintosh file :
                                      Edit > EOL conversion > Unix (LF)
                                      Edit > EOL conversion > Macintosh (CR)
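The two-step menu conversion makes Notepad++ rewrite every line ending. Outside the editor, the same normalization can be sketched in one pass over a plain string. This is only an illustration of the idea, not a description of what Notepad++ does internally, and normalize_eols is a hypothetical name.

```python
import re

ANY_EOL = re.compile(r'\r\n|\r|\n')  # CRLF must come first so it is not split in two

def normalize_eols(text, eol='\r\n'):
    """Rewrite every line ending in text as eol."""
    return ANY_EOL.sub(lambda m: eol, text)
```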
                                      

                                      Best regards,

                                      guy038

                                      • Alan Kilborn @guy038

                                        @guy038 said in Search for inconsistent line endings with a regex? (part 2):

                                        Now, to be sure that your file contains normalized EOLs only, simply run, consecutively, the two commands below

                                        OR… have your script do it. Add these lines into your script, after the indicated existing lines:

                                        def check_eol(match):                                                  # <--- existing line in script
                                            global check                                                       # <--- existing line in script
                                            check = False                                                      # <--- existing line in script
                                            #notepad.messageBox('Different EOLS detected','EOL Mismatch', 0)   # <--- existing line in script, but now turned into a comment
                                            line_of_first_mismatch = editor.lineFromPosition(match.span(0)[1])
                                            notepad.messageBox('Different EOLS detected -- the first inconsistency is on line ' + str(line_of_first_mismatch),'EOL Mismatch', 0)
                                            user_input = notepad.prompt('Convert all line-endings in file?\r\nIf so, enter 0 for CRLF, 1 for CR, 2 for LF',
                                                'INCONSISTENT LINE-ENDINGS DETECTED!', editor.getEOLMode())
                                            if user_input is not None:
                                                desired_eol_index = int(user_input)
                                                if 0 <= desired_eol_index <= 2:
                                                    eol_cmd_list = [
                                                        MENUCOMMAND.FORMAT_TODOS,
                                                        MENUCOMMAND.FORMAT_TOMAC,
                                                        MENUCOMMAND.FORMAT_TOUNIX,
                                                    ]
                                                    if desired_eol_index == editor.getEOLMode():
                                                        notepad.menuCommand(eol_cmd_list[(desired_eol_index + 1) % 3])  # change to undesired line-endings
                                                    notepad.menuCommand(eol_cmd_list[desired_eol_index])  # change to desired line-endings
                                        

                                        Note also that I took the liberty of adding in some logic to tell you which line number has the first inconsistent line-ending.
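One thing to watch in the snippet above: int(user_input) raises ValueError if something non-numeric is typed into the prompt. A small defensive helper could look like the sketch below; parse_eol_choice is an invented name, and it assumes, as the script does, that a cancelled prompt yields None.

```python
def parse_eol_choice(user_input):
    """Return 0, 1 or 2 for a valid answer, or None if cancelled or invalid."""
    if user_input is None:          # prompt was cancelled
        return None
    try:
        choice = int(user_input)
    except ValueError:              # e.g. 'abc' or an empty string
        return None
    return choice if 0 <= choice <= 2 else None
```

The script could then test the returned value against None instead of calling int() directly.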

                                        • guy038

                                          Hello, @alan-kilborn,

                                          I’ll study your last solution, on Monday 18 ( Again, I’m away on a three-day ski trip 😉 )

                                          Best Regards,

                                          guy038

                                          • guy038

                                            Hello, @ekopalypse, @alan-kilborn, and All,

                                            As you proposed, @alan-kilborn, the enhanced script becomes :

                                            check = True
                                            
                                            false_EOL = {0:'$[^\r][^\n]',  # Miss the TWO chars \r\n at 'end of line' as editor.getEOLMode() = 0 ( Windows   EOL )
                                                         1:'\n',           # Should be \r                             as editor.getEOLMode() = 1 ( Macintosh EOL )
                                                         2:'\r',           # Should be \n                             as editor.getEOLMode() = 2 ( Unix      EOL )
                                                        }
                                            
                                            def check_eol(match):
                                                global check
                                                check = False
                                                line_of_first_mismatch = editor.lineFromPosition(match.span(0)[1])
                                                notepad.messageBox('Different EOLS detected -- the first inconsistency is on line ' + str(line_of_first_mismatch),'EOL Mismatch', 0)
                                                user_input = notepad.prompt('Convert all line-endings in file?\r\nIf so, enter 0 for CRLF, 1 for CR, 2 for LF',
                                                    'INCONSISTENT LINE-ENDINGS DETECTED!', editor.getEOLMode())
                                                if user_input is not None:
                                                    desired_eol_index = int(user_input)
                                                    if 0 <= desired_eol_index <= 2:
                                                        eol_cmd_list = [
                                                            MENUCOMMAND.FORMAT_TODOS,
                                                            MENUCOMMAND.FORMAT_TOMAC,
                                                            MENUCOMMAND.FORMAT_TOUNIX,
                                                        ]
                                                        if desired_eol_index == editor.getEOLMode():
                                                            notepad.menuCommand(eol_cmd_list[(desired_eol_index + 1) % 3])  # change to UNDESIRED line-endings
                                                        notepad.menuCommand(eol_cmd_list[desired_eol_index])                # change to DESIRED   line-endings
                                            
                                            editor.research(false_EOL[editor.getEOLMode()],     # regex to search for
                                                            check_eol,                          # function to call if regex match
                                                            0,                                  # re flags
                                                            0,                                  # START of file
                                                            editor.getLength(),                 # END   of file
                                                            1)                                  # count ( at FIRST match )
                                            
                                            if check == True:
                                                notepad.messageBox('All EOLS correct','EOL check', 0)
                                            

                                            Now, given this simple text :

                                            This
                                            is
                                            a
                                            little
                                            test
                                            to   
                                            try
                                            if
                                            OK
                                            
                                            • With Windows (CR LF) in the status bar

                                            • With line 4 ending with CR

                                            • line 6 ending with 3 spaces + LF

                                            • And all the other lines ending with CRLF

                                            When running the script, it said :

                                            Different EOLS detected -- The first inconsistency is on line 6, although it should be on line 4 ending with CR !
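One possible reading of this result: in $[^\r][^\n], the class [^\r] can never consume a CR, so the lone CR ending line 4 cannot match, while the LF ending line 6 does satisfy [^\r]. The sketch below rebuilds the test text as a plain Python string and uses the lookaround pattern from earlier in the thread to locate the first non-CRLF ending. first_bad_line is a made-up helper, and Python's re is standing in for the Boost engine Notepad++ uses, so treat it as an illustration rather than a drop-in fix.

```python
import re

NOT_CRLF = re.compile(r'\r(?!\n)|(?<!\r)\n')  # lone CR or lone LF
ANY_EOL = re.compile(r'\r\n|\r|\n')

def first_bad_line(text):
    """Return the 1-based line number of the first non-CRLF ending, or None."""
    m = NOT_CRLF.search(text)
    if m is None:
        return None
    # Lines before the match = number of EOLs that precede it.
    return len(ANY_EOL.findall(text, 0, m.start())) + 1

# Line 4 ends with a lone CR, line 6 with three spaces plus LF, the rest with CRLF.
sample = 'This\r\nis\r\na\r\nlittle\rtest\r\nto   \ntry\r\nif\r\nOK\r\n'
print(first_bad_line(sample))  # prints 4
```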


                                            Still searching for other oddities :-)

                                            Best Regards,

                                            guy038
