Search for inconsistent line endings with a regex? (part 2)
-
Creating “part 2” because the original is now locked.
…So I love it when I do an internet search to find an answer to something and the first hit that comes up is me posting the same question 2 years ago…
…even better would have been a good answer to that question…but alas…
So with some new knowledge I will attempt my own new answer. I think I can
Search for inconsistent line endings with a regex
with the following:
(?s).*?(?:(?=\r\n).*?(?:(?=\r[^\n])|(?=[^\r]\n)))|(?:(?=[^\r]\n).*?(?=\r))|(?:(?=\r[^\n]).*?(?=\n))
The goal with this is to search thru a Notepad++ editing tab buffer.
So the regex endeavors to work this way:
- pick off a first line ending (of any valid type)
- look further in the text for a line ending that doesn’t match the first type
If there is a match of the overall regex, I know that I have a line-ending “problem” in the current editing tab.
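The two steps above can also be sketched outside of a single regex. This is only an illustration in plain Python (not PythonScript); the function name `first_eol_mismatch` is made up for the example, and it assumes the buffer text is already available as a string.

```python
import re

# CRLF must be listed first so it is not split into a CR followed by an LF
EOL = re.compile(r'\r\n|\r|\n')

def first_eol_mismatch(text):
    """Return (first_eol, mismatching_eol) match objects, or None if consistent."""
    matches = EOL.finditer(text)
    first = next(matches, None)       # step 1: pick off the first line ending
    if first is None:
        return None                   # no line endings at all
    for m in matches:                 # step 2: look for one of a different type
        if m.group() != first.group():
            return first, m
    return None
```

A non-None result plays the same role as a match of the overall regex: the text has a line-ending "problem".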
Comments? I’d love it if someone could expose holes in this. :)
-
…So I love it when I do an internet search to find an answer to something and the first hit that comes up is me posting the same question 2 years ago…
i love such events.
this can only be beaten by finding an answer to something you need today, and the best answer was written 2 years ago … by yourself ;-)
ps, i remain curiously intrigued: who the hell locked that thread back then, and why ?
-
@Meta-Chuh said:
this can only be beaten
True that!
who the hell locked that thread back then, and why ?
My guess is that it was locked by “Father Time”! (i.e., age + inactivity)
-
Maybe I’ll shoot my own holes in it, for practical reasons:
editor.research(r'(?s).*?(?:(?=\r\n).*?(?:(?=\r[^\n])|(?=[^\r]\n)))|(?:(?=[^\r]\n).*?(?=\r))|(?:(?=\r[^\n]).*?(?=\n))'
<type 'exceptions.RuntimeError'>: The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
-
BTW…I ended up doing this instead of a huge complex one-shot regex:
inconsistent_line_endings = False
user_visible_text = editor.getTextRange(start_pos, end_pos)
if '\r\n' in user_visible_text:
    if re.search(r'[^\r]\n', user_visible_text):
        inconsistent_line_endings = True
    if not inconsistent_line_endings:
        if re.search(r'\r[^\n]', user_visible_text):
            inconsistent_line_endings = True
elif '\n' in user_visible_text:
    if '\r' in user_visible_text:
        inconsistent_line_endings = True
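One thing to watch with the character-class tests in that segment: `[^\r]\n` needs a character before the LF and `\r[^\n]` needs one after the CR, so a bare LF at the very start of the range or a bare CR at the very end goes unnoticed. A hedged sketch of the same check with lookarounds instead, in plain Python on a string (the name `has_mixed_eols` is hypothetical):

```python
import re

def has_mixed_eols(text):
    """True if text contains more than one kind of line ending."""
    kinds = 0
    kinds += '\r\n' in text                              # CRLF
    kinds += re.search(r'(?<!\r)\n', text) is not None   # bare LF, even at offset 0
    kinds += re.search(r'\r(?!\n)', text) is not None    # bare CR, even at the very end
    return kinds > 1
```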
-
Hello @alan-kilborn and All,
Really, Alan, I think you’re making it difficult for yourself !!
So, given a file, if you just want to grab all the inconsistent “End of Lines” AND that you expect your file to be :
- A Windows file, use the regex
\r(?!\n)|(?<!\r)\n
- A Unix file, use the regex
\r\n|\r(?!\n)
- A Mac file, use the regex
\r\n|(?<!\r)\n
If you just want to select all text from current position till the first inconsistent “End of Lines” AND that you expect your file to be :
- A Windows file, use the regex
(?s).*?(\r(?!\n)|(?<!\r)\n)
- A Unix file, use the regex
(?s).*?(\r\n|\r(?!\n))
- A Mac file, use the regex
(?s).*?(\r\n|(?<!\r)\n)
Finally, if you prefer to highlight ( and bookmark ) all the lines, containing inconsistent “End of Lines” AND that you expect your file to be :
- A Windows file, use the regex
(?-s).*(\r(?!\n)|(?<!\r)\n)
- A Unix file, use the regex
(?-s).*(\r\n|\r(?!\n))
- A Mac file, use the regex
(?-s).*(\r\n|(?<!\r)\n)
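For experimenting outside Notepad++, the first set of three regexes (the ones that grab only the inconsistent EOLs) also runs unchanged under Python's `re` module. A small sketch; the dictionary keys and the function name `inconsistent_eol_offsets` are made up for the example:

```python
import re

# The three patterns from the post, keyed by the EOL style the file is expected to use
INCONSISTENT_EOL = {
    'windows': re.compile(r'\r(?!\n)|(?<!\r)\n'),
    'unix':    re.compile(r'\r\n|\r(?!\n)'),
    'mac':     re.compile(r'\r\n|(?<!\r)\n'),
}

def inconsistent_eol_offsets(text, expected):
    """Return the start offset of every EOL that does not fit the expected style."""
    return [m.start() for m in INCONSISTENT_EOL[expected].finditer(text)]
```

An empty list means every line ending in the text already has the expected form.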
Best Regards
guy038
P.S. :
I’m surely missing something! Why don’t you use the Edit > EOL Conversion option? Whatever mix of EOLs your current file contains, first choose any non-grayed option, then choose your desired EOL => all the inconsistent EOLs should have been changed into the expected type ;-))
-
Hello @guy038 !
Well, I didn’t really explain my goal, did I?
Recently I’ve been manipulating some data where I can get mixed line-endings in a file (initially). So I want to be “made aware” of the situation without having to do anything. So I’ve created a script whereby if mixed line-endings occur in one area (between start_pos and end_pos, see script segment above) then I will, in code, turn visible line-endings ON. This way the current situation hits me in the face.
Anyway, I don’t know the file type (Windows/Linux; Mac really isn’t happening) in advance so I can’t do stuff based upon that. And I can’t ask the “notepad” Python object what the current file format is anyway, because if I happen to have a Windows file open in editor1/view0 and a Linux file open in editor2/view1, bad things happen to the logic.
Probably the above makes no real sense, but I’m just trying to say that I don’t think I’m making it hard on myself, I’m just doing what it takes for a solution. And I did arrive at that solution today; stuff is working fine now. :)
-
Hi, @alan-kilborn and All,
Yeeeeaaaah! I succeeded in building a search regex which grabs all characters from the first EOL character(s) found up to, but excluding, the nearest other EOL character(s) of a different type ;-))
I’m using the free-spacing mode for better comprehension!
(?sx)
(
  \r\n       .*?  (?= (?<!\r)\n | \r(?!\n) )   |   # From a TRUE \r\n...to....a TRUE (\n|\r) EXCLUDED
  (?<!\r)\n  .*?  (?= \r\n | \r(?!\n) )        |   # From a TRUE \n.....to....a TRUE (\r\n|\r) EXCLUDED
  \r(?!\n)   .*?  (?= \r\n | (?<!\r)\n )           # From a TRUE \r.....to....a TRUE (\r\n|\n) EXCLUDED
)
To test my regex, just paste, in a new N++ tab, the sample text below ( the v7.6 change.log, slightly modified ) :
Notepad++ v7.6 new feature and bug-fixes:<CRLF>
<CR>
<CRLF>
1. Add Built-in Plugins Admins. Users can install, update and remove plugins by some clicks via Plugins Admin:<CRLF>
https://notepad-plus-plus.org/features/plugin-admin.html<LF>
2. Change plugin loading method: Remove the legacy plugin loading way and apply only the new plugin loading method.<CRLF>
3. Add new message NPPM_GETPLUGINHOMEPATH in Notepad++ API for plugin, so plugin can get its path easily.<CR>
4. Fix a regression of performance issue while word wrap option is enable.<CRLF>
5. Fix a performance issue for switching back to folded document.<LF>
<CR>
6. Fix crash issue due to Unix style path input in Open file dialog.<CR>
7. Fix UTF-8 detection problem: 4 byte characters UTF-8 character can be detected now.<CR>
8. Enhance/Fix encoding detection/problem.<CRLF>
<LF>
<CRLF>
<CR>
9. Fix auto-indent issue by typing Enter on empty line.<LF>
10. Fix "Close all but this" behaviour if multiple views are present and some files are dirty.<CR>
11. Fix tool tip in document switcher showing the old name issue (after being renamed).<CRLF>
<LF>
<CR>
<CRLF>
<LF>
<LF>
<CRLF>
<CRLF>
<CR>
12. Add autoit and lua autoCompletion<CRLF>
<CRLF>
<CR>
<CRLF>
Included plugins:<CRLF>
<CRLF>
<LF>
1. NppExport v0.2.8 (32-bit x86 only)<LF>
2. Converter 4.2.1<LF>
3. Mime Tool 2.1<CRLF>
4. DSpellCheck 1.4.6<LF>
<LF>
<LF>
<CR>
Updater (Installer only):<CRLF>
<CRLF>
* WinGup (for Notepad++) v5.0.4<CRLF>
Then, with that regex S/R, below, we are going to change, first, in this new tab, the EOL characters to get a final text with all forms of line-breaks :
SEARCH
(?-i)(?:(<CRLF>)|(<LF>)|(<CR>))\R
REPLACE
(?1\r\n)(?2\n)(?3\r)
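If you would rather build the test data from a script than through the Replace dialog, the same marker-to-EOL swap can be sketched in plain Python. Python's `re` module has neither `\R` nor the conditional `(?1...)` replacement syntax, so this sketch spells the line break out and uses a lookup in a replacement function; `markers_to_eols` and `EOL_FOR_MARKER` are names invented for the example.

```python
import re

# Which real line ending each literal marker stands for
EOL_FOR_MARKER = {'<CRLF>': '\r\n', '<LF>': '\n', '<CR>': '\r'}

def markers_to_eols(text):
    """Replace each marker plus the line break after it with the EOL it names."""
    return re.sub(r'(<CRLF>|<LF>|<CR>)(?:\r\n|\r|\n)',
                  lambda m: EOL_FOR_MARKER[m.group(1)],
                  text)
```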
Now, you can play around, with my free-spacing regex above ;-))
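For playing around outside the editor, the free-spacing regex above should also compile under Python's `re` module, with `re.DOTALL | re.VERBOSE` standing in for `(?sx)`. This is only a sketch: Notepad++ uses the Boost regex engine, so behaviour on large buffers may differ, and the name `MIXED_EOL` is made up here.

```python
import re

MIXED_EOL = re.compile(r"""
    (
      \r\n       .*?  (?= (?<!\r)\n | \r(?!\n) )   |   # CRLF ... up to a bare LF or bare CR
      (?<!\r)\n  .*?  (?= \r\n | \r(?!\n) )        |   # bare LF ... up to a CRLF or bare CR
      \r(?!\n)   .*?  (?= \r\n | (?<!\r)\n )           # bare CR ... up to a CRLF or bare LF
    )
""", re.DOTALL | re.VERBOSE)

def first_mixed_span(text):
    """Return the (start, end) span of the first match, or None if EOLs are consistent."""
    m = MIXED_EOL.search(text)
    return m.span() if m else None
```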
Cheers,
guy038
P.S. :
If a file does not contain inconsistent EOLs ( i.e. if all the line-breaks of the current file have the same form ), NO match occurs, as expected !!
-
slightly different approach
# keys are the values returned by editor.getEOLMode(): 0 = CRLF, 1 = CR, 2 = LF
regex_dict = {0:'\r[^\n]|[^\r]\n',
              1:'\n',
              2:'\r',}

def check_eol(match):
    notepad.messageBox('Different EOLS detected','EOL Mismatch', 0)

editor.research(regex_dict[editor.getEOLMode()],  # regex
                check_eol,                        # function to call
                0,                                # re flags
                0,                                # start
                editor.getTextLength(),           # end
                1)                                # count
-
I either forgot about or never knew about editor.getEOLMode(). Perhaps I could have used that knowledge YESTERDAY before I finished my design the way I did. :(
but Thank you.
-
So it appears the reason (maybe) that I never noticed editor.getEOLMode() before is that I “grew up” here on Community sample scripts where it seems that notepad.getFormatType() was used much more frequently for a very similar purpose. On an integer basis the functions even return the same number values!
I suppose the notepad.getFormatType() function is for what Notepad++ thinks the setting is for a file upon loading, and after that it follows the current user setting for “EOL conversion”…and the editor.getEOLMode() function usually follows the notepad.getFormatType() setting, but could be set independently (via PS code call to editor.setEOLMode()).
I did verify this, the editor.getEOLMode() value follows the notepad.getFormatType() value, and if you editor.setEOLMode() to something different than the Notepad++ EOL setting, and then switch the active tab and then come back to the original tab, editor.getEOLMode() will again be back at the notepad.getFormatType() setting. [A fair number of settings work this way: You can change them via PS editor functions, but a switch of tabs and a return will find them reset to original Notepad++ controlling values.]
For my purposes, however, the editor function is valuable to know the setting for editor1 and editor2, without, say, having to make editor2 the active editor–when it isn’t currently–and then calling notepad.getFormatType().
…if that all makes any kind of sense to you. :)
-
You are absolutely right and this is something one needs to keep in mind.
Whenever possible, a notepad object method should be used to stay in sync with npp. Npp itself does, as far as I understand the code, use SCI_SETEOLMODE and SCI_GETEOLMODE to set/get the current eol and as far as I have understood, scintilla only checks the first line to determine the eol mode.
I would say, to get a value it is safe to use editor object methods but, as said, if one wants to change something then notepad object methods should be preferred.
-
My guess is that it was locked by “Father Time”! (i.e., age + inactivity)
very funny … nooot
- no: topics don’t get locked automatically, when marked as solved.
- yes: topics have to be locked manually.
- no: this topic was not locked, to prevent follow up posts, in order to preserve its extraordinary state for eternity, like keeping an ancient ming vase empty.
- no: there was no content reason to lock this topic.
- no: the community place does not need a clean up, as the separate information exchange does not interfere with the issue tracker readability for developers.
- maybe: this was one of those topics, that got spammed back then, and was locked to contain it a bit.
-
@Ekopalypse said:
Whenever possible, a notepad object method should be used to stay in sync with npp
if one wants to change something then notepad object methods should be preferred
Very much agree. Usually the notepad object provides only get access, e.g. in this case notepad.getFormatType() has no corresponding notepad.setFormatType(). In order to do the set, one must do notepad.menuCommand(MENUCOMMAND.FORMAT_TOUNIX) as an example. This is nice because it keeps the Notepad++ user interface consistent.
Note a very similar discussion involving View -> Show Symbol -> … menu items via script control is found here: https://notepad-plus-plus.org/community/topic/14585/turn-on-off-the-line-ending-symbols-via-script
-
A stretch for staying on-topic, but I found a great way to set up a scenario for inconsistent line-endings, from the hand of @donho himself:
- Open a file with Unix (Linux) line-endings
- Select all (ctrl+a)
- Invoke Plugins -> Mime Tools -> Quoted-printable Encode
Boom. A very mixed line-endings file (Unix and Windows ends) now results.
(I discovered this when I was needing to mime a short file. I decided I don’t like how the Mime Tools plugin does its thing–not just this line-ending thing–and will resort to WinZip’s mime for my future miming needs.)