Find and Display *All* Duplicate Lines

54 Posts 9 Posters 29.6k Views
  • Y
    Yaron @Alan Kilborn
    last edited by Nov 20, 2023, 10:47 PM

    @Alan-Kilborn,

    Thank you for the beautiful script.

    @PeterJones,

    I don’t get Email notifications on new replies to threads I’m watching.
    Any idea?

    Thank you.

    7641557d-7f56-4b5e-8c60-32e90294c6be-תמונה.png

    - Logged in via GitHub.

    • M
      mkupper @Alan Kilborn
      last edited by Nov 20, 2023, 10:51 PM

      @Alan-Kilborn The script is pretty cool.

      I saw in the code that you support double-clicking and so played with that. One puzzle is that it intermittently creates random selections when I double click. For example, for one match I see:

      	Line 19360
      	Line 19364
      

      I double click on Line 19360 and am taken to that line in my text file. I do Ctrl+PageDown to flip to the DupeLineResults.sr file/tab and double click on Line 19364 expecting to be dropped down four lines. Instead the page starts with line 19360 and there is a selection running down to the middle of line 19398. Using the mouse and scroll bar I see that the selection started in the middle of line 991.

      At first I thought the cause was this particular file has a number of non-printable characters, extended Unicode characters, and illegal byte sequences. I created a copy of this file that is restricted to plain ASCII and tested again. The line numbers I noted above are from a plain ASCII file though it’s still classified as UTF-8 in the status line. A regexp search for [ -~\t\r\n]+ gets one hit which runs from the first line to the end which verifies it’s plain ASCII.
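The same plain-ASCII check can be scripted outside the editor. This is a minimal sketch using the character class from the post; the helper name `is_plain_ascii` is made up for illustration, and the trailing `\Z` stands in for "one hit that runs from the first line to the end".

```python
import re

# Character class from the post: printable ASCII (space through tilde)
# plus tab, carriage return, and line feed. \Z anchors the match at the
# very end of the text, so a successful match covers the whole string.
PLAIN_ASCII = re.compile(r"[ -~\t\r\n]+\Z")

def is_plain_ascii(text):
    """Return True if `text` is non-empty and contains only the characters above."""
    # match() anchors at the start; together with \Z the single match
    # must span the entire text.
    return PLAIN_ASCII.match(text) is not None
```

Note that an empty string is reported as not plain ASCII here, because `+` requires at least one character.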

      The random selections are intermittent. Sometimes they happen, and sometimes not.

      This is on Notepad++ v8.6 (32-bit)
      Build time : Nov 18 2023 - 00:41:46
      Path : C:\Program Files (x86)\Notepad++\notepad++.exe
      Command Line : “c:\tmp\tmp”
      Admin mode : OFF
      Local Conf mode : OFF
      Cloud Config : OFF
      OS Name : Windows 11 Home (64-bit)
      OS Version : 22H2
      OS Build : 22621.2715
      Current ANSI codepage : 1252
      Plugins :
      DSpellCheck (1.5)
      mimeTools (2.9)
      NppConverter (4.5)
      NppExport (0.4)
      NppTextFX (0.2.6)
      PythonScript (2)

      • P
        PeterJones @Yaron
        last edited by Nov 20, 2023, 11:01 PM

        @Yaron said in Find and Display *All* Duplicate Lines:

        I don’t get Email notifications on new replies to threads I’m watching.
        Any idea?

        The email server stopped letting the forum connect and send messages some time ago. I haven’t felt that email notifications are worth pestering Don over.

        • Y
          Yaron @PeterJones
          last edited by Nov 20, 2023, 11:33 PM

          @PeterJones,

          It’s a basic functionality IMO. :)
          Thank you.

          • P
            PeterJones @Yaron
            last edited by PeterJones Nov 21, 2023, 1:01 AM Nov 21, 2023, 12:59 AM

            @Yaron said in Find and Display *All* Duplicate Lines:

            It’s a basic functionality IMO. :)

            Mail servers have harsh anti-spam for sending, and forum outgoing emails look like spam to most servers. Without paying for an outgoing mail server that allows emails from forums, the chances are low that anything that we can do will counteract that. And Don isn’t going to pay for such a server.

            And as I pointed out in the FAQ back in the days when it didn’t work at all: the regulars (some, like you and I, who have been here for 8+ years, during which this feature worked for fewer than 8 months, so less than 1/12 of the time we’ve been involved in the forum), have survived the vast majority of the forum’s existence (in this format) without the email notifications – it’s nice to have, but IMO not a critical feature.

            I will contact him, but there is no guarantee we can get it to work, or that he’s interested in spending any more time on that feature of the forum.

            • A
              Alan Kilborn @mkupper
              last edited by Nov 21, 2023, 12:25 PM

              @mkupper said in Find and Display *All* Duplicate Lines:

              One puzzle is that it intermittently creates random selections when I double click

              Hmm, I have not been able to duplicate this. I set up a test file of 25000 lines of random sentences (one per line), then I duplicated 10 of these lines, then random-sorted the file. Then I ran the script and tried to recreate what you are seeing from double-clicking in the results file, and could not. :-(
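A test file along those lines can be generated with a few lines of Python. This is only a sketch of the setup described above, not the method actually used: the word list, the function name `make_test_lines`, and the trailing number that guarantees each base sentence is unique are all assumptions.

```python
import random

# Hypothetical vocabulary for the "random sentences".
WORDS = ["river", "stone", "quiet", "lantern", "forty",
         "maple", "orbit", "copper", "window", "thread"]

def make_test_lines(unique_count=25000, duplicate_count=10, seed=0):
    """Build unique sentences, duplicate a few of them, then shuffle."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    unique = []
    for i in range(unique_count):
        sentence = " ".join(rng.choice(WORDS) for _ in range(6))
        # The index suffix is an assumption: it guarantees that no two
        # base sentences collide by chance.
        unique.append("%s %d." % (sentence.capitalize(), i))
    duplicates = rng.sample(unique, duplicate_count)  # distinct picks
    lines = unique + duplicates
    rng.shuffle(lines)  # the "random-sorted" step
    return lines
```

Writing `"\n".join(make_test_lines())` to a file gives 25010 lines in which exactly 10 sentences occur twice.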

              • A
                Alan Kilborn
                last edited by Nov 21, 2023, 12:28 PM

                Side note:

                Someone asked me via chat why the output file has a .sr extension. The sr stands for “search results”; I tried to emulate Notepad++'s Search results format in the output I created.

                Another reason is that I have a UDL that I made for .sr files which colorizes the output somewhat like N++'s Search results, sample of that:

                ecbf68fa-be30-4548-a6c1-306f37364613-image.png

                • G
                  guy038
                  last edited by guy038 Nov 21, 2023, 1:45 PM Nov 21, 2023, 1:43 PM

                  Hello, @yaron, @coises, @mkupper, @alan-kilborn and All,

                  @alan-kilborn, just a minor bug ( or a possible decision on your part ! ) :

                  Let’s write this text in a new tab :

                  This is a test
                  This is a test
                  
                      This is a second test
                      This is a second test
                  
                          This is a third test
                          This is a third test
                          This is a third test
                  
                  	This is a fourth test
                  	This is a fourth test
                  	This is a fourth test
                  	This is a fourth test
                  
                  		This is the last test
                  		This is the last test
                  		This is the last test
                  		This is the last test
                  		This is the last test
                  

                  Then, running your valuable script, we get this DupeLineResults.sr :

                  Search DUPLICATE LINES (5 sets) in "new 1"
                    Set-1 text: This is a test
                  	Line 1
                  	Line 2
                    Set-2 text: This is a second test
                  	Line 4
                  	Line 5
                    Set-3 text: This is a third test
                  	Line 7
                  	Line 8
                  	Line 9
                    Set-4 text: This is a fourth test
                  	Line 11
                  	Line 12
                  	Line 13
                  	Line 14
                    Set-5 text: This is the last test
                  	Line 16
                  	Line 17
                  	Line 18
                  	Line 19
                  	Line 20
                  

                  I personally was expecting :

                  Search DUPLICATE LINES (5 sets) in "new 1"
                    Set-1 text: This is a test
                  	Line 1
                  	Line 2
                    Set-2 text:     This is a second test
                  	Line 4
                  	Line 5
                    Set-3 text:         This is a third test
                  	Line 7
                  	Line 8
                  	Line 9
                    Set-4 text: 	This is a fourth test
                  	Line 11
                  	Line 12
                  	Line 13
                  	Line 14
                    Set-5 text: 		This is the last test
                  	Line 16
                  	Line 17
                  	Line 18
                  	Line 19
                  	Line 20
                  

                  Where the text located after the regex string Set-\d+ text:\x20 should be identical to the text in the current file.


                  @mkupper, I think that :

                  • When you click on a line of the DupeLineResults.sr file in order to jump to the current file, it’s best to cancel any previous selection before doing the double-click !

                  • Do NOT click on any line Set-### text: ......... Else, you get errors on the Python Console !

                  Best Regards,

                  guy038

                  • A
                    Alan Kilborn @guy038
                    last edited by Nov 21, 2023, 1:56 PM

                    @guy038 said in Find and Display *All* Duplicate Lines:

                    Do NOT click on any line Set-### text: … Else, you get errors on the Python Console !

                    Yea. Bug. The line_in_source_file = line should be placed after the following if m: line.
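The error comes from using the match object before checking that the match succeeded. Below is a minimal sketch of the guarded pattern. It assumes results lines look like the `Line N` entries shown earlier in the thread; the regex and the function name `source_line_from_result` are illustrative and not taken from the script.

```python
import re

# Assumed shape of a clickable results line: optional whitespace, "Line", a number.
RESULT_LINE = re.compile(r"^\s*Line (\d+)\s*$")

def source_line_from_result(text):
    """Return the 1-based source line number, or None for any other kind of line."""
    m = RESULT_LINE.match(text)
    if m:
        # m.group() is only touched once we know the match succeeded,
        # so header lines such as "Set-1 text: ..." fall through quietly.
        return int(m.group(1))
    return None
```

A double-click handler built this way can simply return when it gets `None`, instead of raising in the Python Console.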

                    • A
                      Alan Kilborn @guy038
                      last edited by Alan Kilborn Nov 21, 2023, 1:59 PM Nov 21, 2023, 1:58 PM

                      @guy038 said in Find and Display *All* Duplicate Lines:

                      just a minor bug ( or a possible decision on your part ! )

                      No bug. It was a design decision to NOT show leading whitespace on the duplicate line text.

                      If you don’t like it, easy enough to remove .lstrip() from where it occurs in the code.
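To illustrate the difference between the two outputs discussed above, here is a small stand-alone sketch. It is not the script's code: the function names and the `strip_leading` switch are hypothetical, the output layout is copied from the sample results, and skipping blank lines is inferred from that sample (the blank lines there are not reported as a set).

```python
from collections import OrderedDict

def find_duplicate_sets(lines):
    """Group exact line texts; keep those occurring on two or more lines."""
    seen = OrderedDict()  # preserves first-occurrence order
    for number, text in enumerate(lines, 1):  # 1-based line numbers
        if not text.strip():
            continue  # assumption: blank lines are not reported
        seen.setdefault(text, []).append(number)
    return [(text, nums) for text, nums in seen.items() if len(nums) > 1]

def format_results(sets, name, strip_leading=True):
    """Render the sets in the results layout shown in the thread."""
    out = ['Search DUPLICATE LINES (%d sets) in "%s"' % (len(sets), name)]
    for index, (text, nums) in enumerate(sets, 1):
        # Only the displayed text is stripped; grouping used the exact text.
        shown = text.lstrip() if strip_leading else text
        out.append("  Set-%d text: %s" % (index, shown))
        out.extend("\tLine %d" % n for n in nums)
    return "\n".join(out)
```

Because grouping uses the exact text while the display is stripped, two sets that differ only in indentation would show identical `Set-N text:` headers when `strip_leading` is on.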

                      • G
                        guy038
                        last edited by Nov 21, 2023, 4:56 PM

                        Hi, @alan-kilborn and All,

                        Sorry, Alan, but the two modifications fail :-((


                        Whatever the choice, I get the message :

                        Traceback (most recent call last):
                          File "D:\@@\792\plugins\Config\PythonScript\scripts\Test_Alan.py", line 89, in doubleclick_callback
                            if m:
                        AttributeError: 'NoneType' object has no attribute 'group'
                        

                        BTW, don’t rely on the line number, as I added some comments at the beginning of the script


                        Deleting the .lstrip() string leaves the lines Set-### text: ........ as they were before, without any change !

                        BR

                        guy038

                        • A
                          Alan Kilborn @guy038
                          last edited by Alan Kilborn Nov 21, 2023, 6:19 PM Nov 21, 2023, 6:08 PM

                          @guy038 said in Find and Display *All* Duplicate Lines:

                          but the two modifications fail

                          Did you restart Notepad++ after making the modifications?


                          BTW, these lines control that:

                          if __name__ == '__main__':
                              try:
                                  fadadl
                              except NameError:
                                  fadadl = FADADL()
                              fadadl.run()
                          

                          The first time the script is run during a Notepad++ invocation (don’t want to use the word “session” here), fadadl doesn’t exist, so the NameError happens and the fadadl = FADADL() line executes (key part: registering the callback), and then the run method is called.

                          Because the callback is already there the second and later times the script is executed, the fadadl = FADADL() line does NOT execute; we don’t want multiple copies of the callback (otherwise a double-click on a Line x line would run the same callback code two or even more times). Thus any changes you’ve made to the code do not get picked up, and only the run method is called (with the code that was in place during the first execution).

                          There are other ways to accomplish these goals, but I’ve never fully trusted trying to remove an existing callback. I consider needing to restart Notepad++ a small limitation, because most users would just use the script, not modify it.
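The create-once behaviour, and why edits are not picked up, can be reproduced in plain Python without Notepad++. In this sketch a list stands in for the editor's callback registry and a shared dictionary stands in for the interpreter namespace that survives between script runs; `Handler`, `EditedHandler`, and `SCRIPT_BODY` are made-up names.

```python
registered_callbacks = []  # stand-in for the editor's callback registry

class Handler(object):
    def __init__(self):
        # Registration happens once per object creation.
        registered_callbacks.append(self.on_double_click)
        self.runs = 0

    def on_double_click(self, args=None):
        pass

    def run(self):
        self.runs += 1

class EditedHandler(Handler):
    """Pretend this is the class after the user edited the script."""
    def run(self):
        self.runs += 100

# The same guard as in the script, kept as text so it can be "run" repeatedly.
SCRIPT_BODY = """
try:
    handler
except NameError:
    handler = Handler()
handler.run()
"""

namespace = {"Handler": Handler}       # persists between runs, like the plugin's interpreter
exec(SCRIPT_BODY, namespace)           # first run: creates the object, registers the callback
namespace["Handler"] = EditedHandler   # the user "edits" the class
exec(SCRIPT_BODY, namespace)           # second run: guard finds the old object, edit is ignored
```

After both runs there is still a single registered callback, and `namespace["handler"]` is still an instance of the original `Handler` whose `run` was called twice.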

                          I hope this makes sense.

                          • M
                            mkupper @Alan Kilborn
                            last edited by Nov 21, 2023, 6:21 PM

                            @Alan-Kilborn and others.

                            I’m still getting and puzzling over the random selections issue. I don’t have any selections active and so for now have been trimming my test file down. At present it’s 787 lines of plain ASCII with no line longer than 100 characters.

                            As you and others are not seeing the issue with larger files I’m going to change tactics and set up a bare bones portable v8.6 x32. My day to day installed copy of Notepad++ has cruft from many years of upgrading and is also x32 for compatibility with a couple of obsolete plugins. It’s using PythonScript version 2 which I assume is still available.

                            • G
                              guy038
                              last edited by guy038 Nov 21, 2023, 7:06 PM Nov 21, 2023, 6:54 PM

                              Hello, @alan-kilborn and All,

                              Oh…yes. Alan, I am unforgivable and deeply sorry ! Of course, everything worked fine after I stopped and restarted Notepad++. ;-))

                              Just a lesson : No good to be too lazy. Things must be run in the right order !

                              Best Regards,

                              guy038

                              P.S. :

                              Finally, I understand your design decision… once you’re no longer surprised to see several sets that seem to concern only one set !

                              • A
                                Alan Kilborn @mkupper
                                last edited by Alan Kilborn Nov 22, 2023, 12:18 PM Nov 22, 2023, 12:13 PM

                                @mkupper said in Find and Display *All* Duplicate Lines:

                                I’m still getting and puzzling over the random selections issue.

                                Is it only a “random selection” issue? That is, if you pretend the selection isn’t there, does your caret (at one end of the selection) end up on the correct line? If that’s the case, maybe we can simply cancel the selection.

                                It seems maybe there is something of a race going on here. The double-click makes Scintilla want to do something in the current (results) file – it wants to select a double-clicked word – but maybe that processing is somehow delayed until the source file is activated, and then bogus positions are used as Scintilla finishes making its selection? This goes against my understanding of how it should work but…

                                • M
                                  Michael Vincent @Alan Kilborn
                                  last edited by Michael Vincent Nov 22, 2023, 1:48 PM Nov 22, 2023, 1:47 PM

                                  @Alan-Kilborn said in Find and Display *All* Duplicate Lines:

                                  output file has a .sr extension. The sr stands for “search results”; I tried to emulate Notepad++'s Search results format in the output I created.

                                  Another reason is that I have a UDL that I made for .sr files which colorizes the output

                                  A bit off topic for this current discussion, but Notepad++ has a “lexer” for “Internal Search” called “searchResult” that has entries in both ‘langs.model.xml’ and ‘stylers.model.xml’. You can see it in:

                                  https://github.com/notepad-plus-plus/notepad-plus-plus/blob/97dd708e233559a4a0bd819f7ee72892c27e1a66/PowerEditor/src/ScintillaComponent/ScintillaEditView.cpp#L116

                                  The “lexer” is set here:

                                  notepad-plus-plus/PowerEditor/src/ScintillaComponent/FindReplaceDlg.cpp:
                                  	if (_scintView.execute(SCI_GETLEXER) == SCLEX_NULL)
                                  	{
                                  		_scintView.setLexer(L_SEARCHRESULT, LIST_NONE); // Restore searchResult lexer in case the lexer was changed to SCLEX_NULL in GotoFoundLine()
                                  	}
                                  

                                  which through setLexer() eventually calls:

                                  notepad-plus-plus/PowerEditor/src/ScintillaComponent/ScintillaEditView.cpp:
                                  bool ScintillaEditView::setLexerFromLangID(int langID) // Internal lexer only
                                  {
                                  	if (langID >= L_EXTERNAL)
                                  		return false;

                                  	const char* lexerNameID = _langNameInfoArray[langID]._lexerID;
                                  	execute(SCI_SETILEXER, 0, reinterpret_cast<LPARAM>(CreateLexer(lexerNameID)));
                                  	return true;
                                  }
                                  

                                  That functionality can be “duplicated” in PythonScript and in fact I have a “hidder Lexer” script based on work from @PeterJones and others that can enable some of Lexilla’s lexers that N++ does not expose. I’m wondering why we can’t just:

                                  from ctypes import windll, addressof, create_unicode_buffer
                                  from ctypes.wintypes import HWND, UINT, WPARAM, LPARAM
                                  
                                  from Npp import editor, notepad
                                  
                                  SendMessage          = windll.user32.SendMessageW
                                  SendMessage.argtypes = [HWND, UINT, WPARAM, LPARAM]
                                  SendMessage.restype  = LPARAM
                                  
                                  NPPM_CREATELEXER = (1024 + 1000 + 110)
                                  
                                  _lexer     = create_unicode_buffer('searchResult')
                                  
                                  ilexer_ptr = SendMessage(notepad.hwnd, NPPM_CREATELEXER, 0, addressof(_lexer))
                                  editor.setILexer(ilexer_ptr)
                                  
                                  editor.colourise(0, -1)
                                  

                                  which (I hope) is the smallest self-contained example of what my hidder Lexer script does. I tried this, but it does not lex the document as “searchResult” or “Internal Search”. I’m sure it’s a bit different in that I’m not trying to activate a Lexilla lexer, but since this lexer name seems to be defined and used within N++, I wonder why PythonScript cannot activate it.

                                  Cheers.

                                  • M
                                    mkupper @Alan Kilborn
                                    last edited by PeterJones Nov 22, 2023, 7:24 PM Nov 22, 2023, 7:04 PM

                                    @Alan-Kilborn said in Find and Display *All* Duplicate Lines:

                                    Is it only a “random selection” issue? That is, if you pretend the selection isn’t there, does your caret (at one end of the selection) end up on the correct line? If that’s the case, maybe we can simply cancel the selection.

                                    When a selection happens I am taken to the middle of a page with the selection ending at the line I’m on. The intended target line is on the page, sometimes it’s the top line that is visible, and sometimes the target line is at the bottom of the visible page.

                                    This item is unrelated to the random selection issue. I don’t know if it’s by design but sometimes double clicking takes me to a page with the target line at the very top and other times the target is at the very bottom of the selected page. This slows me down because after double clicking I need to then visually find the newly selected line. The choice of top or bottom seems random, even when retesting a line such as 6208.

                                    I now have two workarounds as I understand better what is happening with the random top/bottom combined with the intermittent appearance of a selection. When I get a selection I can go back to DupeLineResults.sr and double click again. The odds are it will work.

                                    Another workaround is to be mindful of the line number I intend to go to when double clicking in DupeLineResults.sr. If I get a selection then I know the line I want is either the first or last line on the page and so can look for it and move there without needing to re-run the double-click.

                                    The glitch is intermittent and can’t be solved by just cancelling the selection. However, a possible workaround in the code is to fetch the current line number. If you are not on the expected line then cancel the selection and re-run going to the target line.
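The check-and-retry idea from the previous paragraph could look roughly like this. It is a sketch only and has not been tried against the glitch described here; if the stray selection is created after the callback returns, a check inside the callback would not see it. The `editor` argument is meant to be PythonScript's `editor` object, the method names are its Scintilla wrappers, and `on_target` and `goto_line_checked` are hypothetical names.

```python
def on_target(editor, target_line):
    """True if the caret is on target_line (0-based) and nothing is selected."""
    caret_line = editor.lineFromPosition(editor.getCurrentPos())
    return caret_line == target_line and editor.getSelectionEmpty()

def goto_line_checked(editor, target_line, max_attempts=3):
    """Go to target_line; if we did not land there cleanly, cancel and retry."""
    for _ in range(max_attempts):
        editor.gotoLine(target_line)
        if on_target(editor, target_line):
            return True
        # Collapse any stray selection onto the start of the target line
        # before trying again.
        editor.setEmptySelection(editor.positionFromLine(target_line))
    return on_target(editor, target_line)
```

Inside a double-click callback, `goto_line_checked(editor, line_number - 1)` would be called once the source file is active, since results show 1-based line numbers while Scintilla lines are 0-based.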

                                    When I double click in DupeLineResults.sr I’m seeing that one of three things will happen:

                                    1. I am taken to the selected line which is positioned at the top of the page.
                                    2. I am taken to the selected line which is positioned at the bottom of the page. This seems to happen more often than being taken to the top of the page.
                                    3. I am taken to the middle of a page though trending towards the upper half and will have a selection running from the top of the page to my spot in the middle. The intended target line has always been either the first or last visible line on the page. The selection has always started far up the file and tends to be near the top.

                                    I don’t know if it matters but I run Notepad++ (and all apps) in full screen mode and have a single monitor. It lets me focus on what I’m working on. My fingers are well versed in the keystrokes needed to navigate among tabs of those apps that have tabs and are well versed in Alt-Tabbing to other apps.

                                    • A
                                      Alan Kilborn @mkupper
                                      last edited by Alan Kilborn Nov 22, 2023, 7:21 PM Nov 22, 2023, 7:18 PM

                                      @mkupper said in Find and Display *All* Duplicate Lines:

                                      sometimes it’s the top line that is visible, and sometimes the target line is at the bottom of the visible page

                                      The problem you are experiencing seems to go deeper than this, but I will say that there isn’t a guarantee as to where in the viewport a line that you are moving to will appear.

                                      Scintilla documentation for .gotoLine for example, says, “…scrolls the view (if needed) to make it (the line) visible”.

                                      Programmers can pull extra duty to ensure that a line appears in the viewport where they want it to; example HERE.
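As one possible illustration of that "extra duty" (not the linked example), the first visible line can be set explicitly after the jump. The helper names below are hypothetical; `editor` is assumed to be PythonScript's `editor` object, and `gotoLine`, `visibleFromDocLine`, `linesOnScreen`, and `setFirstVisibleLine` are its Scintilla wrappers.

```python
def first_visible_line_for(target_line, lines_on_screen, fraction=0.5):
    """Display line to put at the top so the target sits `fraction` of the way down."""
    offset = int(lines_on_screen * fraction)
    return max(0, target_line - offset)  # never scroll above the first line

def goto_line_centered(editor, doc_line, fraction=0.5):
    """Move the caret to doc_line (0-based) and place it at a fixed spot in the viewport."""
    editor.gotoLine(doc_line)
    # Convert to a display line so folded or wrapped lines are accounted for.
    display_line = editor.visibleFromDocLine(doc_line)
    top = first_visible_line_for(display_line, editor.linesOnScreen(), fraction)
    editor.setFirstVisibleLine(top)
```

With `fraction=0.5` the target lands near the middle of the view; `fraction=0` would pin it to the top line every time.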

                                      • M
                                        mkupper @Alan Kilborn
                                        last edited by PeterJones Nov 22, 2023, 8:49 PM Nov 22, 2023, 7:41 PM

                                        @Alan-Kilborn said in Find and Display *All* Duplicate Lines:

                                        The problem you are experiencing seems to go deeper than this, but I will say that there isn’t a guarantee as to where in the viewport a line that you are moving to will appear.

                                        Scintilla documentation for .gotoLine for example, says, “…scrolls the view (if needed) to make it (the line) visible”.

                                        I agree that the problem I’m experiencing is odd. I was thinking about what the FindAndDisplayAllDuplicateLines.py script could be racing with and so disabled TextFX, which I found I had finally weaned myself from using. That did not help the selection issue.

                                        I tried to detect a pattern of the top or bottom of the viewport and did not see one while also getting intermittent selections. My current installation of Notepad++ seems to be nearly bare-bones. I think the only non-default plugin is PythonScript.

                                        The next step for me is to set up a fresh portable installation.

                                        • P
                                          PeterJones @mkupper
                                          last edited by Nov 22, 2023, 8:55 PM

                                          Moderator note: When typing names of files with two-letter extensions, some extensions map to known TopLevelDomains, which makes NodeBB linkify those filenames as URLs.

                                          So the “search results” .sr suffix is trying to linkify to domains assigned to Suriname, and .py suffix is trying to linkify domains assigned to Paraguay. Whether or not those domains actually exist, it’s not a good idea to link to them: spam bots that are crawling the web see links to non-existent domains, and they may try to buy those domains and put nefarious websites behind those links just to get a few more victims.

                                          I have red-texted the links I noticed in this discussion… but when you are previewing your post, if you see something in link color that you don’t expect, please go back and red-textify it, so that I don’t have to.
