sort file removing duplicates possible?

  • C
    Claudia Frank
    last edited by Claudia Frank Jun 2, 2018, 2:50 AM Jun 2, 2018, 2:49 AM

    You can find the 1st version of the script here .

    Apart from the obvious one, that the Python Script plugin must be installed, there are two requirements that need to be fulfilled to make it run.

    1.) Be sure you have either installed the full package or downloaded and unzipped the TclTk package into the NPP_INSTALL_DIR. Latest releases

    2.) To make the “accent insensitive” feature work, you need to install a Python library called unidecode.
    Unzip the .whl package into NPP_INSTALL_DIR\plugins\lib\

    To check both requirements, open the python script console and do the following commands

     import Tkinter
     import unidecode
    

    If you don’t see any errors - done.
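The two-import check above can also be run as one snippet that reports every missing module instead of stopping at the first ImportError. This is only a minimal sketch and not part of the original script; the helper name `missing_modules` is hypothetical, and it relies on nothing but the standard `importlib` module.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# The two modules the script needs under PythonScript's Python 2.7.
# An empty list means both requirements are met.
print(missing_modules(["Tkinter", "unidecode"]))
```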
    Usage is simple - run the script and check the different options.

    What should work is

    • sort/delete duplicates on whole text (aka nothing is selected)
    • sort/delete duplicates on vertically selected text

    not supported yet:

    • sort/delete duplicates on rectangular selection

    Cheers
    Claudia

    Btw. I spent most of the time creating this ugly window - so if someone wants to create a nicer GUI - please go for it. I’m not really good at designing UIs.

    • P
      patrickdrd
      last edited by Jun 2, 2018, 7:13 AM

      tkinter doesn’t work:

      Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
      Initialisation took 219ms
      Ready.

      import Tkinter
      Traceback (most recent call last):
      File "<console>", line 1, in <module>
      ImportError: No module named Tkinter
      Traceback (most recent call last):
      File "D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts\Sorter.py", line 5, in <module>
      import Tkinter as tk
      ImportError: No module named Tkinter

      • P
        patrickdrd
        last edited by Jun 2, 2018, 8:39 AM

        I found out something else, about that easylist file,
        textfx’s case insensitive sort results in 69234 entries,
        which is the same as ue’s result!

        • G
          guy038
          last edited by guy038 Jun 2, 2018, 10:43 AM Jun 2, 2018, 10:36 AM

          Hello, @patrickdrd, and All,

          Well, I must complete my previous post !

          • Firstly, I realized that your list, below, is constantly updated ( Last modified: 02 Jun 2018 08:09 UTC )

          https://easylist.to/easylist/easylist.txt

          So, today, this list contains 69917 lines


          • Secondly, when performing the regex S/R, we must consider both the sensitive and the insensitive search => The two search regexes :

          Regex A : (?-is)(^.+\R)\1+

          Regex B : (?i-s)(^.+\R)\1+

          give, after sorting and removing duplicates with the regex, a file containing :

          A 69852 lines ( so, 65 lines deleted, in 56 matches )

          B 69817 lines ( so, 100 lines deleted, in 88 matches )
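As a rough illustration of the two regexes, here is a Python sketch of the "sort, then collapse consecutive duplicates" step. It is not the Notepad++ search itself: Python's `re` module has no `\R`, so `\n` stands in for the line break, and the sample list is made up for the example.

```python
import re

# Hypothetical sample data, not taken from the easylist file.
SAMPLE = ["abc", "ABC", "abc", "Abc", "def", "abc"]

# Python's `re` has no \R, so "\n" replaces it here.
REGEX_A = re.compile(r"^(.+\n)\1+", re.MULTILINE)                  # sensitive
REGEX_B = re.compile(r"^(.+\n)\1+", re.MULTILINE | re.IGNORECASE)  # insensitive


def sort_then_dedupe(lines, regex):
    """Sort by code point, then collapse runs of consecutive equal lines."""
    text = "".join(line + "\n" for line in sorted(lines))
    return regex.sub(r"\1", text).splitlines()


print(sort_then_dedupe(SAMPLE, REGEX_A))  # ['ABC', 'Abc', 'abc', 'def']
print(sort_then_dedupe(SAMPLE, REGEX_B))  # ['ABC', 'def']
```

The insensitive variant keeps whichever spelling happens to come first in the sorted text, which is why the surviving line depends on the sort that ran before it.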


          • Thirdly, we must also take into account the possibility that the sort itself is run in a sensitive or insensitive way !

          Natively, Notepad++ sorts text according to the Unicode value ( code-point ) of characters ( a kind of sensitive sort ! ), whereas some other text editors may offer both case options, leading to different results !

          For instance, using the RJ TextEd software, here are the differences with a simple list of three-characters strings ( 1 x ‘ABC’, 2 x ‘AbC’, 3 x ‘Abc’,1 x ‘DEF’, 3 x ‘DEf’, 2 x ‘aBC’, 3 x ‘aBc’, 3 x ‘dEF’ and 3 x ‘def’ )

                          •-----------------------•---------------------------•
                          |    with Notepad++     |      with RJ TextEd       |
          •---------------•-----------------------•---------------------------•
          |  Before Sort  |  After UNICODE Sort   |   After SENSITIVE Sort    |
          •---------------•-----------------------•---------------------------•
          |      Abc      |          ABC          |            AbC            |
          |      Abc      |          AbC          |            aBC            |
          |      DEf      |          AbC          |            aBc            |
          |      AbC      |          Abc          |            Abc            |
          |      aBc      |          Abc          |            ABC            |
          |      dEF      |          Abc          |            aBC            |
          |      aBc      |          DEF          |            Abc            |
          |      DEf      |          DEf          |            AbC            |
          |      aBC      |          DEf          |            aBc            |
          |      def      |          DEf          |            aBc            |
          |      AbC      |          aBC          |            Abc            |
          |      def      |          aBC          |            DEf            |
          |      def      |          aBc          |            dEF            |
          |      ABC      |          aBc          |            dEF            |
          |      Abc      |          aBc          |            DEf            |
          |      DEF      |          dEF          |            dEF            |
          |      aBc      |          dEF          |            DEf            |
          |      dEF      |          dEF          |            def            |
          |      dEF      |          def          |            DEF            |
          |      DEf      |          def          |            def            |
          |      aBC      |          def          |            def            |
          •---------------•-----------------------•---------------------------•
          

          So, it’s easy to understand that removing consecutive duplicates, after the sort, with the regexes above, will necessarily give totally different results, depending on the software used :-(
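The effect can be reproduced outside any editor. The sketch below takes the 21 strings from the "Before Sort" column and removes consecutive duplicates, ignoring case, after two different sorts. It uses Python's built-in stable sort as a stand-in, which is an assumption and not RJ TextEd's actual algorithm; `dedupe_consecutive` is a hypothetical helper name.

```python
from itertools import groupby

# The "Before Sort" column of the table above, top to bottom.
LINES = ("Abc Abc DEf AbC aBc dEF aBc DEf aBC def AbC "
         "def def ABC Abc DEF aBc dEF dEF DEf aBC").split()


def dedupe_consecutive(lines, key=None):
    """Keep the first line of every run of consecutive lines with equal keys."""
    return [next(run) for _, run in groupby(lines, key=key)]


by_code_point = sorted(LINES)                 # like the Notepad++ sort
ignoring_case = sorted(LINES, key=str.lower)  # stable, case-insensitive sort

print(dedupe_consecutive(by_code_point, key=str.lower))  # ['ABC', 'DEF', 'aBC', 'dEF']
print(dedupe_consecutive(ignoring_case, key=str.lower))  # ['Abc', 'DEf']
```

After the code-point sort, the upper-case and lower-case spellings end up in separate blocks, so a comparison of neighbouring lines cannot merge them, while the case-insensitive sort brings them together first.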


          • Fourthly, a sort may give different results after being run several times, one after another. For instance, with RJ TextEd, running the insensitive sort 6 times on the three-character list above, I was left with 3 sets of data ( 4 times a list identical to the sensitive sort, plus two other lists !! ) Luckily, as for Notepad++, its Unicode sort always gives identical results :-))

          That’s why, @patrickdrd, it’s finally very difficult to compare results between different software, as each one has its own behavior !

          Cheers,

          guy038

          P.S. :

          Here are the results of my tests :

          1) With Notepad++ and RJ TextEd, using sensitive sort :

          •---------------------------------------------------------------------------------------------------•
          |                                      with Notepad++ Features                                      |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
          |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |      Abc      |           ABC           |            ABC            |             ABC             |
          |      Abc      |           AbC           |            AbC            |             DEF             |
          |      DEf      |           AbC           |            Abc            |             aBC             |
          |      AbC      |           Abc           |            DEF            |             dEF             |
          |      aBc      |           Abc           |            DEf            |                             |
          |      dEF      |           Abc           |            aBC            |                             |
          |      aBc      |           DEF           |            aBc            |                             |
          |      DEf      |           DEf           |            dEF            |                             |
          |      aBC      |           DEf           |            def            |                             |
          |      def      |           DEf           |                           |                             |
          |      AbC      |           aBC           |                           |                             |
          |      def      |           aBC           |                           |                             |
          |      def      |           aBc           |                           |                             |
          |      ABC      |           aBc           |                           |                             |
          |      Abc      |           aBc           |                           |                             |
          |      DEF      |           dEF           |                           |                             |
          |      aBc      |           dEF           |                           |                             |
          |      dEF      |           dEF           |                           |                             |
          |      dEF      |           def           |                           |                             |
          |      DEf      |           def           |                           |                             |
          |      aBC      |           def           |                           |                             |
          •---------------•-------------------------•---------------------------•-----------------------------•
          
          
          •---------------------------------------------------------------------------------------------------•
          |                                       with RJ TextEd Features                                     |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
          |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |      Abc      |           AbC           |            AbC            |             AbC             |
          |      Abc      |           aBC           |            aBC            |             DEf             |
          |      DEf      |           aBc           |            aBc            |                             |
          |      AbC      |           Abc           |            Abc            |                             |
          |      aBc      |           ABC           |            ABC            |                             |
          |      dEF      |           aBC           |            aBC            |                             |
          |      aBc      |           Abc           |            Abc            |                             |
          |      DEf      |           AbC           |            AbC            |                             |
          |      aBC      |           aBc           |            aBc            |                             |
          |      def      |           aBc           |            Abc            |                             |
          |      AbC      |           Abc           |            DEf            |                             |
          |      def      |           DEf           |            dEF            |                             |
          |      def      |           dEF           |            DEf            |                             |
          |      ABC      |           dEF           |            dEF            |                             |
          |      Abc      |           DEf           |            DEf            |                             |
          |      DEF      |           dEF           |            def            |                             |
          |      aBc      |           DEf           |            DEF            |                             |
          |      dEF      |           def           |            def            |                             |
          |      dEF      |           DEF           |                           |                             |
          |      DEf      |           def           |                           |                             |
          |      aBC      |           def           |                           |                             |
          •---------------•-------------------------•---------------------------•-----------------------------•
          

          2) When running, several times, an insensitive sort, with RJ TextEd, I obtained 3 different lists :

          • The first one was identical to the table just above, which uses a sensitive sort

          • The two others are listed below !

          •---------------------------------------------------------------------------------------------------•
          |                                       with RJ TextEd Features                                     |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |  Before Sort  |  After INsensitive Sort |  After Sensitive Regex +  |  After INsensitive Regex +  |
          |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |      Abc      |          ABC            |            ABC            |             ABC             |
          |      Abc      |          AbC            |            AbC            |             dEF             |
          |      DEf      |          aBC            |            aBC            |                             |
          |      AbC      |          aBC            |            aBc            |                             |
          |      aBc      |          aBc            |            Abc            |                             |
          |      dEF      |          Abc            |            AbC            |                             |
          |      aBc      |          AbC            |            aBc            |                             |
          |      DEf      |          aBc            |            Abc            |                             |
          |      aBC      |          Abc            |            aBc            |                             |
          |      def      |          Abc            |            dEF            |                             |
          |      AbC      |          aBc            |            DEf            |                             |
          |      def      |          dEF            |            def            |                             |
          |      def      |          dEF            |            dEF            |                             |
          |      ABC      |          DEf            |            DEF            |                             |
          |      Abc      |          DEf            |            def            |                             |
          |      DEF      |          DEf            |                           |                             |
          |      aBc      |          def            |                           |                             |
          |      dEF      |          def            |                           |                             |
          |      dEF      |          dEF            |                           |                             |
          |      DEf      |          DEF            |                           |                             |
          |      aBC      |          def            |                           |                             |
          •---------------•-------------------------•---------------------------•-----------------------------•
          
          
          •---------------------------------------------------------------------------------------------------•
          |                                       with RJ TextEd Features                                     |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |  Before Sort  |  After INsensitive Sort |  After Sensitive Regex +  |  After INsensitive Regex +  |
          |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
          •---------------•-------------------------•---------------------------•-----------------------------•
          |      Abc      |          ABC            |           ABC             |            ABC              |
          |      Abc      |          AbC            |           AbC             |            dEF              |
          |      DEf      |          aBC            |           aBC             |                             |
          |      AbC      |          aBC            |           aBc             |                             |
          |      aBc      |          aBc            |           Abc             |                             |
          |      dEF      |          Abc            |           aBc             |                             |
          |      aBc      |          aBc            |           AbC             |                             |
          |      DEf      |          AbC            |           Abc             |                             |
          |      aBC      |          Abc            |           aBc             |                             |
          |      def      |          Abc            |           dEF             |                             |
          |      AbC      |          aBc            |           DEf             |                             |
          |      def      |          dEF            |           def             |                             |
          |      def      |          dEF            |           DEF             |                             |
          |      ABC      |          dEF            |           DEf             |                             |
          |      Abc      |          DEf            |                           |                             |
          |      DEF      |          DEf            |                           |                             |
          |      aBc      |          def            |                           |                             |
          |      dEF      |          def            |                           |                             |
          |      dEF      |          def            |                           |                             |
          |      DEf      |          DEF            |                           |                             |
          |      aBC      |          DEf            |                           |                             |
          •---------------•-------------------------•---------------------------•-----------------------------•
          
          • C
            Claudia Frank
            last edited by Jun 2, 2018, 12:22 PM

            Patrick, did you download and unzip the TclTk into the
            NPP_INSTALL_DIR ? (in your case into D:\Utilities\PortableApps\Notepad++)

            If so, can you run the following in the python script console

            import sys; print '\n'.join(sys.path) 
            

            and post the output?

            Did the unidecode library installation work?

            Cheers
            Claudia

            • S
              Scott Sumner @patrickdrd
              last edited by Jun 2, 2018, 12:23 PM

              This post is deleted!
              • P
                patrickdrd
                last edited by Jun 2, 2018, 1:50 PM

                yes, unidecode works fine, import command works

                • C
                  Claudia Frank @patrickdrd
                  last edited by Jun 2, 2018, 1:59 PM

                  @patrickdrd

                  what does the sys.path report?

                  Cheers
                  Claudia

                  • P
                    patrickdrd
                    last edited by Jun 2, 2018, 2:43 PM

                    D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib
                    D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\lib
                    D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts
                    D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\scripts
                    D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk
                    D:\Utilities\PortableApps\Notepad++\python27.zip
                    D:\Utilities\PortableApps\Notepad++\DLLs
                    D:\Utilities\PortableApps\Notepad++\lib
                    D:\Utilities\PortableApps\Notepad++\lib\plat-win
                    D:\Utilities\PortableApps\Notepad++\lib\lib-tk
                    D:\Utilities\PortableApps\Notepad++

                    • C
                      Claudia Frank @patrickdrd
                      last edited by Jun 2, 2018, 3:04 PM

                      The correct one is

                      D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk

                      but those

                      D:\Utilities\PortableApps\Notepad++\lib
                      D:\Utilities\PortableApps\Notepad++\lib\plat-win
                      D:\Utilities\PortableApps\Notepad++\lib\lib-tk

                      are strange, could it be that you unzipped only part of tk packages into
                      D:\Utilities\PortableApps\Notepad++\ ?

                      Can you check if you have the following files under D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk

                      Canvas.py
                      Dialog.py
                      FileDialog.py
                      FixTk.py
                      ScrolledText.py
                      SimpleDialog.py
                      Tix.py
                      tkColorChooser.py
                      tkCommonDialog.py
                      Tkconstants.py
                      Tkdnd.py
                      tkFileDialog.py
                      tkFont.py
                      Tkinter.py
                      tkMessageBox.py
                      tkSimpleDialog.py
                      ttk.py
                      turtle.py

                      You might see additional files with extension pyc - that’s ok.

                      If you do have the files, delete the D:\Utilities\PortableApps\Notepad++\lib directory.
                      If you don’t have the files under D:\Utilities\PortableApps\Notepad++\lib\lib-tk but
                      within D:\Utilities\PortableApps\Notepad++\lib then cut D:\Utilities\PortableApps\Notepad++\lib and paste it into
                      D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\
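Instead of checking the folders by hand in Explorer, the same question can be asked from the Python Script console. The following is a small sketch under the assumption that it is enough to look for `Tkinter.py` directly inside each `sys.path` entry; `dirs_containing` is a hypothetical helper name.

```python
import os
import sys


def dirs_containing(filename, search_path=None):
    """Return the entries of `search_path` that hold a file called `filename`."""
    if search_path is None:
        search_path = sys.path
    return [d for d in search_path if os.path.isfile(os.path.join(d, filename))]


# Under PythonScript (Python 2.7) the pure-Python part of Tk is Tkinter.py.
# An empty list means no sys.path entry contains it.
print(dirs_containing("Tkinter.py"))
```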

                      Cheers
                      Claudia

                      • P
                        patrickdrd
                        last edited by Jun 2, 2018, 3:34 PM

                        I can’t find either D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk or D:\Utilities\PortableApps\Notepad++\lib folder in explorer!

                        • C
                          Claudia Frank @patrickdrd
                          last edited by Jun 2, 2018, 3:58 PM

                          so how did you install Tcl/Tk libraries?

                          Cheers
                          Claudia

                          • P
                            patrickdrd
                            last edited by Jun 2, 2018, 4:01 PM

                            I extracted the zip of course, the folder you say is in:
                            d:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk\

                            both in zip file and my explorer!

                            • P
                              patrickdrd
                              last edited by Jun 2, 2018, 4:22 PM

                              I’ve just read guy038’s post and I’m more confused :S

                              I downloaded the file again and now it’s Last modified: 02 Jun 2018 16:00 UTC
                              and 69930 results,
                              sorting with insensitive (ue and textfx) yields 69284 and the output should be similar,
                              so I should be satisfied by that consensus I guess?

                              • C
                                Claudia Frank @patrickdrd
                                last edited by Jun 2, 2018, 4:34 PM

                                @patrickdrd

                                the easylist file is an adblocker file, so it will change constantly.

                                Regarding the Tcl/Tk installation - you should have unzipped it into
                                D:\Utilities\PortableApps\Notepad++\ directory.

                                The zip contains the complete folder hierarchy - as you see on the left side (archive tree)

                                If you did this, you would normally have got a message saying that the plugins folder already exists and
                                asking whether you want to overwrite it -> you answered this with yes, didn’t you?

                                Cheers
                                Claudia

                                • P
                                  patrickdrd
                                  last edited by Jun 2, 2018, 4:37 PM

                                  yep, that’s what I got https://imgur.com/a/xNQB5Gn

                                  • C
                                    Claudia Frank @patrickdrd
                                    last edited by Jun 2, 2018, 5:36 PM

                                    @patrickdrd

                                    took some time to understand the difference.
                                    You do have
                                    …\Notepad++\plugins\PythonScript\lib\tcl\lib-tk
                                    where I do have
                                    …\Notepad++\plugins\PythonScript\lib\lib-tk

                                    so the error makes sense as it can’t be found in …\lib\lib-tk

                                    You could try to add the following to your user startup.py script

                                    import sys
                                    sys.path.append(r'D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk')
                                    

                                    and restart npp and do another import Tkinter test.
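A slightly more defensive variant of the startup.py lines above might look like the sketch below: it only appends the directory when it really exists and is not already listed. This is an assumption about what is convenient, not something the plugin requires, and `add_to_sys_path` is a hypothetical helper name.

```python
import os
import sys


def add_to_sys_path(directory, path_list=None):
    """Append `directory` to the path list if it exists and is not listed yet.

    Returns True when the directory was appended.
    """
    if path_list is None:
        path_list = sys.path
    if os.path.isdir(directory) and directory not in path_list:
        path_list.append(directory)
        return True
    return False


# Path taken from the post above; adjust it to the local installation.
add_to_sys_path(r'D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk')
```

A return value of False for an expected folder points to a typo in the path or to a different unzip location.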

                                    Cheers
                                    Claudia

                                    • P
                                      patrickdrd
                                      last edited by Jun 2, 2018, 6:25 PM

                                      ok thanks, first thing tomorrow with the morning coffee
                                      :-D

                                      • G
                                        guy038
                                        last edited by guy038 Jun 2, 2018, 6:44 PM Jun 2, 2018, 6:41 PM

                                        Hi, @patrickdrd, and All,

                                        Just for info, doing again my tests ( Last modified: 02 Jun 2018 17:23 UTC ) with N++ sort, followed by the regex S/R, I obtained :

                                        • Original file : 69931 lines

                                        • With regex A ( sensitive ) : 69852 lines ( so, 65 lines deleted, in 56 matches )

                                        • With regex B ( insensitive ) : 69817 lines ( so, 100 lines deleted, in 88 matches )

                                        • With TextFX sort, at column 1, with the option Sort outputs only UNIQUE (at column) lines, it produced a 69285 lines file ( so, 646 lines deleted )


                                        Now, @patrickdrd, I don’t want you to be confused by my explanations ! I just pointed out that, depending on the method and software used to remove duplicate lines, you must expect different results ! My solution, using first the N++ sort and then a regex S/R, may not give the correct results, because it relies on comparing consecutive lines, previously sorted !

                                        And, probably, the TextFX plugin ( whose sort logic I don’t know ! ), other software, and the latest sorter.py script, from @claudia-frank, will produce better results :-))

                                        Finally, I’m thinking that a correct script to delete duplicate lines should not rely on any sort and should just compare two individual lines at a time ! In other words, each line should simply be compared, successively, with every other line of the file !
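A sort-free approach along these lines could be sketched as follows. Note the swap: rather than literally comparing every pair of lines, the sketch remembers the lines already seen in a set, which gives the same result in a single pass. The function name and the sample list are hypothetical, and this is not the sorter.py script.

```python
def remove_duplicates(lines, key=None):
    """Keep the first occurrence of every line, preserving the input order."""
    seen = set()
    unique = []
    for line in lines:
        marker = key(line) if key else line
        if marker not in seen:
            seen.add(marker)
            unique.append(line)
    return unique


sample = ["Abc", "DEf", "AbC", "Abc", "def", "DEf"]
print(remove_duplicates(sample))                 # ['Abc', 'DEf', 'AbC', 'def']
print(remove_duplicates(sample, key=str.lower))  # ['Abc', 'DEf']
```

Because no sort is involved, the result does not depend on any editor's sort order, and the `key` argument decides what counts as "the same line".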

                                        Cheers,

                                        guy038

                                        P.S. :

                                        BTW, I confirm that the TextFX sort tool, like the N++ sort, seems stable : after running it 5 times, the output files produced are totally identical :-)

                                        • P
                                          patrickdrd
                                          last edited by Jun 3, 2018, 7:12 AM

                                          @Claudia-Frank still doesn’t work:

                                          File "<console>", line 1, in <module>
                                          File "D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk\Tkinter.py", line 39, in <module>
                                          import _tkinter # If this fails your Python may not be configured for Tk
                                          ImportError: No module named _tkinter
