sort file removing duplicates possible?

  • C
    Claudia Frank
    last edited by Jun 1, 2018, 12:58 PM

    @Patrick

    Sorry, I don’t know the term “accent insensitive”. What does it mean?
    For example, that è is treated the same as e?

    Can you provide example data (just need a couple of lines) to see if it is working correctly?
    The speed test I will do with the easylist text.

    Cheers
    Claudia

    • S
      Scott Sumner
      last edited by Jun 1, 2018, 1:01 PM

      I can envision the following specification for a general purpose script (goes beyond what @patrickdrd has asked for):

      • sort lines case sensitive, keep duplicate lines
      • sort lines case sensitive, remove duplicate lines
      • sort lines case sensitive, keep duplicate lines, reverse order on the sort
      • sort lines case sensitive, remove duplicate lines, reverse order on the sort
      • sort lines case insensitive, keep duplicate lines
      • sort lines case insensitive, remove duplicate lines
      • sort lines case insensitive, keep duplicate lines, reverse order on the sort
      • sort lines case insensitive, remove duplicate lines, reverse order on the sort
      • no sort, remove duplicate lines (case sensitive)
      • no sort, remove duplicate lines (case insensitive)

      A selection active when invoking should define:

      • lines to be affected (only those touched by selection vertically); act on all lines if no selection
      • columns to be used as the sort key (if rectangular selection use the selected columns as the sort key; if zero-width rect selection consider the key to start at the selection column out to the end of the variable-length lines)

      Probably I’ve forgotten something important to this “sort” of thing…

      :-D
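As a rough illustration of the option matrix listed above, here is a minimal sketch in plain Python. It is not Claudia's script; the function name `process_lines` and its parameters are made up for this example, `str.lower()` is assumed to be good enough for "case insensitive", and the selection/column handling from the spec is left out.

```python
def process_lines(lines, sort=True, case_sensitive=True,
                  remove_duplicates=False, reverse=False):
    """Sort and/or de-duplicate a list of lines (hypothetical helper)."""
    # The same key is used for comparing duplicates and for sorting.
    if case_sensitive:
        key = lambda s: s
    else:
        key = lambda s: s.lower()

    if remove_duplicates:
        seen = set()
        unique = []
        for line in lines:
            k = key(line)
            if k not in seen:          # keep only the first occurrence
                seen.add(k)
                unique.append(line)
        lines = unique

    if sort:
        lines = sorted(lines, key=key, reverse=reverse)
    return list(lines)


sample = ["b", "A", "a", "B", "a"]
print(process_lines(sample, remove_duplicates=True))
print(process_lines(sample, sort=False, case_sensitive=False,
                    remove_duplicates=True))
```

Each of the ten bullet points then corresponds to one combination of the four keyword arguments.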

      • P
        patrickdrd @Claudia Frank
        last edited by Jun 1, 2018, 1:09 PM

        @Claudia-Frank said:

        @Patrick

        Sorry, I don’t know the term “accent insensitive”. What does it mean?
        For example, that è is treated the same as e?

        Can you provide example data (just need a couple of lines) to see if it is working correctly?
        The speed test I will do with the easylist text.

        Cheers
        Claudia

        yes, exactly that
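To make "accent insensitive" concrete, here is a small sketch that treats è and e as the same when looking for duplicates. It uses the standard `unicodedata` module instead of the unidecode library mentioned in this thread, which is a substitution on my part, and it assumes Python 3 (under Python 2.7 the strings would have to be unicode objects). The helper names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (è becomes e)."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dedupe_accent_insensitive(lines):
    """Keep the first of any lines that differ only by accents."""
    seen = set()
    kept = []
    for line in lines:
        key = strip_accents(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


print(dedupe_accent_insensitive(["cafe", "café", "père", "pere"]))
# ['cafe', 'père']
```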

        • C
          Claudia Frank
          last edited by Jun 1, 2018, 1:26 PM

          OK - let’s see what we can do.

          Cheers
          Claudia

          • S
            Scott Sumner @Scott Sumner
            last edited by Jun 1, 2018, 1:30 PM

            @Scott-Sumner

            …the following spec…

            Hey Scott! You did forget some things! How about when removing duplicates, we need the options to:

            • keep one occurrence of a duplicated line (when sorting)
            • keep no occurrences of a duplicated line (when sorting or not sorting)
            • keep LAST occurrence of a duplicated line (when not sorting)
            • keep FIRST occurrence of a duplicated line (when not sorting)
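
The "not sorting" variants above could look roughly like the sketch below. The function name `dedupe` and the `keep` values are invented for illustration; this is case sensitive only and says nothing about how the real script handles it.

```python
from collections import Counter


def dedupe(lines, keep="first"):
    """Remove duplicate lines without sorting (hypothetical helper).

    keep="first": keep the first occurrence of each line
    keep="last":  keep the last occurrence of each line
    keep="none":  drop every line that occurs more than once
    """
    if keep == "none":
        counts = Counter(lines)
        return [line for line in lines if counts[line] == 1]
    if keep == "last":
        # Walk backwards keeping first-seen lines, then restore the order.
        return dedupe(lines[::-1], keep="first")[::-1]
    if keep != "first":
        raise ValueError("keep must be 'first', 'last' or 'none'")
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


sample = ["a", "b", "a", "c", "b"]
print(dedupe(sample, "first"))  # ['a', 'b', 'c']
print(dedupe(sample, "last"))   # ['a', 'c', 'b']
print(dedupe(sample, "none"))   # ['c']
```

When sorting, "keep one occurrence" reduces to `sorted(set(lines))`, since the position of the surviving copy no longer matters.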
            • C
              Claudia Frank @Scott Sumner
              last edited by Jun 1, 2018, 1:32 PM

              @Scott-Sumner

              Ahh, sorry, too late. The specs are already defined for version 1; you need to open a feature request for version 2 :-D

              Cheers
              Claudia

              • C
                Claudia Frank
                last edited by Claudia Frank Jun 2, 2018, 2:50 AM Jun 2, 2018, 2:49 AM

                You can find the 1st version of the script here.

                Apart from the obvious one, that you need to have the Python Script plugin installed, there are two requirements that need to be fulfilled to make it run.

                1.) Be sure you have either installed the full package, or downloaded and unzipped the TclTk package into the NPP_INSTALL_DIR (latest releases).

                2.) To make the “accent insensitive” feature work, you need to install a Python library called unidecode.
                Unzip the .whl package into NPP_INSTALL_DIR\plugins\lib\

                To check both requirements, open the Python Script console and run the following commands

                 import Tkinter
                 import unidecode
                

                If you don’t see any errors - done.
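
The same check can be scripted so that it reports a status instead of a traceback. This is only a sketch: `check_modules` is a made-up helper name, and the module names are the ones from the two import lines above (on a Python 3 interpreter the first would be spelled `tkinter`).

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Names as used by the Python 2.7 based plugin discussed in this thread.
for name, ok in sorted(check_modules(["Tkinter", "unidecode"]).items()):
    print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```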
                Usage is simple - run the script and check the different options.

                What should work is

                • sort/delete duplicates on whole text (aka nothing is selected)
                • sort/delete duplicates on vertically selected text

                not supported yet:

                • sort/delete duplicates on rectangular selection

                Cheers
                Claudia

                Btw. I spent most of the time creating this ugly window - so if someone wants to create a nicer GUI - please go for it. I’m not really good at designing UIs.

                • P
                  patrickdrd
                  last edited by Jun 2, 2018, 7:13 AM

                  tkinter doesn’t work:

                  Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
                  Initialisation took 219ms
                  Ready.

                  import Tkinter
                  Traceback (most recent call last):
                    File "<console>", line 1, in <module>
                  ImportError: No module named Tkinter
                  Traceback (most recent call last):
                    File "D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts\Sorter.py", line 5, in <module>
                      import Tkinter as tk
                  ImportError: No module named Tkinter

                  • P
                    patrickdrd
                    last edited by Jun 2, 2018, 8:39 AM

                    I found out something else, about that easylist file,
                    textfx’s case insensitive sort results in 69234 entries,
                    which is the same as ue’s result!

                    • G
                      guy038
                      last edited by guy038 Jun 2, 2018, 10:43 AM Jun 2, 2018, 10:36 AM

                      Hello, @patrickdrd, and All,

                      Well, I must complete my previous post !

                      • Firstly, I realized that your list, below, is constantly updated ( Last modified: 02 Jun 2018 08:09 UTC )

                      https://easylist.to/easylist/easylist.txt

                      So, today, this list contains 69917 lines


                      • Secondly, when performing the regex S/R, we must consider both the case-sensitive and the case-insensitive search. The two search regexes:

                      Regex A : (?-is)(^.+\R)\1+

                      Regex B : (?i-s)(^.+\R)\1+

                      give, after sorting and removing duplicates with the regex, a file containing :

                      A 69852 lines ( so, 65 lines deleted, in 56 matches )

                      B 69817 lines ( so, 100 lines deleted, in 88 matches )
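
For readers without Notepad++ at hand, the effect of Regex A and Regex B can be approximated with Python's `re` module. This is a sketch under assumptions: Python's `re` has no `\R`, so lines are assumed to end in `\n`; the replacement `\1` (keep the first line of each run) is my assumption, since only the search part is quoted here; and the function name is hypothetical.

```python
import re

# Python equivalent of (^.+\R)\1+ for text whose lines end in "\n".
PATTERN = r"(^.+\n)\1+"


def remove_consecutive_duplicates(text, ignore_case=False):
    """Collapse each run of identical consecutive lines to its first line."""
    flags = re.MULTILINE
    if ignore_case:
        # With IGNORECASE the backreference \1 also matches case variants.
        flags |= re.IGNORECASE
    return re.sub(PATTERN, r"\1", text, flags=flags)


text = "Abc\nAbc\nabc\nDEF\n"
print(repr(remove_consecutive_duplicates(text)))
# 'Abc\nabc\nDEF\n'
print(repr(remove_consecutive_duplicates(text, ignore_case=True)))
# 'Abc\nDEF\n'
```

Note that a last line without a trailing line break is never matched by this pattern, so it would survive even if it duplicates the line before it.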


                      • Thirdly, we must also take into account the possibility that the sort itself is run in a case-sensitive or case-insensitive way!

                      Natively, Notepad++ sorts text according to the Unicode value (code point) of each character (a kind of case-sensitive sort!), whereas some other text editors may offer both case options, leading to different results!

                      For instance, using the RJ TextEd software, here are the differences with a simple list of three-character strings ( 1 x ‘ABC’, 2 x ‘AbC’, 3 x ‘Abc’, 1 x ‘DEF’, 3 x ‘DEf’, 2 x ‘aBC’, 3 x ‘aBc’, 3 x ‘dEF’ and 3 x ‘def’ )

                                      •-----------------------•---------------------------•
                                      |    with Notepad++     |      with RJ TextEd       |
                      •---------------•-----------------------•---------------------------•
                      |  Before Sort  |  After UNICODE Sort   |   After SENSITIVE Sort    |
                      •---------------•-----------------------•---------------------------•
                      |      Abc      |          ABC          |            AbC            |
                      |      Abc      |          AbC          |            aBC            |
                      |      DEf      |          AbC          |            aBc            |
                      |      AbC      |          Abc          |            Abc            |
                      |      aBc      |          Abc          |            ABC            |
                      |      dEF      |          Abc          |            aBC            |
                      |      aBc      |          DEF          |            Abc            |
                      |      DEf      |          DEf          |            AbC            |
                      |      aBC      |          DEf          |            aBc            |
                      |      def      |          DEf          |            aBc            |
                      |      AbC      |          aBC          |            Abc            |
                      |      def      |          aBC          |            DEf            |
                      |      def      |          aBc          |            dEF            |
                      |      ABC      |          aBc          |            dEF            |
                      |      Abc      |          aBc          |            DEf            |
                      |      DEF      |          dEF          |            dEF            |
                      |      aBc      |          dEF          |            DEf            |
                      |      dEF      |          dEF          |            def            |
                      |      dEF      |          def          |            DEF            |
                      |      DEf      |          def          |            def            |
                      |      aBC      |          def          |            def            |
                      •---------------•-----------------------•---------------------------•
                      

                      So, it’s easy to understand that removing consecutive duplicates after the sort, with the regexes above, will necessarily give totally different results, depending on the software used :-(
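
The Notepad++ column of the table above can be reproduced with a plain Python sort, because Python also compares strings by code point. The sketch below then removes consecutive duplicates with `itertools.groupby` (a stand-in for the regexes, not what Notepad++ does internally) to show how much the result depends on the sort that ran first. The function name is made up for the example.

```python
from itertools import groupby

# The 21 test lines, in the "Before Sort" order of the table above.
BEFORE = ("Abc Abc DEf AbC aBc dEF aBc DEf aBC def AbC "
          "def def ABC Abc DEF aBc dEF dEF DEf aBC").split()


def drop_consecutive_duplicates(lines, ignore_case=False):
    """Keep the first line of every run of equal neighbouring lines."""
    key = (lambda s: s.lower()) if ignore_case else None
    return [next(group) for _, group in groupby(lines, key=key)]


codepoint_sorted = sorted(BEFORE)                        # code-point order
folded_sorted = sorted(BEFORE, key=lambda s: s.lower())  # ignores case, stable

print(len(drop_consecutive_duplicates(codepoint_sorted)))
# 9
print(drop_consecutive_duplicates(codepoint_sorted, ignore_case=True))
# ['ABC', 'DEF', 'aBC', 'dEF']
print(drop_consecutive_duplicates(folded_sorted, ignore_case=True))
# ['Abc', 'DEf']
```

Same data, same "insensitive" duplicate removal, but 4 lines survive after a code-point sort and only 2 after a case-insensitive sort.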


                      • Fourthly, a sort may give different results when it is run several times, one after another. For instance, with RJ TextEd, running the insensitive sort 6 times on the three-character list above left me with 3 different results ( 4 times a list identical to the sensitive sort, plus two other lists !! ) Luckily, as for Notepad++, its Unicode sort always gives identical results :-)
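
Why RJ TextEd behaves this way is not known from this thread. One plausible cause, and this is only an assumption, is that lines which are equal when case is ignored have no defined order among themselves. A sketch of how a script can avoid that: sort case-insensitively, but use the exact text as a tie-breaker, so the result no longer depends on the input order. The function name is hypothetical.

```python
import random


def insensitive_sort(lines):
    """Case-insensitive sort with the exact text as a tie-breaker."""
    return sorted(lines, key=lambda s: (s.lower(), s))


lines = ["aBc", "Abc", "ABC", "def", "DEF", "AbC"]
expected = insensitive_sort(lines)

# Shuffling the input never changes the sorted output.
for _ in range(5):
    random.shuffle(lines)
    assert insensitive_sort(lines) == expected

print(expected)
# ['ABC', 'AbC', 'Abc', 'aBc', 'DEF', 'def']
```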

                      That’s why, @patrickdrd, it’s finally very difficult to compare results between different pieces of software, as each one has its own behavior!

                      Cheers,

                      guy038

                      P.S. :

                      Here are the results of my tests :

                      1) With Notepad++ and RJ TextEd, using sensitive sort :

                      •---------------------------------------------------------------------------------------------------•
                      |                                      with Notepad++ Features                                      |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
                      |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |      Abc      |           ABC           |            ABC            |             ABC             |
                      |      Abc      |           AbC           |            AbC            |             DEF             |
                      |      DEf      |           AbC           |            Abc            |             aBC             |
                      |      AbC      |           Abc           |            DEF            |             dEF             |
                      |      aBc      |           Abc           |            DEf            |                             |
                      |      dEF      |           Abc           |            aBC            |                             |
                      |      aBc      |           DEF           |            aBc            |                             |
                      |      DEf      |           DEf           |            dEF            |                             |
                      |      aBC      |           DEf           |            def            |                             |
                      |      def      |           DEf           |                           |                             |
                      |      AbC      |           aBC           |                           |                             |
                      |      def      |           aBC           |                           |                             |
                      |      def      |           aBc           |                           |                             |
                      |      ABC      |           aBc           |                           |                             |
                      |      Abc      |           aBc           |                           |                             |
                      |      DEF      |           dEF           |                           |                             |
                      |      aBc      |           dEF           |                           |                             |
                      |      dEF      |           dEF           |                           |                             |
                      |      dEF      |           def           |                           |                             |
                      |      DEf      |           def           |                           |                             |
                      |      aBC      |           def           |                           |                             |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      
                      
                      •---------------------------------------------------------------------------------------------------•
                      |                                       with RJ TextEd Features                                     |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
                      |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |      Abc      |           AbC           |            AbC            |             AbC             |
                      |      Abc      |           aBC           |            aBC            |             DEf             |
                      |      DEf      |           aBc           |            aBc            |                             |
                      |      AbC      |           Abc           |            Abc            |                             |
                      |      aBc      |           ABC           |            ABC            |                             |
                      |      dEF      |           aBC           |            aBC            |                             |
                      |      aBc      |           Abc           |            Abc            |                             |
                      |      DEf      |           AbC           |            AbC            |                             |
                      |      aBC      |           aBc           |            aBc            |                             |
                      |      def      |           aBc           |            Abc            |                             |
                      |      AbC      |           Abc           |            DEf            |                             |
                      |      def      |           DEf           |            dEF            |                             |
                      |      def      |           dEF           |            DEf            |                             |
                      |      ABC      |           dEF           |            dEF            |                             |
                      |      Abc      |           DEf           |            DEf            |                             |
                      |      DEF      |           dEF           |            def            |                             |
                      |      aBc      |           DEf           |            DEF            |                             |
                      |      dEF      |           def           |            def            |                             |
                      |      dEF      |           DEF           |                           |                             |
                      |      DEf      |           def           |                           |                             |
                      |      aBC      |           def           |                           |                             |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      

                      2) When running, several times, an insensitive sort, with RJ TextEd, I obtained 3 different lists :

                      • The first one was identical to the table just above, which uses a sensitive sort

                      • The two others are listed below !

                      •---------------------------------------------------------------------------------------------------•
                      |                                       with RJ TextEd Features                                     |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |  Before Sort  |  After INsensitive Sort |  After Sensitive Regex +  |  After INsensitive Regex +  |
                      |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |      Abc      |          ABC            |            ABC            |             ABC             |
                      |      Abc      |          AbC            |            AbC            |             dEF             |
                      |      DEf      |          aBC            |            aBC            |                             |
                      |      AbC      |          aBC            |            aBc            |                             |
                      |      aBc      |          aBc            |            Abc            |                             |
                      |      dEF      |          Abc            |            AbC            |                             |
                      |      aBc      |          AbC            |            aBc            |                             |
                      |      DEf      |          aBc            |            Abc            |                             |
                      |      aBC      |          Abc            |            aBc            |                             |
                      |      def      |          Abc            |            dEF            |                             |
                      |      AbC      |          aBc            |            DEf            |                             |
                      |      def      |          dEF            |            def            |                             |
                      |      def      |          dEF            |            dEF            |                             |
                      |      ABC      |          DEf            |            DEF            |                             |
                      |      Abc      |          DEf            |            def            |                             |
                      |      DEF      |          DEf            |                           |                             |
                      |      aBc      |          def            |                           |                             |
                      |      dEF      |          def            |                           |                             |
                      |      dEF      |          dEF            |                           |                             |
                      |      DEf      |          DEF            |                           |                             |
                      |      aBC      |          def            |                           |                             |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      
                      
                      •---------------------------------------------------------------------------------------------------•
                      |                                       with RJ TextEd Features                                     |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |  Before Sort  |  After INsensitive Sort |  After Sensitive Regex +  |  After INsensitive Regex +  |
                      |               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      |      Abc      |          ABC            |           ABC             |            ABC              |
                      |      Abc      |          AbC            |           AbC             |            dEF              |
                      |      DEf      |          aBC            |           aBC             |                             |
                      |      AbC      |          aBC            |           aBc             |                             |
                      |      aBc      |          aBc            |           Abc             |                             |
                      |      dEF      |          Abc            |           aBc             |                             |
                      |      aBc      |          aBc            |           AbC             |                             |
                      |      DEf      |          AbC            |           Abc             |                             |
                      |      aBC      |          Abc            |           aBc             |                             |
                      |      def      |          Abc            |           dEF             |                             |
                      |      AbC      |          aBc            |           DEf             |                             |
                      |      def      |          dEF            |           def             |                             |
                      |      def      |          dEF            |           DEF             |                             |
                      |      ABC      |          dEF            |           DEf             |                             |
                      |      Abc      |          DEf            |                           |                             |
                      |      DEF      |          DEf            |                           |                             |
                      |      aBc      |          def            |                           |                             |
                      |      dEF      |          def            |                           |                             |
                      |      dEF      |          def            |                           |                             |
                      |      DEf      |          DEF            |                           |                             |
                      |      aBC      |          DEf            |                           |                             |
                      •---------------•-------------------------•---------------------------•-----------------------------•
                      
                      • C
                        Claudia Frank
                        last edited by Jun 2, 2018, 12:22 PM

                        Patrick, did you download and unzip the TclTk package into the
                        NPP_INSTALL_DIR? (in your case into D:\Utilities\PortableApps\Notepad++)

                        If so, can you run the following in the python script console

                        import sys; print '\n'.join(sys.path) 
                        

                        and post the output?

                        Did the unidecode library installation work?
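
A slightly more telling variant of the sys.path dump is to also report whether each entry exists on disk, since an entry pointing at a folder that is not there silently contributes nothing to imports. This is a sketch with a made-up helper name, written so that it should run on both Python 2.7 and Python 3.

```python
import os
import sys


def describe_search_path(paths=None):
    """Return (path, exists) pairs for each entry of the module search path."""
    if paths is None:
        paths = sys.path
    # Entries may be directories or archive files such as python27.zip.
    return [(p, os.path.isdir(p) or os.path.isfile(p)) for p in paths]


for path, exists in describe_search_path():
    print("%-8s %s" % ("ok" if exists else "MISSING", path))
```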

                        Cheers
                        Claudia

                          • P
                            patrickdrd
                            last edited by Jun 2, 2018, 1:50 PM

                            yes, unidecode works fine, import command works

                            • C
                              Claudia Frank @patrickdrd
                              last edited by Jun 2, 2018, 1:59 PM

                              @patrickdrd

                              what does the sys.path report?

                              Cheers
                              Claudia

                              • P
                                patrickdrd
                                last edited by Jun 2, 2018, 2:43 PM

                                D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib
                                D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\lib
                                D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts
                                D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\scripts
                                D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk
                                D:\Utilities\PortableApps\Notepad++\python27.zip
                                D:\Utilities\PortableApps\Notepad++\DLLs
                                D:\Utilities\PortableApps\Notepad++\lib
                                D:\Utilities\PortableApps\Notepad++\lib\plat-win
                                D:\Utilities\PortableApps\Notepad++\lib\lib-tk
                                D:\Utilities\PortableApps\Notepad++

                                • C
                                  Claudia Frank @patrickdrd
                                  last edited by Jun 2, 2018, 3:04 PM

                                  The correct one is

                                  D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk

                                  but those

                                  D:\Utilities\PortableApps\Notepad++\lib
                                  D:\Utilities\PortableApps\Notepad++\lib\plat-win
                                  D:\Utilities\PortableApps\Notepad++\lib\lib-tk

                                  are strange. Could it be that you unzipped only part of the Tk packages into
                                  D:\Utilities\PortableApps\Notepad++\ ?

                                  Can you check if you have the following files under D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk

                                  Canvas.py
                                  Dialog.py
                                  FileDialog.py
                                  FixTk.py
                                  ScrolledText.py
                                  SimpleDialog.py
                                  Tix.py
                                  tkColorChooser.py
                                  tkCommonDialog.py
                                  Tkconstants.py
                                  Tkdnd.py
                                  tkFileDialog.py
                                  tkFont.py
                                  Tkinter.py
                                  tkMessageBox.py
                                  tkSimpleDialog.py
                                  ttk.py
                                  turtle.py

                                  You might see additional files with extension pyc - that’s ok.
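
Instead of comparing the folder by eye, the check can be scripted. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and `EXPECTED` contains just a few names from the list above to keep it short. Names are compared in lower case because Windows file names are case insensitive.

```python
import os

# A few of the file names from the list above (extend as needed).
EXPECTED = ["Tkinter.py", "Tkconstants.py", "FixTk.py", "ttk.py"]


def missing_files(directory, expected=EXPECTED):
    """Return the expected file names that are not present in directory."""
    if not os.path.isdir(directory):
        return list(expected)          # nothing can be present
    present = set(name.lower() for name in os.listdir(directory))
    return [name for name in expected if name.lower() not in present]


# Example call with the folder named in this post:
folder = r"D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk"
print(missing_files(folder) or "all expected files found")
```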

                                  If you do have the files, delete the D:\Utilities\PortableApps\Notepad++\lib directory.
                                  If you don’t have the files under D:\Utilities\PortableApps\Notepad++\lib\lib-tk but
                                  within D:\Utilities\PortableApps\Notepad++\lib, then cut D:\Utilities\PortableApps\Notepad++\lib and paste it into
                                  D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\

                                  Cheers
                                  Claudia

                                  • P
                                    patrickdrd
                                    last edited by Jun 2, 2018, 3:34 PM

                                    I can’t find either D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk or D:\Utilities\PortableApps\Notepad++\lib folder in explorer!

                                    • C
                                      Claudia Frank @patrickdrd
                                      last edited by Jun 2, 2018, 3:58 PM

                                      so how did you install Tcl/Tk libraries?

                                      Cheers
                                      Claudia

                                      • P
                                        patrickdrd
                                        last edited by Jun 2, 2018, 4:01 PM

                                        I extracted the zip of course, the folder you say is in:
                                        d:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk\

                                        both in zip file and my explorer!

                                        • P
                                          patrickdrd
                                          last edited by Jun 2, 2018, 4:22 PM

                                          I’ve just read guy038’s post and I’m more confused :S

                                          I downloaded the file again and now it’s Last modified: 02 Jun 2018 16:00 UTC
                                          and 69930 results,
                                          sorting with insensitive (ue and textfx) yields 69284 and the output should be similar,
                                          so I should be satisfied by that consensus I guess?
