    Remove duplicate numerical lines

    • rizla kostas

      The duplicate data looks like this:

      5.8,7.9,5.7,0.1,2.3,0.1,2.8,0.1,2.9,5.0,1.2,5.0,1.2,7.0,1.5,0.1,1.6,5.0,1.8,5.0,2.8,5.0,1.9,0.1,3.4,5.0,3.1,0.1,1.2,7.0,2.7,5.0,2
      7.3,2.6,5.7,0.1,3.6,0.1,2.7,5.0,2.0,5.0,1.5,2.0,1.2,8.0,1.2,2.0,1.9,5.0,1.5,7.0,2.3,5.0,1.9,5.0,4.3,5.0,4.5,0.1,1.4,2.0,2.2,5.0,1
      7.2,6.6,5.7,0.1,2.8,0.1,2.8,0.1,2.4,0.1,1.3,8.0,1.2,7.0,1.3,3.0,1.7,8.0,1.7,2.0,2.5,5.0,1.9,0.1,3.9,0.1,4.5,0.1,1.3,3.0,2.5,0.1,0
      4.5,5.5,5.7,0.1,1.7,5.0,3.3,0.1,3.9,0.1,1.1,3.0,1.1,8.0,1.8,7.0,1.5,5.0,2.1,0.1,3.5,5.0,2.1,0.1,2.7,5.0,1.6,0.1,1.1,7.0,3.4,0.1,2
      4.3,3.3,2.7,0.1,1.3,2.0,4.1,0.1,7.1,0.1,1.0,4.0,1.1,1.0,2.3,0.1,1.6,3.0,1.8,7.0,4.6,0.1,2.0,5.0,2.3,0.1,9.1,0.1,1.1,1.0,4.1,0.1,0
      5.8,7.9,5.7,0.1,2.3,0.1,2.8,0.1,2.9,5.0,1.2,5.0,1.2,7.0,1.5,0.1,1.6,5.0,1.8,5.0,2.8,5.0,1.9,0.1,3.4,5.0,3.1,0.1,1.2,7.0,2.7,5.0,2
      7.3,2.6,5.7,0.1,3.6,0.1,2.7,5.0,2.0,5.0,1.5,2.0,1.2,8.0,1.2,2.0,1.9,5.0,1.5,7.0,2.3,5.0,1.9,5.0,4.3,5.0,4.5,0.1,1.4,2.0,2.2,5.0,1
      7.2,6.6,5.7,0.1,2.8,0.1,2.8,0.1,2.4,0.1,1.3,8.0,1.2,7.0,1.3,3.0,1.7,8.0,1.7,2.0,2.5,5.0,1.9,0.1,3.9,0.1,4.5,0.1,1.3,3.0,2.5,0.1,0
      4.5,5.5,5.7,0.1,1.7,5.0,3.3,0.1,3.9,0.1,1.1,3.0,1.1,8.0,1.8,7.0,1.5,5.0,2.1,0.1,3.5,5.0,2.1,0.1,2.7,5.0,1.6,0.1,1.1,7.0,3.4,0.1,2
      4.3,3.3,2.7,0.1,1.3,2.0,4.1,0.1,7.1,0.1,1.0,4.0,1.1,1.0,2.3,0.1,1.6,3.0,1.8,7.0,4.6,0.1,2.0,5.0,2.3,0.1,9.1,0.1,1.1,1.0,4.1,0.1,0

      How do I remove the duplicate lines? Some help, please.

      • Terry R

        @rizla-kostas
        We’re going to need some more information.

        1. Can you use the sort lines function? Sorting puts each duplicate next to its original, which makes the regular expression (regex) much easier to create.
        2. If the answer to 1 is “no”, is the file large, say 20000 lines or more? With large files, a regex that uses a lookahead may fail.
        3. When a duplicate is found, which line is to be removed? This question matters mainly if you don’t want the lines sorted; if the lines are sorted, removing either line makes no difference.

        I note that your example has 5 pairs of lines. Is this a good example of the real data (a high percentage of duplicates)?

        Terry

        • rizla kostas

          Hi, thanks for the reply.

          Can you use the sort lines function?
          It’s about 6,500 lines and I can’t sort them.

          When a duplicate is found, which line is to be removed?
          The last one.

          I note that your example has 5 pairs of lines, is this a good example of the real data (high percentage of duplicates)?
          No, the percentage of duplicates is low; I just wanted to show the kind of data.

          • Scott Sumner

            We have been around the block with the regular expression solution to this. There are also Pythonscript and OS-level solutions. How about one more KISS version of a Pythonscript? This is about as simple and barebones as it gets…maybe let’s see what kind of limitations are encountered with its use:

            from Npp import notepad, editor
            # line-ending of the current file (Windows, old Mac, or Unix)
            eol = ['\r\n', '\r', '\n'][notepad.getFormatType()]
            line_dict = {}            # every distinct line seen so far (as keys)
            line_removal_list = []    # line numbers of later duplicates
            for j in range(editor.getLineCount()):
                l = editor.getLine(j)
                # a line no longer than the line-ending is treated as blank and skipped
                if len(l) > len(eol):
                    if l in line_dict:
                        line_removal_list.append(j)
                    else:
                        line_dict[l] = None
            if len(line_removal_list) > 0:
                editor.beginUndoAction()
                # remove lines in highest-line-number to lowest-line-number fashion:
                for j in line_removal_list[::-1]: editor.deleteLine(j)
                editor.endUndoAction()
            
            • rizla kostas

              Nice. This is a Python script; how do I run it in Notepad++?

              So you store all the lines in an array and remove the duplicates?

              • Scott Sumner

                @rizla-kostas

                python script how to run it in notepad++

                Well you need to install the Pythonscript plugin. :)

                The script makes the contents of each line a dictionary key (thus, unique). As each line is examined, if it is already a key in the dictionary, we know that the line has already occurred, so its line number is added to a list of line numbers to delete. After all lines have been examined, we run through the list of duplicate line numbers in reverse order (high-to-low) and delete them. Why high-to-low? Because deleting low-to-high would shift the remaining line numbers. For example, if you need to delete lines 5 and 7 and you delete line 5 first, the original line 7 is now line 6! If you delete line 7 first, then line 5 is still the one you want to delete next.
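
The approach described above can be sketched outside Notepad++ on a plain Python list. This is only an illustration under stated assumptions: `remove_duplicate_lines` and the sample data are made up for the example, and the lines are assumed to have their line-endings already stripped.

```python
def remove_duplicate_lines(lines):
    """Return a copy of `lines` with later duplicates removed (blank lines kept)."""
    seen = {}          # line content -> None; only the keys matter
    to_delete = []     # indices of lines that repeat an earlier line
    for index, line in enumerate(lines):
        if line == "":
            continue   # blank lines are never treated as duplicates
        if line in seen:
            to_delete.append(index)
        else:
            seen[line] = None
    result = list(lines)
    # Delete from the highest index down so the remaining indices stay valid.
    for index in reversed(to_delete):
        del result[index]
    return result


sample = ["a", "b", "", "a", "c", "", "b"]
print(remove_duplicate_lines(sample))   # prints ['a', 'b', '', 'c', '']
```

Walking `to_delete` in reverse is what keeps each recorded index pointing at the line it was recorded for.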

                • PeterJones

                  @rizla-kostas ,

                  There is a plugin for Notepad++ called “PythonScript”, which embeds a Python interpreter inside the plugin, and allows automation of the Notepad++ GUI/Environment/editor-component through the Python language. If you install PythonScript (some useful links below), then you can run those programs from the PythonScript plugin’s menu.

                  -----

                  • PythonScript HOME
                  • PythonScript DOWNLOAD
                  • HELP = Plugins > Python Script > Context-Help
                  • Getting Started with Python
                  • rizla kostas

                    Thank you so, so much, all of you guys behind Notepad++.

                    I will test it tomorrow and report back. Thanks again.

                    • Terry R

                      @Scott-Sumner
                      I just tested your pythonscript and I think it misses 1 dup, possibly due to the last line having no CRLF. I added that and it then worked as expected (for me).

                      I’m trying to learn pythonscript, but I’m unable to see where in your code the problem might be arising.

                      Terry

                      • Scott Sumner

                        @Terry-R

                        Hey Terry!

                        Is the last line which doesn’t have a line-ending REALLY a duplicate of an earlier line that does have a line-ending? :) Well, okay, it IS if we are talking about line-endingless content, which we (probably) are.

                        Anyway, the culprit line in the code would be the one with editor.getLineCount() in it. You will have one less line without a line-ending on your last line, and thus the range function will cause it to go one less iteration. But also to blame is that when the script remembers a previously encountered line, it does so WITH THE LINE-ENDING ON. So there’s a double reason for failure here.

                        I don’t like files without line-endings on their last lines. I sure do wish there was an option in N++ to automatically make sure lines all have proper ends on them. [Of course I have a Pythonscript that makes sure of this for me, so I don’t usually remember to take this stuff into account.]

                        BTW, note that the script ignores blank lines; something I should have mentioned earlier.

                        • PeterJones

                          So @Scott-Sumner, are you going to leave us hanging? You need to publish the code to add the line-ending to the last line, if it’s missing it, so that your above code works properly. :-)

                          • Scott Sumner

                            @PeterJones said:

                            You need to publish the code to add the line-ending to the last line…so that your above code works properly

                            HAHa. I will, but right now it looks overcomplicated for general use. :-) I’ll work on it and post back here when it is suitable for general consumption…

                            In the meantime, why don’t we just fix the original code? I found that all that is needed is to change this line:

                            l = editor.getLine(j)
                            

                            into this:

                            l = editor.getLine(j).rstrip('\n\r')
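
Why stripping helps can be seen with two plain strings. This is a minimal sketch, not part of the original script; `stored` and `last` are made-up sample values standing in for what `editor.getLine` would return for an earlier line and for a final line with no line-ending.

```python
stored = "5.0,1.2\r\n"   # an earlier line, returned with its line-ending attached
last = "5.0,1.2"         # the same content on a final line that has no line-ending

print(stored == last)                                # False: the raw strings differ
print(stored.rstrip('\n\r') == last.rstrip('\n\r'))  # True: the contents match
```

One thing that may be worth checking: once the line-ending is stripped, the `len(l) > len(eol)` test in the original script compares the content length against the line-ending length, so on a Windows file a line of only one or two characters would be skipped as if it were blank. Testing `if l:` instead would probably avoid that.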
                            
                            • Eko palypse

                              @Scott-Sumner

                              What about using OrderedDict from collections?
                              It preserves the ordering, and dict keys are unique by definition.

                              from Npp import editor
                              from collections import OrderedDict
                              _dict = OrderedDict.fromkeys(editor.getText().splitlines())
                              editor.setText('\r\n'.join(_dict.keys()))
                              

                              Eko

                              • Scott Sumner

                                @Eko-palypse said:

                                what about…?

                                Sure, why not? Only objection might be the empty line case (my experience is that people usually want their blank lines retained as is, and not removed as duplicates).
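
The empty-line case can be illustrated without Notepad++. A small sketch, assuming a made-up sample `text` with Windows line-endings: because an empty string is a dictionary key like any other, only the first blank line survives.

```python
from collections import OrderedDict

text = "a\r\n\r\nb\r\n\r\na\r\n"
unique = list(OrderedDict.fromkeys(text.splitlines()))
print(unique)   # prints ['a', '', 'b']
```

The second blank line is removed together with the repeated `a`, which is the behaviour people usually do not want.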

                                • Eko palypse

                                  @Scott-Sumner

                                  Right, this case makes it a little more difficult, agreed.

                                  Eko

                                  • Eko palypse

                                    @Scott-Sumner

                                    What about this

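                                    from Npp import editor
                                    # an empty final line means the text ends with a line-ending
                                    lastLineContainsEOL = len(editor.getLine(editor.getLineCount()-1)) == 0
                                    lines = editor.getText().splitlines()
                                    uniqueLines = set(lines)    # lines not yet written out
                                    newText = ''
                                    for line in lines:
                                        # write a line on its first occurrence; always keep blank lines
                                        if line in uniqueLines or line.strip() == '':
                                            newText += line + '\r\n'
                                            if line.strip() != '':
                                                uniqueLines.remove(line)    # later copies are now skipped
                                    # drop the final '\r\n' again if the original text had none
                                    editor.setText(newText if lastLineContainsEOL else newText[:-2])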
                                    
                                    • generates unique lines only (ignoring empty lines with and without spaces)
                                    • preserves ordering
                                    • preserves usage of last EOL

                                    Eko

                                    • Scott Sumner

                                      @Eko-palypse said:

                                      What about this

                                      Sure. I say “whatever works”. Much like I don’t get all fancy about shaving a few characters off a regex, I think with scripts it is to each his own. As long as it does the job, it is super. :-)

                                      • Scott Sumner

                                        @Eko-palypse

                                        One comment, though: I’m guessing you pretty much exclusively use Windows. I use Windows/Linux about 75%/25%…because of that I have learned to not think that line-endings are always \r\n. So scripts I post here will work (that’s the goal anyway) with either Windows or Linux (or even Mac) files.

                                        This may be something you want to consider doing as well. But it doesn’t bother me if you don’t because I understand the meaning of it–for someone that just wants to blindly pick up and use a script and doesn’t understand Python, oh and BTW uses Linux files…it could be a problem.
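
One way to avoid assuming any particular line-ending is to leave each kept line's own ending attached and compare only the content. The following is a rough sketch in plain Python, not taken from any script in this thread; `dedupe_keep_eols` is a hypothetical name, and the text is assumed to use only `\r\n`, `\r`, or `\n` as line boundaries (`str.splitlines` also splits on a few other characters).

```python
def dedupe_keep_eols(text):
    """Drop repeated non-blank lines, leaving every kept line's ending untouched."""
    seen = set()
    kept = []
    for raw in text.splitlines(True):     # True keeps each line's ending attached
        content = raw.rstrip('\r\n')      # compare the content only
        if content.strip() == '':
            kept.append(raw)              # blank lines are always retained
        elif content not in seen:
            seen.add(content)
            kept.append(raw)
    return ''.join(kept)


print(repr(dedupe_keep_eols("x\ny\n\nx\ny")))   # prints 'x\ny\n\n'
```

Because the final `y` is itself a duplicate and gets dropped, the result here ends with a line-ending even though the input did not.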

                                        BTW, good job! I like seeing Pythonscripts besides my own posted here. Not many people are doing it anymore. :-(

                                        • Scott Sumner

                                          @PeterJones said:

                                          You need to publish the code to add the line-ending to the last line, if it’s missing it

                                          Ok, so here it is. I run a similar (but more complicated) one for my own needs from my startup.py so that it is always in place, and thus I never have to deal with files without line-endings on their last lines.

                                          One thing I don’t like, but haven’t found a good method for handling, is that in certain circumstances (e.g. a Save All), after the script does its work, it can leave you sitting in a tab that is different from the tab that was active before. If people are interested in this script and have ideas about solving that particular problem, I’m interested in hearing them.

                                          Here’s the Pythonscript:

                                          from Npp import notepad, editor, NOTIFICATION
                                          
                                          def callback_npp_FILEBEFORESAVE(args):
                                              line_ending = ['\r\n', '\r', '\n'][notepad.getFormatType()]
                                              doc_size = editor.getTextLength()
                                              if editor.getTextRange(doc_size - 1, doc_size) != line_ending[-1]:
                                                  # fix Notepad++'s "broken" functionality and add a line-ending at end-of-file
                                                  editor.appendText(line_ending)
                                          
                                          notepad.callback(callback_npp_FILEBEFORESAVE, [NOTIFICATION.FILEBEFORESAVE])
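
The check inside the callback can be tried on ordinary strings. This is a sketch only: `ensure_trailing_eol` is a hypothetical helper that mirrors the comparison above (last character of the text against the last character of the line-ending) without touching the Notepad++ objects.

```python
def ensure_trailing_eol(text, line_ending):
    """Append `line_ending` unless the text already ends with its last character."""
    if text[-1:] != line_ending[-1]:
        return text + line_ending
    return text


print(repr(ensure_trailing_eol("a\r\nb", "\r\n")))   # prints 'a\r\nb\r\n'
print(repr(ensure_trailing_eol("a\n", "\n")))        # prints 'a\n'
```

Note that in this sketch an empty text also gets a line-ending appended, since `''[-1:]` is an empty string.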
                                          
                                          • Eko palypse

                                            @Scott-Sumner said:

                                            One comment, though: I’m guessing you pretty much exclusively use Windows. I use Windows/Linux about 75%/25%…because of that I have learned to not think that line-endings are always \r\n. So scripts I post here will work (that’s the goal anyway) with either Windows or Linux (or even Mac) files.

                                            Good point and you offered the solution already, even better :-D

                                            from Npp import notepad, editor
                                            # an empty final line means the text ends with a line-ending
                                            lastLineContainsEOL = len(editor.getLine(editor.getLineCount()-1)) == 0
                                            line_ending = ['\r\n', '\r', '\n'][notepad.getFormatType()]
                                            lines = editor.getText().splitlines()
                                            uniqueLines = set(lines)
                                            newText = ''
                                            for line in lines:
                                                if line in uniqueLines or line.strip() == '':
                                                    newText += line + line_ending
                                                    if line.strip() != '':
                                                        uniqueLines.remove(line)
                                            # strip exactly one line-ending (1 or 2 characters) if the original had none
                                            editor.setText(newText if lastLineContainsEOL else newText[:-len(line_ending)])
                                            

                                            Eko
