    How to delete all lines found in another txt document

    13 Posts 5 Posters 2.9k Views
    • CookieXD

      I have one document called “source.txt” and one called “delete.txt”.
      I would like to delete every line that is in “delete.txt” from “source.txt”.
      That process should be easily repeatable, because I will have to do that many times…

      Thanks a lot for your help!

      • guy038

        Hello, @cookiexd and All,

        Here is an easy work-around :

        • Open your source.txt file, in Notepad++

        • At the end of source.txt file, add a new line beginning with, at least, three = equal symbols

        • Then, append the contents of the delete.txt file, after the line =====...

        • To end, add an empty line, at the very end of the file ( IMPORTANT )

        • Now, open the Replace dialog

        • SEARCH (?s-i)^((?-s).+\R)(?=.*^====+\R.*^\1)|^=+\R.+

        • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click, once, on the Replace All button

        • Save the modifications of your source.txt file


        Notes :

        • First, the in-line modifiers (?s-i) force the regex engine, by default :

          • To act in a case-sensitive way (?-i)

          • To treat the regex symbol . as matching any single character, even an EOL one (?s)

        • Now, this search regex contains two alternatives, separated with the alternation symbol | :

          • In the first alternative ^((?-s).+\R)(?=.*^====+\R.*^\1) :

            • The part ((?-s).+\R) matches any non-empty range of standard characters, due to the (?-s) modifier, followed by its line-break \R ( so, a complete line ! ) and stores it as group 1, due to the outer parentheses

            • But… ONLY IF it is followed by the same line \1, somewhere further on, after the separator line ===== ..., due to the look-ahead construction (?=.*^====+\R.*^\1)

          • In the second alternative ^=+\R.+, which applies once all the initial lines of the source.txt file have been processed :

            • The part ^=+\R matches the complete separator line ===..., with its line-break

            • Then, the part .+ selects all the subsequent characters, even EOL ones, after this separator line, till the very end of file

        • Whatever the alternative selected, the matched contents are simply deleted as the Replace zone is empty

        Best Regards,

        guy038

          • DimakSerpg @guy038

          @guy038 why it’s so complicated?
          You are saying “add a new line beginning with, at least, three = equal symbols”
          But then you are saying “after the line =====…”

          So it’s eight symbols now, and for some reason there are 3 dots??
          What?
          I’m unfamiliar with notepad and this doesn’t work.

            • PeterJones @DimakSerpg

            @DimakSerpg said in How to delete all lines found in another txt document:

            why it’s so complicated?

            Because it’s essentially trying to recreate a full programming language or database system in something that’s meant for text editing, not database manipulation. I have never heard of a text editor in which “delete all lines found in another txt document” is implemented natively.

            If you search the forum, there are also examples of using the PythonScript plugin to do essentially the same thing programmatically.

            I’m unfamiliar with notepad

            The application is Notepad++, not notepad. There’s a difference (the latter being the simple app that Microsoft has included with Windows for decades, the former being the high-powered text editor that we talk about in this Forum).

            and this doesn’t work.
            …
            So it’s eight symbols now, and for some reason there are 3 dots??

            He is this forum’s acknowledged regex guru, but even a guru can sometimes make mistakes or not explain things well (especially when they are communicating technical information in a language other than their native language)

            I believe the ... was supposed to indicate that there could be more beyond the initial three equals symbols. And I believe that showing five equals ===== instead of three equals === was just enthusiasm on Guy’s part.

            If it helps, think of those instructions as

            • At the end of source.txt, add a new line beginning with at least three = equal symbols
            • Then append the contents of the delete.txt file after the line you just added

            And given the instructions above, the SEARCH line needs to change as well:

            • SEARCH (?s-i)^((?-s).+\R)(?=.*^===+\R.*^\1)|^=+\R.+
              (it should only have 3 = in a row, not the 4 that Guy originally showed)

            So assuming
            original source.txt:

            this is okay
            delete me
            this was good
            i should be deleted
            fine
            

            and original delete.txt:

            i should be deleted
            delete me
            

            those would be merged into

            this is okay
            delete me
            this was good
            i should be deleted
            fine
            ===
            i should be deleted
            delete me
            
            

            then running FIND WHAT (?s-i)^((?-s).+\R)(?=.*^===+\R.*^\1)|^=+\R.+ REPLACE WITH <empty>, SEARCH MODE = regular expression, click REPLACE ALL, I get:

            this is okay
            this was good
            fine
            
            

            This sequence successfully eliminated the lines from delete.txt that were in source.txt …

            As with all search/replace instructions that you get from a forum, I highly recommend having a backup copy of any data before you run a REPLACE ALL that you don’t understand.
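For readers who want to try the merge-and-replace idea on a small sample outside Notepad++, here is a minimal Python sketch of the same steps. It is an approximation, not the Notepad++ regex itself: Python's `re` module has no `\R`, so the sketch assumes plain `\n` line endings, and the function name `delete_listed_lines` is made up for illustration. Because every candidate line triggers a fresh scan of the rest of the text, this approach is likely to slow down sharply as the input grows.

```python
import re

# Python analogue of the forum regex, assuming "\n" line endings.
# MULTILINE makes ^ match at every line start; DOTALL lets . cross line breaks.
PATTERN = re.compile(
    r"^([^\n]+\n)(?=.*^===+\n.*^\1)"  # a full line that reappears after the separator
    r"|^=+\n.+",                      # the separator line and everything after it
    re.MULTILINE | re.DOTALL,
)


def delete_listed_lines(source_text, delete_text, separator="==="):
    """Merge the two texts around a separator line, then strip listed lines."""
    # Both parts must end with a line break, as in the manual instructions.
    if not source_text.endswith("\n"):
        source_text += "\n"
    if not delete_text.endswith("\n"):
        delete_text += "\n"
    merged = source_text + separator + "\n" + delete_text
    return PATTERN.sub("", merged)
```

Called with the example source.txt and delete.txt contents above, it returns the three surviving lines. The comparison is case-sensitive, matching the (?-i) modifier.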

              • DimakSerpg @PeterJones

              @PeterJones for some reason it works with your examples.

              But when I use the same method with my text, it doesn’t work.

              Maybe it’s because my text is big? There are like 4.4 million lines, when source and delete files are merged.
              It’s all just numbers. So i want to delete 2 million numbers that are in my source file with 2.4 million numbers.

              After i click “replace all” it just deletes everything.

              But it works without any problems when i pick like 100 lines. So the problem is in 4.4 million lines.

                • PeterJones @DimakSerpg

                @DimakSerpg said in How to delete all lines found in another txt document:

                It’s all just numbers. So i want to delete 2 million numbers that are in my source file with 2.4 million numbers.
                After i click “replace all” it just deletes everything.

                That’s a different problem than we normally see with big files and such activity. Normally, big files make it so that there’s not enough space in the regex memory, and the regex will thus not run… But Guy’s regex was intended to be immune to long files (and my modification should have been, too), since the capture-memory of the regex should only be one line’s worth.

                I’m really surprised that its fallback would be to delete everything. (Well, unless the 2.4M in source.txt aren’t unique, and it just so happens that every line in source.txt is also contained in the 2M lines of delete.txt. It might be worth trying Edit > Line Operations > Remove Duplicate Lines on a copy of source.txt, and seeing if there are still more than 2M lines after the removal; if there are 2M or fewer lines, then it’s entirely possible that every unique line matches a line from delete.txt.)
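The duplicate-line hypothesis can also be checked with a short script instead of the menu command. The following is a minimal Python sketch; the function name `summarize_overlap` and the keys of the returned dictionary are made up for illustration. It counts how many unique source lines exist and how many of them do not appear in the delete list. If that last number is zero, an empty result is the expected outcome and not a failure.

```python
def summarize_overlap(source_lines, delete_lines):
    """Count total, unique, and surviving lines (line endings ignored)."""
    source = [line.rstrip("\r\n") for line in source_lines]
    unique_source = set(source)
    delete = {line.rstrip("\r\n") for line in delete_lines}
    # Unique source lines that no entry in the delete list would remove.
    survivors = unique_source - delete
    return {
        "source_lines": len(source),
        "unique_source_lines": len(unique_source),
        "unique_lines_not_in_delete": len(survivors),
    }
```

Both arguments can be open file objects, since iterating over a text file yields its lines.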

                But it works without any problems when I pick like 100 lines. So the problem is in 4.4 million lines.

                If it’s not multiples of the same line in source.txt, then it’s beyond me. Maybe when Guy or one of the other regex greats has a chance, they can come try to give an alternative that will work with your data.

                It would help if you could provide a list of like 20 lines of source.txt and 5 lines of delete.txt – you can use fake numbers, if there’s something confidential about the numbers, but they should “look like” real data. Someone that has the time and ability could then take those examples, and make huge datafiles that have lots of numbers that are similar to those examples, and see if they can come up with something that works for deleting 2M lines from 2.4M lines of source.

                But I hinted at it before, and will phrase it differently to make it explicit: a text editor is the wrong tool for the job. You are essentially trying to delete a huge number of records from a database – this could probably be done in a database application, and it could be easily done in a few lines of code with a good programming language – but we cannot help you with either database or programming solutions here, because this forum is about Notepad++.

                  • Alan Kilborn @PeterJones

                  @PeterJones said in How to delete all lines found in another txt document:

                  but even a guru can sometimes make mistakes or not explain things well (especially when they are communicating technical information in a language other than their native language)

                  I believe the … was supposed to indicate that there could be more beyond the initial three equals symbols.

                  And I think that the posters receiving information need to actually do some THINKING about what they’re being given…

                    • DimakSerpg @Alan Kilborn

                    @PeterJones I updated notepad, thought it might help, and now there’s an error: [image: error]

                    @Alan-Kilborn said in How to delete all lines found in another txt document:

                    And I think that the posters receiving information need to actually do some THINKING about what they’re being given…

                    Uhh… no? It’s pretty simple.

                    1. do this
                    2. then this
                    3. done

                    I don’t need to know exactly what this command means, I don’t need to learn regex for this. It’s a simple command that would work as it is, but the problem is on my side because of the large text.
                      • guy038

                      Hello, @dimakserpg and All,

                      Could you provide us a small part of your source.txt and delete.txt ( let’s say about 50 lines of each ) ?

                      Try to insert these sections as raw text, using the </> button when writing your post !

                      I will try to find out a new method, suitable for big files !

                      Best Regards,

                      guy038

                      BTW, in my regex, I used the part ^===+\R, which represents a complete line of at least 3 equal signs, followed by its line-break

                      Thus, as long as this line begins with ===, it doesn’t matter if more equal signs are written right after !

                        • PeterJones @DimakSerpg

                        @DimakSerpg said in How to delete all lines found in another txt document:

                        Uhh… no? It’s pretty simple.

                        If it were simple, you would’ve figured it out without help.

                        I don’t need to know exactly what this command means, I don’t need to learn regex for this.

                        That’s a poor attitude. So essentially you are saying, “I don’t need to learn because I can dupe other people into doing it for free for me”. See how much help you receive if you continue with that attitude in life. I have already given you a working solution for reasonable quantities of data, and given you alternate suggestions of non-Notepad++ ideas that you might want to pursue; after this post, I’ve had my say.

                        I notice you also didn’t bother showing any example data, like I requested. And now Guy has requested it as well. If you don’t at least put in that much thought and effort, it will be virtually impossible for someone to help you, even if they were willing to look beyond your attitude.

                        It’s a simple command that would work as it is, but the problem is on my side because of the large text.

                        It’s not a simple command, but it does work correctly with smaller datasets.

                        ----

                        Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

                          • DimakSerpg @guy038

                          @guy038 said in How to delete all lines found in another txt document:

                          Could you provide us a small part of your source.txt and delete.txt ( let’s say about 50 lines of each ) ?

                          @PeterJones said in How to delete all lines found in another txt document:

                          I notice you also didn’t bother showing any example data, like I requested. And now Guy has requested it as well. If you don’t at least put in that much thought and effort, it will be virtually impossible for someone to help you, even if they were willing to look beyond your attitude.

                          [image attachment]

                            • PeterJones @DimakSerpg

                            @DimakSerpg ,

                            [image attachment]

                            Did you notice the part where I said, “Someone ... could then take those examples, and make huge datafiles”? I wasn’t claiming that they would use just the small example; I was saying they needed that small example as a starting point, to try to replicate the problem with the original regex and try to solve it using the extended data.

                            I don’t understand why you are unwilling to provide even that much. Guy has said he’s willing to help you, and all you have to do to receive that help is to provide example data that he can start from. If you choose not to share a small amount of example data, I think even Guy’s willingness to help you will not be able to overcome your lack of effort.

                              • PeterJones @DimakSerpg

                              @DimakSerpg ,

                              REGEX IN NOTEPAD++ IS THE WRONG TOOL FOR THIS JOB!

                              I created three sets of files:

                              1. 100,000 7-digit numbers in each, where it will delete about 1/3 of the ones from source.txt
                              2. 1,000,000 7-digit numbers in each, where it will delete about 1/2 of the ones from source.txt
                              3. 10,000,000 9-digit numbers in each, where it will delete about 1/3 of the ones from source.txt
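Test data of the kind described in this list can be produced with a few lines of code. The sketch below shows one possible way in Python and is not a reconstruction of how the files in this post were built: the function name `make_test_lines`, the `seed` parameter, and the sampling strategy are all assumptions made for illustration.

```python
import random


def make_test_lines(n, digits, delete_fraction, seed=0):
    """Return (source, delete): two lists of n random fixed-width numbers.

    About delete_fraction of the source entries are copied into the delete
    list; the remaining delete entries are fresh random numbers.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    source = [str(rng.randint(low, high)) for _ in range(n)]
    n_shared = int(n * delete_fraction)
    shared = rng.sample(source, n_shared)  # entries that will be deleted
    fresh = [str(rng.randint(low, high)) for _ in range(n - n_shared)]
    delete = shared + fresh
    rng.shuffle(delete)  # avoid keeping the shared entries in one block
    return source, delete
```

Fresh random numbers can collide with source entries by chance, so the fraction actually removed may come out slightly above `delete_fraction`. Writing each list to disk with one number per line gives a source/delete file pair.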

                              I started notepad++ -nosession -multiInst -noPlugin src1e5.txt del1e5.txt and set the regex running on the smallest of those sets.
                              Then, in another Notepad++ session, I spent about 10 minutes coding up a script in Perl, and made sure it worked on the 100,000 line file in under a second. It then worked on the 1,000,000 line file in about 4 seconds. And then it processed the 10,000,000 line file in 4 minutes.

                              I then wrote up this post. By the time I was done with that, it still hadn’t finished running the regex in Notepad++.

                              IyFwZXJsDQp1c2UgNS4wMTI7DQp1c2Ugd2FybmluZ3M7DQp1c2Ugc3RyaWN0Ow0KdXNlIFRpbWU6OkhpUmVzIHF3L3RpbWUvOw0KDQpwcmludCBTVERFUlIgc2NhbGFyIHRpbWUsICJcbiI7DQpteSBAc3JjID0gZG8geyBvcGVuIG15ICRmaCwgJzwnLCAnc3JjMWU3LnR4dCc7IDwkZmg+IH07DQpteSBAZGVsID0gZG8geyBvcGVuIG15ICRmaCwgJzwnLCAnZGVsMWU3LnR4dCc7IDwkZmg+IH07DQpteSAlaDsgQGh7QGRlbH0gPSBAZGVsOw0Kb3BlbiBteSAkZmgsICc+JywgJ291dDFlNy50eHQnOw0Kc2VsZWN0ICRmaDsNCiRcID0gIiI7DQpwcmludCBmb3IgZ3JlcCB7IWV4aXN0cyAkaHskX319IEBzcmM7DQpwcmludCBTVERFUlIgc2NhbGFyIHRpbWUsICJcbiI7DQo
                              

                              If you can figure out how to decode that text box using Notepad++, and run a perl script (not in Notepad++), it’s yours, for free, no tech support provided. Good luck,
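For comparison, the general "few lines of code" approach mentioned earlier in the thread can be sketched in Python. This is not a transcription of the encoded script above; it is a minimal illustration under stated assumptions: the function names `remove_listed_lines` and `filter_file` are made up, the files are assumed to be UTF-8 text with one entry per line, and the delete list is assumed to fit in memory as a set. Each source line then costs a single set lookup, so the work grows roughly linearly with file size.

```python
def remove_listed_lines(source_lines, delete_lines):
    """Yield each source line whose content is not in the delete list."""
    # Compare without line endings so "\n" and "\r\n" files can be mixed.
    unwanted = {line.rstrip("\r\n") for line in delete_lines}
    # A trailing blank line in the delete list should not remove blank lines.
    unwanted.discard("")
    for line in source_lines:
        if line.rstrip("\r\n") not in unwanted:
            yield line


def filter_file(source_path, delete_path, out_path, encoding="utf-8"):
    """Write to out_path every line of source_path not listed in delete_path."""
    with open(source_path, encoding=encoding) as src, \
         open(delete_path, encoding=encoding) as dele, \
         open(out_path, "w", encoding=encoding) as out:
        out.writelines(remove_listed_lines(src, dele))
```

Writing to a separate output file leaves source.txt untouched, which fits the earlier advice about keeping a backup before a bulk deletion.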
