Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks



  • Good morning. How can I find and see only the duplicated lines in a file? Thanks.



  • @Jose-Emilio-Osorio said in Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks:

    How to see find and see only the lines duplicated in a file ?.

    As you haven’t provided much information, all I can offer is how I would do it:

    1. Sort all the lines. This places lines that are the same, or even just similar from the first character onward, next to each other.
    2. Use a regex (regular expression) to find those duplicate lines. Depending on what you actually want, I would either mark those duplicates and cut (or copy) them to another file, or remove all lines that have ONLY 1 instance (that is, lines that are not duplicates). The latter is a destructive operation, so it should be done on a copy of the file, not the original.
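
    As a rough illustration of the two steps above outside the editor, here is a minimal Python sketch. The function name `duplicated_groups` and the sample data are made up for the example; it only shows the idea of sorting first and then looking at neighbouring lines.

```python
from itertools import groupby


def duplicated_groups(lines):
    """Return (line, count) for every line that occurs more than once."""
    # Step 1: sorting places identical lines next to each other.
    ordered = sorted(lines)
    # Step 2: walk the sorted lines and keep only runs longer than one.
    result = []
    for text, run in groupby(ordered):
        count = sum(1 for _ in run)
        if count > 1:
            result.append((text, count))
    return result


# Hypothetical sample data, not taken from the thread.
sample = ["pear", "apple", "fig", "apple", "pear", "apple"]
print(duplicated_groups(sample))
```

    Lines that occur only once ("fig" in the sample) are left out, which corresponds to the second option in step 2.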

    Terry




  • @Terry-R
    Thanks for your quick response.
    I am using the following regex, but it does not delete the lines: ^(.?)$\s+?^(?=.^\1$)
    It only marks them.



  • @Jose-Emilio-Osorio said in Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks:

    I am using the following regex, but it does not delete the lines

    You didn’t originally say your goal was deletion.
    Since that is now the case, why not use the Edit menu’s Remove Duplicate Lines command (under the Line Operations submenu)?

    Using regex to delete duplicate lines on an unsorted file is problematic and not advised (unless you want to go through a lot of complicated steps).



  • @Terry-R
    Sorry for my badly worded question.
    I want to delete the duplicated lines. I am using the following regex, but it does not delete them: ^(.?)$\s+?^(?=.^\1$)
    When I search again to confirm, the lines are still there.
    Version 7.9.2



  • Hola @Jose-Emilio-Osorio

    Could you provide an example? Because at first glance I don’t think the posted regex can match consecutive duplicated lines. Maybe the issue is in the specific regular expression you wrote.

    Now, if the regex fits your needs, then, as mentioned above, to eliminate the duplicated lines just leave the replacement field empty, as follows:

    Search: ^(.?)$\s+?^(?=.^\1$)
    Replace: [leave empty]
    

    Take care and have fun!



  • @astrosofista

    Do what you will, but I’m not sure why anyone would pursue removal of duplicate lines with regular expression replacement anymore.



  • @Alan-Kilborn said in Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks:

    @astrosofista

    Do what you will, but I’m not sure why anyone would pursue removal of duplicate lines with regular expression replacement anymore.

    To better understand the issue at hand, I read the OP’s previous post, and it seems that he wants or needs to check the outcome before deleting the unwanted lines. Or maybe he is unaware of the new functionality.

    Does it make sense?



  • @astrosofista
    Correct.
    I want and need to check the outcome before deleting the unwanted lines.
    Sorry for not explaining better what I really need.



  • @Jose-Emilio-Osorio said in Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks:

    I want and need to check the outcome before deleting the unwanted lines.

    From your posts thus far it would appear that English is NOT your primary language.

    Given that, it makes more sense for us to ask the questions we think need answering so that we can help you.

    As @Alan-Kilborn stated, using a regex to find duplicates among unsorted lines can be problematic. The issue is that the regex engine can be overwhelmed by a lookahead (such as your (?=.\1$)) if it needs to search through the remainder of the file and there are many lines.
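
    To give a feel for why this gets expensive, here is a small Python sketch. It is only a conceptual model and not how the regex engine in Notepad++ is implemented; both function names are made up for the example. The first function rescans the rest of the file for every line, which is roughly the work a "does this line occur again further down?" lookahead asks for. The second gets the same answer in a single pass.

```python
def repeated_later_naive(lines):
    """Indexes of lines that occur again further down; rescans the rest each time."""
    hits = []
    for i, text in enumerate(lines):
        # For every line, look through everything below it (quadratic work overall).
        if text in lines[i + 1:]:
            hits.append(i)
    return hits


def repeated_later_indexed(lines):
    """Same result in one backward pass, remembering the lines already seen below."""
    seen_below = set()
    hits = []
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] in seen_below:
            hits.append(i)
        seen_below.add(lines[i])
    hits.reverse()
    return hits


# Hypothetical sample data, not taken from the thread.
sample = ["b", "a", "c", "a", "b", "d"]
print(repeated_later_naive(sample))
print(repeated_later_indexed(sample))
```

    With only a handful of lines the difference is invisible; it grows with the square of the line count in the first version.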

    The normally accepted method of marking the lines (as you say, you need to check the duplicates before removing them) is to first sort the lines so that duplicate lines appear together. Often this means additional steps: insert line numbers, sort, mark/remove duplicates, re-sort by the line numbers to get back to the original order, and then remove the line numbers. Here is another post where we did exactly this:
    https://community.notepad-plus-plus.org/topic/20173/delete-near-duplicate-lines
    Unfortunately, those exact steps will not work for you, but the process is the same.
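
    The number, sort, mark, re-sort sequence can be sketched in Python as below. This is an illustration of the process only, not the steps from the linked post; the function name and sample data are made up, and it assumes the first copy of each line is the one to keep unflagged.

```python
def flag_duplicates_keep_order(lines):
    """Return (original_line_number, text, is_duplicate) in the original order."""
    # 1. Attach 1-based line numbers so the original order can be restored later.
    numbered = list(enumerate(lines, start=1))
    # 2. Sort by text; equal texts stay in ascending line-number order,
    #    so the earliest copy of each line comes first in its group.
    by_text = sorted(numbered, key=lambda item: (item[1], item[0]))
    # 3. Flag every line that equals the line just before it in sorted order.
    flagged = []
    previous = None
    for number, text in by_text:
        flagged.append((number, text, text == previous))
        previous = text
    # 4. Re-sort by line number to get back to the original order.
    flagged.sort(key=lambda item: item[0])
    return flagged


# Hypothetical sample data, not taken from the thread.
sample = ["x", "y", "x", "z", "y", "x"]
for number, text, is_duplicate in flag_duplicates_keep_order(sample):
    print(number, text, "DUPLICATE" if is_duplicate else "")
```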

    So questions are:

    1. If a duplicate exists, which line needs to be marked: the first (original) or the second (duplicate)?
    2. Can there be more than 2 lines that are the same? That is, an original, duplicate #1 and duplicate #2.
    3. How many lines exist in the file and what is the approximate size (in characters)?
    4. Are you OK with using the multiple steps as suggested above? The result will still be that the lines are back in the correct (original) order.
    5. If using the “Mark” function to identify the duplicates, then either you would need to verify and delete before putting lines back in original order, or further mark (add text to) those lines so you can see them in their original location prior to deleting. This is because the sort function removes the marks (just tested the process to confirm this). Can you verify the (duplicate) lines out of original order, or do you need to see each line in its original location within the file before removing it?

    Terry



  • I like regex but I still would avoid it for this, due to the complications involved. Here’s what I would do:

    • obtain a file compare utility if you don’t have one (the Compare plugin for Notepad++ is a suitable choice)
    • make a copy of your original file
    • in the copy, do Remove Duplicate Lines, then save the file
    • use the compare utility to compare the original file with the now-modified copy; this will show you the duplicates rather easily and you can decide where to go from there
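
    For anyone who prefers to script it, the same copy, deduplicate, compare idea can be sketched in Python. Two assumptions are baked in: that Remove Duplicate Lines keeps the first copy of each line, and that a simple side-by-side walk is an acceptable stand-in for a real compare utility. Both function names are made up for the example.

```python
def remove_duplicate_lines(lines):
    """Keep the first occurrence of each line, preserving order (assumed behaviour)."""
    seen = set()
    kept = []
    for text in lines:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def lines_only_in_original(original, deduped):
    """Walk both lists side by side and report (line_number, text) missing from the copy."""
    removed = []
    j = 0
    for number, text in enumerate(original, start=1):
        if j < len(deduped) and text == deduped[j]:
            j += 1  # this line survived in the copy
        else:
            removed.append((number, text))  # this line was dropped as a duplicate
    return removed


# Hypothetical sample data, not taken from the thread.
original = ["a", "b", "a", "c", "b"]
copy_without_duplicates = remove_duplicate_lines(original)
print(lines_only_in_original(original, copy_without_duplicates))
```

    The original list is never modified, so the duplicates can be reviewed before anything is deleted for real.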


  • @Terry-R
    Good Morning Terry,
    Answers:

    1. The second line (the duplicate).
    2. There can be more than 2 duplicate lines. It’s a possibility, but I haven’t had that problem yet.
    3. Lines = 98252; Length = 6484502; Col = 64
    4. It can be in ascending order. I don’t need it in the original order.
    5. I just need to verify which are the duplicates before they are deleted. If the file is sorted in ascending order, it can look like this. That way it would be easier to review and delete later. It does not need to be in the original position.

    Thanks for your help.



  • @Jose-Emilio-Osorio said in Good Morning. How to see find and see only the lines duplicated in a file ?. Thanks:

    I just need to verify which are the duplicates before they are deleted.

    Since you say that you aren’t interested in getting the lines back to the original order, the solution becomes much easier. It also means that you should not be concerned about which line is actually marked for deletion (question #1).

    So the first step is to sort the file. You can select either ascending or descending lexicographical order, it makes no difference, except to the visual aspect (possibly easier on the eye when sorted ascending). This is achieved by selecting the “Sort Lines lexicographically Ascending” or “Sort Lines lexicographically Descending” which is under “Line Operations”, under the Edit menu.

    Next we just need to mark (bookmark) the lines which are duplicates. This is achieved by using the “Mark” function, under “Search” menu.
    Find What: (?-s)^(.+)\R(?=\1)
    Make sure “Bookmark line” is ticked. The cursor should be on the first line of the file, otherwise some duplicates might get missed. Wrap around should not be ticked (as we don’t want to consider the last line as a duplicate of the first). Search Mode must be “regular expression”. Click on the “Mark All” button.

    The file will now show a number of lines that have a mark in the left margin, normally a blue circle. You can use “F2” or “Shift-F2” to move to the next (or previous) bookmarked line for viewing. These bookmarks might be erased if you perform further operations on the file, such as another sort. If this happens, just run the Mark operation again.

    To delete the bookmarked lines use the “Remove Bookmarked Lines” listed under “Bookmark” which is under the “Search” menu.
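
    To see which lines the expression picks up, here is a rough Python approximation of the mark and remove steps. Python's `re` module is not the engine Notepad++ uses and has no `\R`, so `\n` is used instead and the text is assumed to have plain `\n` line endings; the function names and sample data are made up for the example.

```python
import re

# Approximation of (?-s)^(.+)\R(?=\1): "." does not match newlines by default in Python,
# and re.MULTILINE lets "^" match at the start of every line.
MARK = re.compile(r"^(.+)\n(?=\1)", re.MULTILINE)


def bookmarked_line_numbers(text):
    """1-based numbers of the lines the expression matches (input assumed sorted)."""
    numbers = []
    for match in MARK.finditer(text):
        numbers.append(text.count("\n", 0, match.start()) + 1)
    return numbers


def remove_lines(text, numbers):
    """Return the text without the given 1-based line numbers."""
    drop = set(numbers)
    kept = [line for n, line in enumerate(text.split("\n"), start=1) if n not in drop]
    return "\n".join(kept)


# Hypothetical sample data, sorted first as in the steps above.
sorted_text = "\n".join(sorted(["pear", "apple", "fig", "apple", "pear", "apple"]))
marked = bookmarked_line_numbers(sorted_text)
print(marked)
print(remove_lines(sorted_text, marked))
```

    In each group of identical lines, every line except the last is matched, so one copy survives the removal. One thing this approximation shows: because the lookahead is not anchored at the end of the line, a line that is merely the beginning of the next line (for example "abc" followed by "abcd") is matched as well. If that can occur in your data, ending the lookahead with `$`, as in `(?=\1$)`, avoids it in this Python version.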

    Terry

