Using latest version 7.5 64-bit. How to remove duplicate lines



  • Greetings, New forum user here.

    I am using the latest version, 7.5 64-bit. I have exceedingly large text files and need to remove duplicate entries.

    How is this accomplished?

    Thanks,
    Brad



  • Welcome to the forum! We’re a helpful community when possible, but we need a little more information.

    1. Define exceedingly large
      • Give us a ballpark figure - even 100 KB could be considered large for a text file, or 10 MB, 5 GB, ???
      • Some users have reported having difficulty even opening files over 2 GiB
    2. Define duplicate entries
      • Are they always a single line duplicated, multiple duplicate lines, etc.?
    3. Provide some example duplicate entries
      • We have some awesome find-and-replace experts on this forum, but without examples that have the duplicate sections clearly identified, it’s more difficult to come up with a solution.
      • If you can’t provide the examples (confidential, for example), provide us with similar text.

    More effort put into defining the problem and communicating it will often reap rewards many times the effort!



  • I am having the same challenge. I am on version 7.5 64-bit.

    I cannot figure out how to simply remove duplicate records.

    Every Google search result says to use TextFX. However, TextFX does not work with 7.5 64-bit.

    I had 250 rows of data, that were just order numbers. For example,
    0040067
    0040134
    0040134
    0040134
    0040134
    0040271
    0040271

    I wanted to remove the duplicates / get the distinct values. Searched google for quite some time and could not find an answer. Ended up copying into Excel and removing duplicates from there.



  • Hello, @matt-czugała and All,

    I was wondering what kind of data you needed to remove duplicate records from. Luckily, thanks to @glennfromiowa’s advice, your example shows that your text is already sorted! In that case, it’s very easy to remove duplicate records, indeed!!

    • Open your file in N++

    • If necessary, add a line-break, after the last line of your file

    • Open the Replace dialog Ctrl + H

    • Type in (?-s)^(.+\R)\1+ , in the Find what: zone

    • Type in \1 , in the Replace with: zone

    • Check the two options Wrap around and Regular expression

    • Click on the Replace All button

    Et voilà !

    So, from your example text :

    0040067
    0040134
    0040134
    0040134
    0040134
    0040271
    0040271
    

    you get, after replacement :

    0040067
    0040134
    0040271
    

    Notes :

    • First, the syntax (?-s) forces the regex engine to treat any dot . as matching a single standard character, that is, any character except an End of Line character

    • Then, after the beginning of a line ^, the part (.+\R) matches all the standard characters of the current line, followed by its End of Line character(s)

    • As it is enclosed in parentheses, the entire line, with its line-break, is stored as group 1

    • Then, with \1+, the regex engine tries to match the greatest non-null number of consecutive copies of that line

    • In replacement, all that block of lines is just replaced by the single line \1
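
    For readers who would rather script this than use the Replace dialog, here is a minimal Python sketch of the same search and replace. It is only an approximation: Python’s re module has no \R, so the sketch assumes \n line endings, and it leaves out (?-s) because . already stops at line-breaks by default in Python. The name collapse_adjacent_duplicates is invented for this example.

```python
import re

# Rough Python counterpart of: Find (?-s)^(.+\R)\1+   Replace \1
# Assumption: lines end with "\n" (Python's re module has no \R).
ADJACENT_DUPLICATES = re.compile(r"^(.+\n)\1+", re.MULTILINE)


def collapse_adjacent_duplicates(text: str) -> str:
    """Collapse each run of identical consecutive lines into a single line."""
    if text and not text.endswith("\n"):
        text += "\n"  # same idea as adding a line-break after the last line
    return ADJACENT_DUPLICATES.sub(r"\1", text)


sample = "0040067\n0040134\n0040134\n0040134\n0040134\n0040271\n0040271\n"
print(collapse_adjacent_duplicates(sample), end="")
```

    As with the regex itself, only duplicates that sit next to each other are removed, so the input is expected to be sorted first.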

    Best Regards,

    guy038



  • Thanks Guy, that appears to be working for me.
    I feel like I’m going to have to bookmark this page and come back often to remove duplicates. Hopefully TextFX or something similar can be updated for this latest version. It was so much easier to do without having to type in codes and being able to click my way to what I needed.



  • @guy038 That is a killer regex, guy! Love it! Of course, I wish there was a simple “unique” setting like the old way. But your regex is so awesome, I think it will stick in my brain. Ever since I’ve found out about regex search and replace in npp, I use it ALL THE TIME! Which reminds me of a wish list I have for npp: Be able to “name” searches like this and save them in a search library, similar to a macro library, so you don’t have to remember it each time (and don’t want to carry around a separate note file to pull them out of).



  • @guy038 awesome. Thanks, guy038



  • What would the regex be if the duplicate line is indented 4 spaces and not in line like the example above?



  • @Scott-Fredrick-Smith
    I’m not on a PC at present to confirm but I think you would adjust the find what regex to be
    (?-s)^(.+\R)\h{4}\1+
    \h matches any horizontal whitespace character (a space or a tab, for instance), and the 4 in braces means exactly 4 of them. Note that the replacement stays the same, \1, which keeps only the first line.

    Try that and let us know.

    Terry



  • @Terry-R said:

    (?-s)^(.+\R)\h{4}\1+

    Yes, Terry-R that worked beautifully! Awesome! Thanks for the help!



  • hi @Terry-R @guy038 and all

    now this produced a question that interests me, but i’m not able to solve it 😳🤔🤭

    suppose we got data like that (but without the remarks):

    0040067
    0040134
        0040134			// indented 4 spaces
        0040134			// indented 4 spaces
    	0040134			// indented 1 tab
    		0040134		// indented 2 tabs
    0040271
    0040271
    

    with (?-s)^(.+\R)\h{4}\1+ we get this result, as it only covers one single following duplicate, and only if it is indented exactly 4 spaces to the previous, which has to be non indented:

    0040067
    0040134
        0040134
    	0040134
    		0040134
    0040271
    0040271
    

    what if we needed to eliminate all duplicate numbers, without trimming the document first, regardless of their level of indentation ?
    could this be possible using regex ?



  • Hello, @meta-chuh, @terry-r and All,

    No problem ;-))

    So, assuming your data, WITHOUT the comments and their leading spaces

    0040067
    0040134
        0040134           // indented 4 SPACES
        0040134           // indented 4 SPACES
    	0040134           // indented 1 TAB
    		0040134       // indented 2 TABS
    0040271
    0040271
    

    The S/R regex :

    SEARCH (?-s)^\h*(.+\R)(\h*\1)+

    REPLACE \1

    does the job and you get :

    0040067
    0040134
    0040271
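
    For anyone who wants to try this outside N++, here is a rough Python sketch of the same S/R. It rests on two assumptions, because Python’s re module has neither \R nor \h : line endings are \n , and indentation is made of spaces and tabs only, written as [ \t] . The name collapse_indented_duplicates is just an illustration.

```python
import re

# Rough Python counterpart of: SEARCH (?-s)^\h*(.+\R)(\h*\1)+   REPLACE \1
# Assumptions: "\n" line endings, and indentation made of spaces/tabs only.
INDENTED_DUPLICATES = re.compile(r"^[ \t]*(.+\n)(?:[ \t]*\1)+", re.MULTILINE)


def collapse_indented_duplicates(text: str) -> str:
    """Collapse consecutive lines that differ only in their leading blanks."""
    return INDENTED_DUPLICATES.sub(r"\1", text)


sample = (
    "0040067\n"
    "0040134\n"
    "    0040134\n"
    "    0040134\n"
    "\t0040134\n"
    "\t\t0040134\n"
    "0040271\n"
    "0040271\n"
)
print(collapse_indented_duplicates(sample), end="")
```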
    

    If we start, for instance, with the text below ( again, WITHOUT any comment ) :

    0040067
             0040134    // indented 8 SPACES
        0040134         // indented 4 SPACES
    	0040134         // indented 1 TAB
    		0040134     // indented 2 TABS
    0040271
    0040271
    

    We get the same result :

    0040067
    0040134
    0040271
    

    However, we should remember that the generic S/R (?-s)^(.+\R)\1+ is supposed to be run on pre-sorted data ! So, @meta-chuh, if we sort your data, we obtain :

    		0040134    // indented 2 TABS
    	0040134        // indented 1 TAB
        0040134        // indented 4 SPACES
        0040134        // indented 4 SPACES
    0040067
    0040134
    0040271
    0040271
    

    Now, applying my regex S/R against this sorted text, ( still WITHOUT the comments ), would result in :

    0040134
    0040067
    0040134
    0040271
    

    Unfortunately, we see that there still are two occurrences of the 0040134 number ! So, finally, the best would be :

    • First, get rid of all leading blank characters, with the regex : SEARCH ^\h+ and REPLACE EMPTY

    • Perform an ascending alphabetic sort

    • Run the generic S/R :

    SEARCH (?-s)^(.+\R)\1+

    REPLACE \1
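
    The three steps above can also be sketched in Python, for those who prefer a script. This is a sketch under assumptions, not a description of what N++ does internally: [ \t] stands in for \h , lines are joined back with \n , the sort is a plain code-point sort, and strip_sort_dedupe is a made-up name.

```python
import re


def strip_sort_dedupe(text: str) -> str:
    """Trim leading blanks, sort ascending, then drop adjacent duplicates."""
    # Step 1: SEARCH ^\h+   REPLACE (empty); [ \t] stands in for \h here.
    trimmed = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)
    # Step 2: ascending sort of the lines (plain code-point order).
    lines = sorted(trimmed.splitlines())
    sorted_text = "".join(line + "\n" for line in lines)
    # Step 3: SEARCH (?-s)^(.+\R)\1+   REPLACE \1
    return re.sub(r"^(.+\n)\1+", r"\1", sorted_text, flags=re.MULTILINE)


sample = (
    "0040067\n"
    "0040134\n"
    "    0040134\n"
    "    0040134\n"
    "\t0040134\n"
    "\t\t0040134\n"
    "0040271\n"
    "0040271\n"
)
print(strip_sort_dedupe(sample), end="")
```

    Trimming before sorting is what brings the indented copies next to the non-indented one, so the last step can remove them all.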

    Cheers,

    guy038



  • @Matt-Czugała TextFX 64 bit version is now available as a direct download from developer’s site.



  • It appears that an upcoming release of Notepad++ is gonna have a Remove Duplicate Lines menu command, but it appears crippled (or maybe just poorly named) as it will only remove duplicate lines that are on lines right next to each other. Okay, so that’s a nice functionality, but it isn’t gonna be (all of) what people want/expect in such a command… And I suspect people trying to help out with support will see questions about it, a lot.
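
    To illustrate the difference, here is a small Python sketch of the behaviour people probably expect: every repeated line is dropped, even when the copies are far apart, and the first occurrence keeps its place. This is not how Notepad++ implements its command; unique_lines is a hypothetical helper written only for this example.

```python
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare lines without their line ending
        if key not in seen:
            seen.add(key)
            yield line


sample = ["0040134\n", "0040067\n", "0040134\n", "0040271\n", "0040067\n"]
print("".join(unique_lines(sample)), end="")
```

    Passing an open file object instead of the list lets it read a large file line by line, although the set of seen lines still grows with the number of distinct lines.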



  • @V-S-Rawat

    i hope that @Matt-Czugała will return to see this in 2020 or so 😉

    reader’s note: it’s not the original textfx developer that compiled textfx to x64.

    quote from @chcg in a recent post:

    If you like experiments, you might want to test
    https://github.com/HQJaTu/NPPTextFX/blob/VS2017-x64/bin/x64/NppTextFX.dll
    for x64.

    here’s HQJaTu’s github page for textfx x64 in case anyone wants to have a look around (entry point: x64 dll binary download page): https://github.com/HQJaTu/NPPTextFX/tree/VS2017-x64/bin/x64



  • @guy038 said:

    0040067
    0040134 // indented 8 SPACES
    0040134 // indented 4 SPACES
    0040134 // indented 1 TAB
    0040134 // indented 2 TABS

    Ok, let’s say that I have this scenario.

    0040134 DRAWING TITLE 1
    0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
    0040134 DRAWING TITLE 1
    0040135 DRAWING TITLE 2
    0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)

    Trying to figure out a Regex that will only match the number and ignore the title on the same line, but delete the line below with the slightly modified title, and where it still retains the first line.

    0040134 DRAWING TITLE 1
    0040135 DRAWING TITLE 2



  • @Scott-Fredrick-Smith ,

    It would have been nice if you’d included what regex you tried. Showing your effort lets the helpers know that you’re willing to put in effort – plus it makes the helpers’ jobs easier, since they know what’s already been tried and failed, and can get a glimpse as to what knowledge you already have in the domain.

    Assuming an expanded data set,

    0040134 DRAWING TITLE 1
    0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
    0040134 DRAWING TITLE 1
    0040999 DRAWING TITLE 3
    0040999 SUPER MODIFIED TITLE
    0040999 THIRD TITLE FOR SAME NUMBER
    0040135 DRAWING TITLE 2
    0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
    

    … that you want to convert to

    0040134 DRAWING TITLE 1
    0040999 DRAWING TITLE 3
    0040135 DRAWING TITLE 2
    

    Note: Assumptions

    • no whitespace before the number at the beginning of the line
    • any “near duplicates” would be on line(s) immediately following, and you wanted to delete them even when there’s more than one duplicate, keeping only the first.
    • “slightly modified” meant that anything could come on the line after the number and it would still match as long as the number was the same; “slightly modified” is too ambiguous for a regular expression. (If you really want to match anything that has a heuristic to determine the “distance” between two strings, like the Levenshtein distance, you will have to use a programming language).

    I used

    • Search Mode = regular expression
    • Find = (?-s)^(\d+\b)(.*\R)(\1.*(?:\Z|\R))+
    • Replace = \1\2

    Quick explanation:

    • (?-s) = turn off .-matches-newline
    • ^ = the match starts at the beginning of the line
    • (\d+\b) = match one or more digits (ending in a “boundary”, which is a zero-width transition between word characters, such as digits, and non-word characters, such as the space) and store in the first numbered group, which can be referenced later as \1 (or $1)
    • (.*\R) = match any characters (.*) coming after the number, through the EOL sequence (\R) for that line, and store in group \2
    • (\1.*(?:\Z|\R))+ = complicated.
      • (...)+ = match one or more lines that meet the condition inside the parens; it will store it in a group, though we aren’t using this group later
      • \1 = it will start with the same number as was matched on the first line.
      • .*(...) = followed by zero or more characters, followed by a sequence defined inside the parens
      • (?:...) = this inner group won’t be saved to a numbered group
      • ...|... = the left side or the right side will match
      • \Z = match the end of the document (if the last line doesn’t have a newline sequence, we still want it to match)
      • or \R = match the EOL newline sequence for that row

    And replace with \1\2 which means the contents of the first two parenthesis-groups. In other words, the number plus the remaining contents of the first line with that number.
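
    If you want to experiment outside Notepad++, the same search and replace can be approximated in Python. This is a sketch under assumptions: Python’s re module has no \R, so \n line endings are assumed, (?-s) is dropped because . does not match newlines by default there, and keep_first_per_number is a name invented for the example.

```python
import re

# Rough Python counterpart of:
#   Find = (?-s)^(\d+\b)(.*\R)(\1.*(?:\Z|\R))+   Replace = \1\2
# Assumption: lines end with "\n" (Python's re module has no \R).
SAME_NUMBER_RUN = re.compile(r"^(\d+\b)(.*\n)(?:\1.*(?:\Z|\n))+", re.MULTILINE)


def keep_first_per_number(text: str) -> str:
    """Keep only the first of consecutive lines starting with the same number."""
    return SAME_NUMBER_RUN.sub(r"\1\2", text)


sample = (
    "0040134 DRAWING TITLE 1\n"
    "0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)\n"
    "0040134 DRAWING TITLE 1\n"
    "0040999 DRAWING TITLE 3\n"
    "0040999 SUPER MODIFIED TITLE\n"
    "0040999 THIRD TITLE FOR SAME NUMBER\n"
    "0040135 DRAWING TITLE 2\n"
    "0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)\n"
)
print(keep_first_per_number(sample), end="")
```

    One thing to watch for, in the sketch and in the regex it mirrors: because \1 is followed directly by .* , a following line whose number merely begins with the same digits (say 00401345) would also be treated as a duplicate.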

    If this isn’t quite right, you will have to give a more-complete example, which shows instances which should be changed and which shouldn’t, including anything that I got wrong above. You will also need to try to fix the regex yourself, and show us the modified regex you tried, and why you tried it (what you thought it would do), and show us the results that it gets you, compared to the results you wanted. See FYI below for more info on how to format your example text so it isn’t lost in translation, and where to go for regex documentation.

    -----
    FYI:

    This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.
    If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.
    Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.



  • I have submitted a plugin for the 32-bit Plugin Admin:
    Remove_dup_lines

    download link



  • @gurikbal-singh

    it says “all checks have failed” if you go to your link https://github.com/notepad-plus-plus/nppPluginList/pull/59

    how does it differ from the built-in “Remove Duplicate Lines” feature added in the following commit ?
    https://github.com/notepad-plus-plus/notepad-plus-plus/commit/51f10bdba56a415d42eb829b27a08955cb7db0dd



