Find and Display *All* Duplicate Lines
-
@Yaron said in Find and Display *All* Duplicate Lines:
How can I find duplicate lines and display all occurrences of those lines?
That is: if I have 2 identical lines, I’d like the 2 occurrences to be displayed in the Search Results window.
I believe there is no straightforward way to do this with built-in Notepad++ tools. The problem is a technical one with regular expressions: look-backs must be fixed length. You can match a line and look forward to see if the same thing can be matched later in the file, but looking at the last occurrence, it’s not possible to look backward to see if there was an earlier one.
That doesn’t mean the problem can’t be solved; it just won’t be a simple, one-step thing, and it will take some inventiveness. It’s hard to make a good suggestion without knowing a little more about the context:
What is the practical problem you are trying to solve?
What sort of data (size, format) is involved?
Is this a one-time problem, or something you’ll need to do often?
What is your skill set? (Are you a programmer? Are you familiar with command-line tools? Do you happen to know Python? — there is a Python scripting plugin for Notepad++ and several folks here who know a lot about how to use it.)
Sorry to have to answer with questions, but I think the only way we can help you is to know a little more about what you need to accomplish.
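The look-ahead limitation described above can be reproduced outside Notepad++. The following is only a sketch using Python's `re` module (not the Boost engine that Notepad++ uses, so details may differ); the names `HAS_LATER_COPY` and `lines_with_later_copy` are made up for illustration. It shows that a line can be matched when an identical line follows it, while the last occurrence of each set is never reported.

```python
import re

# A non-empty line that is followed, somewhere later, by an identical line.
# Look-ahead may span any distance, so "is there a later copy?" is expressible;
# the reverse question would need a variable-length look-behind.
HAS_LATER_COPY = re.compile(r"^(.+)$(?=[\s\S]*?^\1$)", re.MULTILINE)

def lines_with_later_copy(text):
    """Return the 1-based numbers of lines that reappear further down."""
    return [text.count("\n", 0, m.start()) + 1
            for m in HAS_LATER_COPY.finditer(text)]

sample = "A\nB\nA\nC\nA\nB"
print(lines_with_later_copy(sample))  # [1, 2, 3]
```

Lines 5 and 6 of the sample are duplicates too, but nothing follows them, so the look-ahead cannot see them.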
-
Thank you both for the detailed and helpful replies. I appreciate it.
I was playing with NPP “Remove Duplicate Lines” commands, and wondered if it was possible to see those “Duplicate Lines” before removing them.
A built-in command for that might be useful.
I’m not interested in a Python script.
-
@Yaron said in Find and Display *All* Duplicate Lines:
That is: if I have 2 identical lines, I’d like the 2 occurrences to be displayed in the Search Results window.
A built-in command for that might be useful.
It sounds like a new feature request, details on providing said request found HERE. :-)
I must say that I do like the idea; an additional command button for it could possibly be added on the Find window?
If a line differs from another line only in its line-ending character(s), is it still a duplicate line? If not, can the final line, when it doesn’t have a line-ending, ever be a duplicate of another line?
I presume lines with no non-line-ending content (empty lines) would not be considered?
So many questions…
-
-
Another question:
Should duplicate line sets appear adjacent in the output, even if they are interleaved in the original source? (I’d think so)
Thus:
A B A C A B C
would appear in the output as:
A A A B B C C
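As a rough illustration of that grouping (a sketch only, with the hypothetical helper name `group_duplicates`; it says nothing about how a built-in command would be implemented), each repeated line can be mapped to the positions where it occurs, in order of first appearance:

```python
from collections import OrderedDict

def group_duplicates(lines):
    """Map each repeated line to its 1-based positions, in order of first appearance."""
    positions = OrderedDict()
    for number, content in enumerate(lines, start=1):
        positions.setdefault(content, []).append(number)
    # keep only the content that occurs more than once
    return OrderedDict((c, n) for c, n in positions.items() if len(n) > 1)

groups = group_duplicates("A B A C A B C".split())
print(" ".join(c for c, n in groups.items() for _ in n))  # A A A B B C C
```

The position lists (`[1, 3, 5]` for A, and so on) are what a results panel would need in order to jump back to the source lines.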
-
An alternative almost-solution to the original need of
How can I find duplicate lines and display all occurrences of those lines?
could be to:
- save the file (FILE1)
- use the Remove Duplicate Lines command
- do File > Save a Copy As… to obtain FILE2
- undo the Remove Duplicate Lines command
- use a file-compare utility/plugin on FILE1 and FILE2, which will create an effective “display” of the duplicated lines
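What such a comparison would highlight can be sketched in a few lines of Python. This assumes that Remove Duplicate Lines keeps the first occurrence of each line and deletes the later ones; the function name `removed_by_dedup` is hypothetical.

```python
def removed_by_dedup(lines):
    """Simulate a keep-first 'remove duplicates' pass and return the 1-based
    numbers of the removed lines, i.e. what a FILE1-vs-FILE2 compare would flag."""
    seen = set()
    removed = []
    for number, content in enumerate(lines, start=1):
        if content in seen:
            removed.append(number)   # a later copy: gone from FILE2
        else:
            seen.add(content)        # first occurrence: present in both files
    return removed

print(removed_by_dedup(["A", "B", "A", "C", "A", "B", "C"]))  # [3, 5, 6, 7]
```

Lines 1, 2 and 4 (the first A, B and C) exist in both files, so a compare tool has no reason to mark them.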
-
Hi, @yaron, @coises, @mkupper, @alan-kilborn and All,
@alan-kilborn, in the meantime, I saw your post; just look at my comment at the end of this post !
@yaron, I’ve tried to think about your problem but cannot find an easy solution :-(( I mean… with only the N++ features !
To begin with, I suppose that you want to get all the lines which are the duplicate of some other lines AND also all the original lines, too !
Let’s start with a very simple example :
From my old XP machine :
- I pasted the 39 enhancements and bug fixes of the N++ v7.9.2 change.log file, without their leading numbers, into a file named Dup.txt
- I added 7 duplicate lines to get a 46-line file ( so, 34 single lines + 3 lines in 2 ex. + 2 lines in 3 ex. )
- Then I modified the order of some lines, using the Ctrl + Shift + Up arrow and the Ctrl + Shift + Down arrow shortcuts !
- Finally, in order to mimic the Search results panel appearance, I added the part \tLine\x20\d+:\x20 in front of each line.
Thus, I ended up with the following Dup.txt file contents :
	Line 1: Fix regression of auto-Indent.
	Line 2: Add custom URI schemes ability.
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 4: Improve URL parser: fix apostrophe in an URL issue.
	Line 5: Add context menu with "Copy link" ability.
	Line 6: Add color samples on menu items for styling features.
	Line 7: Add "-settingsDir" argument for overriding default settings path.
	Line 8: Fix crash while exit command issued by plugin.
	Line 9: Fix several bugs of PHP parser rule for function list.
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 11: Move "Normal Text" to top in Languages Menu.
	Line 12: Add new API NPPM_GETSETTINGSONCLOUDPATH for plugins.
	Line 13: Add an option for displaying constant line number width.
	Line 14: Fix function list is empty with new user profile in the same PC issue.
	Line 15: Fix dockable panels display issue in RTL direction.
	Line 16: Fix single-quoted string being badly recognized as attribute value in XML.
	Line 17: Fix docked panels appear with "-nosession" cmd line parameters.
	Line 18: Improve text selection after Replace All In Selection operation.
	Line 19: Add the number of total documents on windows dialog's title bar.
	Line 20: Fix scroll to last line problem after main window resizing.
	Line 21: Fix Plugin admin display UTF-8 issue in its description.
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 23: Fix Search result line number highlighting inaccurate issue.
	Line 24: Make "Line" preceding each line number on Search Results translatable.
	Line 25: Fix menu check marks not being removed after closing "Clipboard History" and "Character Panel" panels.
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 27: Fix command line arguments -p, -n & -c negative value's undefined behaviour.
	Line 28: Add new Margin/Border/Edge sub-page in Preferences.
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 30: Fix folder icon display issue in "Folder as Workspace" after "Expand/Collapse All".
	Line 31: Make "Clipboard History" and "Character Panel" togglable.
	Line 32: Fix Find in found results dialog launch failure after macro execution.
	Line 33: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 34: Disallow Goto dlg offset option from moving to position inside multi-byte char or between CR and LF.
	Line 35: Fix "Go to..." dialog wrong Offset value in empty files.
	Line 36: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 37: Improve indent guidelines on non-brace control block languages.
	Line 38: Prevent names of untitled tabs from duplication.
	Line 39: Add tooltips for Folder as Workspace 3 commands.
	Line 40: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 41: Fix "SCI_NEWLINE" inside a macro not working issue.
	Line 42: Improve URL parser: fix apostrophe in an URL issue.
	Line 43: Fix bug where search-results won't open 'new 1' file.
	Line 44: Fix tab close button remain pushed issue.
	Line 45: Enhance ghost typing command line argument feature - using white space directly instead of %20.
	Line 46: Fix dockable panels display issue in RTL direction.
And, here is the picture showing the 5 lines with duplicates and the 7 duplicated lines
Presently, if we use the following regex search :
SEARCH (?-is)^\tLine \d+: (.+)\R(?=(?s).*?^\tLine \d+: \1(?:\R|\z))
We get this Search results panel :
Search "(?-is)^\tLine \d+: (.+)\R(?=(?s).*?^\tLine \d+: \1(?:\R|\z))" (7 hits in 1 files of 1 searched)
  D:\@@\792\Dup.txt (7 hits)
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 4: Improve URL parser: fix apostrophe in an URL issue.
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 15: Fix dockable panels display issue in RTL direction.
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.
And you, you would like this kind of output :
Search "(?-is)^\tLine \d+: (.+)\R(?=(?s).*?^\tLine \d+: \1(?:\R|\z))" (12 hits in 1 files of 1 searched)
  D:\@@\792\Dup.txt (12 hits)
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 4: Improve URL parser: fix apostrophe in an URL issue.
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 15: Fix dockable panels display issue in RTL direction.
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 33: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 36: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 40: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 42: Improve URL parser: fix apostrophe in an URL issue.
	Line 46: Fix dockable panels display issue in RTL direction.
Or, perhaps, this other one :
Search "(?-is)^\tLine \d+: (.+)\R(?=(?s).*?^\tLine \d+: \1(?:\R|\z))" (12 hits in 1 files of 1 searched)
  D:\@@\792\Dup.txt (12 hits)
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 33: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 4: Improve URL parser: fix apostrophe in an URL issue.
	Line 42: Improve URL parser: fix apostrophe in an URL issue.
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 40: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 15: Fix dockable panels display issue in RTL direction.
	Line 46: Fix dockable panels display issue in RTL direction.
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 36: Prevent corruption possibility when using -p command line parameter in a UTF file.
But, in this case, you’re breaking the classical Search results panel appearance, as the lines are no longer displayed in natural order !
Why, @yaron, is your problem particularly difficult ?
Well, as you want to immediately see the results in the Search results panel, this implies that the current file must not be changed in any way; otherwise, the leading numbering would necessarily be wrong :-((
Because of the length of this post, continue to the next one !
guy038
-
-
Hi, @yaron, @coises, @mkupper, @alan-kilborn and All,
I thought about another method, but this one would require adding an enhanced regex search mode ! Let me explain : once the search process finds the first occurrence, the next search would start ONE char AFTER the beginning of the present matched range ( and NOT right AFTER the end of the present matched range ) !
PRESENT INSTALLED search process :

SEARCH :
•---------------------------#################################------------------#####################################------
^                           FIRST occurrence found
|
FIRST search start

•---------------------------#################################|-----------------#####################################------
                                                             ^                 SECOND occurrence found
                                                             |
                                                             SECOND search start

POSSIBLE ADDED search process :

SEARCH :
•---------------------------#################################------------------#####################################------
^                           FIRST occurrence found
|
FIRST search start

•----------------------------|----------------------#####################-------------------------------------------------
                             ^                      SECOND occurrence found
                             |
                             SECOND search start
Let’s experiment with this new feature :
- Open the Dup.txt file
- Use, first, the Search > Bookmark > Clear All Bookmarks option
- Open the Find dialog
- SEARCH (?-is)^\tLine \d+: (.+)\R(?s:.*?)^\tLine \d+: \1(?:\R|\z)
IMPORTANT : This search regex is different from the ones above !
- Click on the Find Next button
=> The first occurrence found is a stream multi-lines selection, from line 3 to line 22
- Close the Find dialog ( ESC )
- Manually bookmark the first and last line of that first selection, if not already the case
- Now, using this new enhanced regex mode, the next search would start ONE character ONLY AFTER the beginning of the present match
- To simulate this unsupported feature, simply move the caret one char after the beginning of the match ( so right between the TAB char and the Line 3: string )
- Hit the F3 key
=> The second occurrence found is a stream multi-lines selection, from line 4 to line 42
- Again, manually bookmark the first and last line of this second selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 4: string
- Hit the F3 key
=> The third occurrence found is a stream multi-lines selection, from line 10 to line 29
- Again, manually bookmark the first and last line of this third selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 10: string
- Hit the F3 key
=> The fourth occurrence found is a stream multi-lines selection, from line 15 to line 46
- Again, manually bookmark the first and last line of this fourth selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 15: string
- Hit the F3 key
=> The fifth occurrence found is a stream multi-lines selection, from line 22 to line 33
- Again, manually bookmark the first and last line of this fifth selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 22: string
- Hit the F3 key
=> The sixth occurrence found is a stream multi-lines selection, from line 26 to line 36
- Again, manually bookmark the first and last line of this sixth selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 26: string
- Hit the F3 key
=> The seventh occurrence found is a stream multi-lines selection, from line 29 to line 40
- Again, manually bookmark the first and last line of this seventh selection, if not already the case
- Again, move the caret just one char after the beginning of line, so right between the TAB char and the Line 29: string
- Hit the F3 key
=> This time, no more occurrence is found
- We would get a file containing 12 bookmarked lines
- Finally, with the Search > Bookmark > Copy Bookmarked Lines option and a paste action in a new tab, we would end up with :
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 4: Improve URL parser: fix apostrophe in an URL issue.
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 15: Fix dockable panels display issue in RTL direction.
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 33: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
	Line 36: Prevent corruption possibility when using -p command line parameter in a UTF file.
	Line 40: Fix find in files failure issue due to directory path with leading/trailing spaces.
	Line 42: Improve URL parser: fix apostrophe in an URL issue.
	Line 46: Fix dockable panels display issue in RTL direction.
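The manual caret-moving procedure can be imitated with a short loop. This is a sketch under stated assumptions: it uses Python's `re` module rather than the Notepad++ engine, it assumes `\n` line endings, it drops the `\tLine \d+: ` prefix for brevity, and the names `PAIR` and `lines_with_duplicates` are invented for the example. The key point is that each new search restarts one character after the START of the previous match instead of after its end.

```python
import re

# A full line, then (lazily) anything, then an identical full line.
PAIR = re.compile(r"^(.+)\n(?s:.*?)^\1$", re.MULTILINE)

def lines_with_duplicates(text):
    """Collect the first and last line number of every overlapping match."""
    marked = set()                      # plays the role of the bookmarks
    pos = 0
    while True:
        m = PAIR.search(text, pos)
        if m is None:
            break
        first = text.count("\n", 0, m.start()) + 1
        last = text.count("\n", 0, m.end()) + 1
        marked.update((first, last))
        pos = m.start() + 1             # restart just after the match START
    return sorted(marked)

print(lines_with_duplicates("A\nB\nA\nX\nB"))  # [1, 2, 3, 5]
```

Because every restart rescans the remainder of the text, this is potentially slow on large files; it is meant to illustrate the idea, not to be efficient.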
BTW, the @alan-kilborn solution, using the ComparePlus plugin, seems interesting. After following his instructions, with my own set of data, I got this screen :
And, unfortunately, it does not tell us anything about the 5 lines 3, 4, 10, 15 and 26, of the original file, which have duplicates !
In summary, a script could probably find all lines of the current file together with their duplicates, but it would not use the Search results panel at all !
However, if we consider the current file in the main view and the list of the duplicate lines, found by the script, like the above one, in the secondary view, I suppose that an automatic jump to a specific duplicate line could still be possible !
A lot of work, anyway, beyond my own skills !!
Best Regards
guy038
-
-
@Alan-Kilborn said in Find and Display *All* Duplicate Lines:
An alternative almost-solution to the original need of
How can I find duplicate lines and display all occurrences of those lines?
could be to:
- save the file (FILE1)
- use the Remove Duplicate Lines command
- do File > Save a Copy As… to obtain FILE2
- undo the Remove Duplicate Lines command
- use a file-compare utility/plugin on FILE1 and FILE2, which will create an effective “display” of the duplicated lines
That would fail to show the first of each set of duplicate lines.
-
@Yaron said in Find and Display *All* Duplicate Lines:
That is: if I have 2 identical lines, I’d like the 2 occurrences to be displayed in the Search Results window.
Something to note is that after you do a Find All, as soon as you add or remove characters, everything in the search results window following the point of the modification no longer addresses the correct position. So though you might be able to see all the duplicates if this function existed, you still wouldn’t be able to easily do things like browse through and decide which ones of a set of duplicates should be removed, or add a suffix to an identifier in some duplicates that should not be removed, without things becoming confusing very rapidly.
-
@Alan-Kilborn said in Find and Display *All* Duplicate Lines:
If a line differs from another line only in its line-ending character(s), is it still a duplicate line? If not, can the final line, when it doesn’t have a line-ending, ever be a duplicate of another line?
The current Remove Duplicate Lines menu option also looks at the line ending character(s).
I constructed a test using:
a<crlf>
b<crlf>
c<crlf>
<crlf>
a<cr>
b<cr>
c<cr>
<cr>
a<lf>
b<lf>
c<lf>
<lf>
Nothing was removed from that set of lines. I also removed the two final <lf>, tried Remove Duplicate Lines, and nothing was removed.
I was thinking that a “Show Duplicate Lines” tool would use a regexp pattern of (?-si)^.+ as I suspect most people are not interested in duplicate empty lines and likely don’t care about the line ending.
The existing Remove Duplicate Lines seems to use (?-si)^.*\R meaning it’s a case-sensitive compare that also looks at the line ending.
I thought about making the pattern user configurable but that could lead to confusion about whether the tool would inspect just the matched patterns for duplicates or the entire lines containing those patterns.
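The difference between the two comparison rules can be sketched in Python on the same test data. This is an illustration only, not a description of how Notepad++ implements either command; the helper name `duplicated` is hypothetical.

```python
from collections import Counter

def duplicated(lines):
    """Return, sorted, the distinct values that occur more than once."""
    counts = Counter(lines)
    return sorted(line for line, n in counts.items() if n > 1)

# a, b, c and an empty line, once with each line-ending style
raw = "a\r\nb\r\nc\r\n\r\na\rb\rc\r\ra\nb\nc\n\n"

with_endings = raw.splitlines(keepends=True)          # compare like ^.*\R
without_endings = [l for l in raw.splitlines() if l]  # compare like ^.+

print(duplicated(with_endings))     # []
print(duplicated(without_endings))  # ['a', 'b', 'c']
```

With the endings included, every line is unique, which matches the observation that nothing was removed; once endings and empty lines are ignored, a, b and c each show up as duplicated.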
-
Something to note is that after you do a Find All, as soon as you add or remove characters, everything in the search results window following the point of the modification no longer addresses the correct position.
The long-ago discussed workaround is to start at the bottom of the Search results list, and work your way towards the top.
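Why bottom-to-top works can be shown with a tiny sketch (plain Python lists standing in for the document; `delete_lines` is a made-up helper): deleting the highest-numbered hit first never shifts the lines that are still waiting to be processed.

```python
def delete_lines(lines, hit_numbers):
    """Delete the given 1-based line numbers, highest first, so that the
    numbers not yet processed still point at the right lines."""
    result = list(lines)
    for number in sorted(hit_numbers, reverse=True):
        del result[number - 1]
    return result

doc = ["A", "B", "A", "C", "A", "B", "C"]
print(delete_lines(doc, [3, 5, 6, 7]))  # ['A', 'B', 'C']
```

Processing the same hits top-down would require adjusting every remaining number after each deletion.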
An alternative almost-solution
That would fail to show the first of each set of duplicate lines.
And, unfortunately, it does not tell us anything about the 5 lines 3, 4, 10, 15 and 26, of the original file, which have duplicates !
Which is why I dubbed it (file compare) an almost solution.
The current Remove Duplicate Lines menu option also looks at the line ending character(s)
Well, since we’re in “fantasy land”, scoping out a proposed new feature, we can take some liberty and define its best way of working.
-
For the truly stubborn (and I duck my head when I say that), there is a solution of sorts:
Throughout, beware of the match case setting, if it matters for your purpose. Also, the following will not work if your text contains the sequence \E anywhere. Edit: @Alan-Kilborn has pointed out a character count limitation which will also cause failure.
First, open Search | Mark…; enter:
Find what: ^([^\r\n]++)(?=[\s\S]*?^\1$)
and then click Mark All, Copy Marked Text and Clear all marks. Close the dialog.
Open a new tab and paste, then use Edit | Line Operations | Remove Duplicate Lines.
Type ^(\Q at the beginning of the first line. Go to the end of the document, backspace to delete the empty line, and type \E)$ at the end of the last, non-empty line. Do not type Enter after this sequence.
Open Search | Replace…; enter:
Find what: \R
Replace with: \\E\|\\Q
and click Replace All.
Close the dialog, then Select All and Copy.
Return to the original document and open Search | Find…. Paste into the Find what box and click Find All in Current Document.
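The same idea, one big alternation of the literal duplicated lines, can be sketched in Python. Assumptions to note: this uses Python's `re` with `re.escape` in place of `\Q…\E`, it expects `\n` line endings, and `find_all_duplicates` is a name invented here. It is an illustration of the technique, not a replacement for the steps above.

```python
import re
from collections import Counter

def find_all_duplicates(text):
    """Return (line number, content) for every occurrence of a repeated line."""
    counts = Counter(line for line in text.splitlines() if line)
    repeated = [line for line in counts if counts[line] > 1]
    if not repeated:
        return []
    # ^(?:literal1|literal2|...)$ with every literal escaped
    pattern = re.compile(
        "^(?:" + "|".join(re.escape(line) for line in repeated) + ")$",
        re.MULTILINE)
    return [(text.count("\n", 0, m.start()) + 1, m.group())
            for m in pattern.finditer(text)]

print(find_all_duplicates("A\nB\nA\nC\nA\nB"))
# [(1, 'A'), (2, 'B'), (3, 'A'), (5, 'A'), (6, 'B')]
```

Unlike the look-ahead search, the alternation reports the last occurrence of each set as well, because it no longer depends on what follows the line.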
-
@Coises said in Find and Display *All* Duplicate Lines:
Paste into the Find what box
I haven’t tried the duck-my-head solution :-) but at the paste step I quoted I’d also add “hope that what you’re pasting is 2046 characters or less”.
-
Hi @guy038 and all,
Thank you for the creative suggestions. I appreciate it.
With ComparePlus:
Save the file (if not already saved).
Remove Duplicate Lines.
ComparePlus -> Diff since last Save.
-
Hi, @yaron, @coises, @mkupper, @alan-kilborn and All,
First of all, a regex point that I’d never thought of before ! Even if no character class containing letters exists in your entire search regex, the use of the -i or i modifier does modify its behaviour regarding case matching !
For example, using the simple INPUT text, in a new tab :
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
The regex (?-is)^(.+)(?=\R(?s).*?^\1$) does match the first line
But with that INPUT text, where I changed the case of the part between parentheses of the second line :
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
Fix Search result's text direction (rtl) not always synchronized with main edit zone's one issue.
The regex would not match anything, despite the fact that no letter seems directly involved in this search regex !
So, @coises, you may add the leading modifier (?-i) or (?i) in front of your regex ^([^\r\n]++)(?=[\s\S]*?^\1$) :-)
@coises, I’m rather stubborn too and I’ve found an almost easy way to get the same results as you
So, if I take back my previous INPUT text, containing some duplicate lines
Fix regression of auto-Indent.
Add custom URI schemes ability.
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
Improve URL parser: fix apostrophe in an URL issue.
Add context menu with "Copy link" ability.
Add color samples on menu items for styling features.
Add "-settingsDir" argument for overriding default settings path.
Fix crash while exit command issued by plugin.
Fix several bugs of PHP parser rule for function list.
Fix find in files failure issue due to directory path with leading/trailing spaces.
Move "Normal Text" to top in Languages Menu.
Add new API NPPM_GETSETTINGSONCLOUDPATH for plugins.
Add an option for displaying constant line number width.
Fix function list is empty with new user profile in the same PC issue.
Fix dockable panels display issue in RTL direction.
Fix single-quoted string being badly recognized as attribute value in XML.
Fix docked panels appear with "-nosession" cmd line parameters.
Improve text selection after Replace All In Selection operation.
Add the number of total documents on windows dialog's title bar.
Fix scroll to last line problem after main window resizing.
Fix Plugin admin display UTF-8 issue in its description.
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
Fix Search result line number highlighting inaccurate issue.
Make "Line" preceding each line number on Search Results translatable.
Fix menu check marks not being removed after closing "Clipboard History" and "Character Panel" panels.
Prevent corruption possibility when using -p command line parameter in a UTF file.
Fix command line arguments -p, -n & -c negative value's undefined behaviour.
Add new Margin/Border/Edge sub-page in Preferences.
Fix find in files failure issue due to directory path with leading/trailing spaces.
Fix folder icon display issue in "Folder as Workspace" after "Expand/Collapse All".
Make "Clipboard History" and "Character Panel" togglable.
Fix Find in found results dialog launch failure after macro execution.
Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.
Disallow Goto dlg offset option from moving to position inside multi-byte char or between CR and LF.
Fix "Go to..." dialog wrong Offset value in empty files.
Prevent corruption possibility when using -p command line parameter in a UTF file.
Improve indent guidelines on non-brace control block languages.
Prevent names of untitled tabs from duplication.
Add tooltips for Folder as Workspace 3 commands.
Fix find in files failure issue due to directory path with leading/trailing spaces.
Fix "SCI_NEWLINE" inside a macro not working issue.
Improve URL parser: fix apostrophe in an URL issue.
Fix bug where search-results won't open 'new 1' file.
Fix tab close button remain pushed issue.
Enhance ghost typing command line argument feature - using white space directly instead of %20.
Fix dockable panels display issue in RTL direction.
I used the following regex S/R, which adds a ¤ character at the end of any line which has duplicate(s) in the current file
Of course, you may use any other char which is totally absent from your present file
So :
- Move the caret to the very beginning of the current file
- Open the Replace dialog ( Ctrl + H )
- Untick all box options
- SEARCH (?-is)^(.+)¤?\R((?:.*\R)*?)\1(?<!¤)(?=\R|\z)
- REPLACE \1¤\r\n\2\1¤
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button, repeatedly, till you get the message Replace All: 0 occurrences were replaced from carat to end-of-file
- As you can see, this regex is run repeatedly, till there is no unmarked duplicate line anymore and, each time, there is :
  - A search for a complete block of lines between two identical lines
  - A replacement of the whole block, with a ¤ character added at the very end of the two identical lines which surround that block
- Now, just switch to the Find dialog
- Move the caret back to the very beginning of the current file, if necessary
- SEARCH ¤$
- Click on the Find All in Current Document button
=> You should get this expected text in the Search results panel :
Search "¤$" (12 hits in 1 files of 1 searched)
  D:\@@\792\Test_AP.txt (12 hits)
	Line 3: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.¤
	Line 4: Improve URL parser: fix apostrophe in an URL issue.¤
	Line 10: Fix find in files failure issue due to directory path with leading/trailing spaces.¤
	Line 15: Fix dockable panels display issue in RTL direction.¤
	Line 22: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.¤
	Line 26: Prevent corruption possibility when using -p command line parameter in a UTF file.¤
	Line 29: Fix find in files failure issue due to directory path with leading/trailing spaces.¤
	Line 33: Fix Search result's text direction (RTL) not always synchronized with main edit zone's one issue.¤
	Line 36: Prevent corruption possibility when using -p command line parameter in a UTF file.¤
	Line 40: Fix find in files failure issue due to directory path with leading/trailing spaces.¤
	Line 42: Improve URL parser: fix apostrophe in an URL issue.¤
	Line 46: Fix dockable panels display issue in RTL direction.¤
Voila !
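For comparison, the end state of that marker technique can be reproduced in a single counting pass. To be clear, this sketch swaps the repeated regex replacement for a `Counter`, so it shows the intended result rather than the regex mechanics; `MARK`, `mark_duplicates` and `search_results` are hypothetical names.

```python
from collections import Counter

MARK = "\u00a4"  # the ¤ character; assumed to be absent from the text

def mark_duplicates(lines, mark=MARK):
    """Append the marker to every non-empty line that occurs more than once."""
    counts = Counter(lines)
    return [l + mark if l and counts[l] > 1 else l for l in lines]

def search_results(marked, mark=MARK):
    """Mimic the 'Line N: ...' rows that a search for the trailing marker lists."""
    return ["\tLine {}: {}".format(n, l)
            for n, l in enumerate(marked, start=1) if l.endswith(mark)]

marked = mark_duplicates(["A", "B", "A", "C"])
print("\n".join(search_results(marked)))
```

Since the markers are appended to existing lines and no line is added or removed, the reported line numbers stay those of the original file.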
REMARKS :
- The number of lines and characters of each block, between two identical lines, does matter !
  - For a text containing only 1 char/line, the regex can handle a block of about 26,500 lines
  - For a text containing only 70 char/line, the regex can handle a block of about 3,000 lines
  - For a text containing only 140 char/line, the regex can handle a block of about 2,800 lines
- If you exceed these values, you’ll probably get a catastrophic breakdown event which invalidates the whole method
So, one piece of advice :
- First, get successive matches of the regex instead of doing the different replacements and :
If, at some step of the search, the number of selected lines is equal to the number Ln - 1, in the status bar, then something went wrong and you cannot rely on this method, given your INPUT file ;-((
If, for all steps, the number of selected lines is less than the number Ln - 1, in the status bar, then you can be confident regarding this method !
Best Regards
guy038
-
-
@guy038 When you use a \1 backreference in a search expression then the state of the ignore-case flag matters if the capture group contains letters. I take advantage of this all the time.
For example, (?-i)(...)\1 only matches on line 1:
1 abcabc
2 abcaBc
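The same behaviour can be checked with Python's `re` module, which is a different engine from the one in Notepad++ but treats backreferences the same way with respect to the ignore-case flag. A quick sketch:

```python
import re

DOUBLED = r"(...)\1"   # three characters, immediately repeated

# Case-sensitive: the backreference must repeat the captured text exactly.
print(bool(re.search(DOUBLED, "abcabc")))                 # True
print(bool(re.search(DOUBLED, "abcaBc")))                 # False

# Ignore-case: the backreference is compared case-insensitively as well.
print(bool(re.search(DOUBLED, "abcaBc", re.IGNORECASE)))  # True
```

So even a pattern with no literal letters in it can change behaviour with the flag, because the letters arrive through the capture group.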
-
Even though OP had no interest in a PythonScript solution, perhaps others will, so here I present my solution which I call FindAndDisplayAllDuplicateLines.py :
# -*- coding: utf-8 -*-
from __future__ import print_function

#########################################
#
#  FindAndDisplayAllDuplicateLines (FADADL)
#
#########################################

# references:
#  https://community.notepad-plus-plus.org/topic/25145/find-and-display-all-duplicate-lines
#  for newbie info on PythonScripts, see https://community.notepad-plus-plus.org/topic/23039/faq-desk-how-to-install-and-run-a-script-in-pythonscript

#-------------------------------------------------------------------------------

from Npp import *
import os
import re
from collections import OrderedDict

#-------------------------------------------------------------------------------

class FADADL(object):

    def __init__(self):
        editor.callback(self.doubleclick_callback, [SCINTILLANOTIFICATION.DOUBLECLICK])

    def run(self):

        source_pathname = notepad.getCurrentFilename()

        line_list_by_contents_odict = OrderedDict()

        # create data structure of unique line content and list of line numbers that same content appears on
        def fel_func(contents, line_number, total_lines):
            contents = contents.rstrip('\n\r')
            if len(contents) > 0:  # avoid recording empty lines (rarely want those considered as duplicates)
                if contents in line_list_by_contents_odict:
                    line_list_by_contents_odict[contents].append(line_number)
                else:
                    line_list_by_contents_odict[contents] = [ line_number ]
        editor.forEachLine(fel_func)

        # generate output text and hold in memory for now
        num_sets_of_duplicates = 0
        output_line_list = []
        for line_contents in line_list_by_contents_odict:
            if len(line_list_by_contents_odict[line_contents]) > 1:
                num_sets_of_duplicates += 1
                output_line_list.append(' Set-{i} text: {c}'.format(c=line_contents.lstrip(), i=num_sets_of_duplicates))
                for line_number in line_list_by_contents_odict[line_contents]:
                    user_line_number = line_number + 1
                    output_line_list.append('\tLine {n}'.format(n=user_line_number))

        # create and open a results file (extension .sr means "search results")
        output_file_path = os.path.expandvars(r'%TEMP%\DupeLineResults.sr')
        if not os.path.exists(output_file_path):
            open(output_file_path, 'w').close()  # so notepad.open() won't prompt or fail on non-existent file
        notepad.open(output_file_path)
        eol = ['\r\n', '\r', '\n'][editor.getEOLMode()]
        editor.setText('Search DUPLICATE LINES ({n} set{s}) in "{p}"'.format(
            n=num_sets_of_duplicates,
            p=source_pathname,
            s='s' if num_sets_of_duplicates != 1 else ''
            ) + eol)
        editor.appendText(eol.join(output_line_list) + eol)
        notepad.save()

    def doubleclick_callback(self, args):
        # when user double-clicks a "Line xxx" line, jump to that line in the source file,
        #  just like Notepad++'s Search-results panel works
        first_line_content = editor.getLine(0)
        m = re.match(r'Search DUPLICATE LINES \(\d+ sets?\) in "(?P<src_path>.+)"', first_line_content)
        if m:
            source_pathname = m.group('src_path')
            double_clicked_line_number = args['line']
            double_clicked_line_content = editor.getLine(double_clicked_line_number)
            m = re.match(r'\tLine (?P<src_user_line>\d+)', double_clicked_line_content)
            if m:  # only "Line xxx" rows carry a line number; header and Set rows are ignored
                line_in_source_file = int(m.group('src_user_line')) - 1
                for (pathname, buffer_id, index, view) in notepad.getFiles():
                    if pathname == source_pathname:
                        notepad.activateIndex(view, index)
                        editor.gotoLine(line_in_source_file)
                        break

#-------------------------------------------------------------------------------

if __name__ == '__main__':
    try:
        fadadl
    except NameError:
        fadadl = FADADL()
    fadadl.run()
Let’s demo the script:
Take some text and put it in a new tab (that you can hard-name save, or not):
A
B
A
C
A
B
to get:
With the tab active, run the script; a new tab will be created, with content:
If you double-click a "Line _" line in the output file, you'll be taken to the original input file at the indicated line number (much like how double-clicking works in Notepad++'s Search results panel).
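For anyone who wants to try the grouping step outside Notepad++, here is a minimal stand-alone sketch in plain Python 3. It is not the script itself: `find_duplicate_line_sets` is a hypothetical helper name, `str.splitlines()` only approximates how the editor splits lines, and the input assumes the demo text above is six one-letter lines.

```python
from collections import OrderedDict


def find_duplicate_line_sets(text):
    """Map each repeated non-empty line to the 1-based line numbers it occurs on."""
    lines_by_content = OrderedDict()
    # splitlines() is an approximation of the editor's own line splitting
    for index, line in enumerate(text.splitlines()):
        if line:  # skip empty lines, as the script does
            lines_by_content.setdefault(line, []).append(index + 1)
    # keep only content that appears on more than one line
    return OrderedDict(
        (content, numbers)
        for content, numbers in lines_by_content.items()
        if len(numbers) > 1
    )


# assumed demo input: six one-letter lines
demo_text = "A\nB\nA\nC\nA\nB\n"
duplicate_sets = find_duplicate_line_sets(demo_text)
for set_number, (content, line_numbers) in enumerate(duplicate_sets.items(), 1):
    print("Set-{} text: {}".format(set_number, content))
    for line_number in line_numbers:
        print("\tLine {}".format(line_number))
```

The ordered dictionary keeps the sets in the order in which each line first appears, which is why the script reports its sets in file order.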
Moderator EDIT (2024-Jan-14): The author of the script has found a fairly serious bug with the code published here for those that use Mac-style or Linux-style line-endings in their files. The logic for Mac and Linux was reversed, and thus if the script was used on one type of file, the line-endings for the opposite type of file could end up in the file after the script is run. This is insidious, because unless one works with visible line-endings turned on, this is likely not noticed. Some detail on the problem is HERE. The script above has been corrected per that instruction. -
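To make the line-ending fix concrete, here is a small sketch of the lookup in plain Python, runnable without Notepad++. It relies on Scintilla's EOL mode numbering (SC_EOL_CRLF = 0, SC_EOL_CR = 1, SC_EOL_LF = 2); `eol_for_mode` and `join_lines` are hypothetical helper names used only for illustration.

```python
# Scintilla EOL mode values: SC_EOL_CRLF = 0, SC_EOL_CR = 1, SC_EOL_LF = 2
EOL_BY_MODE = ['\r\n', '\r', '\n']


def eol_for_mode(eol_mode):
    """Return the line-ending string for a Scintilla EOL mode number."""
    return EOL_BY_MODE[eol_mode]


def join_lines(lines, eol_mode):
    """Join lines with the ending for eol_mode and terminate the last line too."""
    eol = eol_for_mode(eol_mode)
    return eol.join(lines) + eol


# mode 2 is Linux-style (LF); swapping entries 1 and 2 is the kind of mix-up
# that writes CR endings into an LF file
print(repr(join_lines(['Line 1', 'Line 3'], 2)))
```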
-
Thank you for the beautiful script.
I don’t get Email notifications on new replies to threads I’m watching.
Any idea?
Thank you.
- Logged in via GitHub.
-
@Alan-Kilborn The script is pretty cool.
I saw in the code that you support double-clicking and so played with that. One puzzle is that it intermittently creates random selections when I double-click. For example, for one match I see:
Line 19360
Line 19364
I double-click on Line 19360 and am taken to that line in my text file. I press Ctrl+PageDown to flip to the DupeLineResults.sr file/tab and double-click on Line 19364, expecting to be dropped down four lines. Instead, the page starts with line 19360 and there is a selection running down to the middle of line 19398. Using the mouse and scroll bar, I see that the selection started in the middle of line 991.
At first I thought the cause was that this particular file has a number of non-printable characters, extended Unicode characters, and illegal byte sequences. I created a copy of this file that is restricted to plain ASCII and tested again. The line numbers I noted above are from a plain ASCII file, though it's still classified as UTF-8 in the status line. A regexp search for [ -~\t\r\n]+ gets one hit, running from the first line to the end, which verifies that the file is plain ASCII.
The random selections are intermittent. Sometimes they happen, and sometimes not.
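The same plain-ASCII check can be reproduced outside the editor. This is a minimal Python 3 sketch that uses the character class from the search above; `is_plain_ascii` is a hypothetical helper name, and an empty text counts as not matching because the pattern requires at least one character.

```python
import re

# Same character class as the search above: space through tilde, plus tab, CR and LF
PLAIN_ASCII = re.compile(r'[ -~\t\r\n]+')


def is_plain_ascii(text):
    """Return True if the whole text is printable ASCII, tab, CR or LF only."""
    return PLAIN_ASCII.fullmatch(text) is not None


print(is_plain_ascii("Line 19360\r\nLine 19364\r\n"))
print(is_plain_ascii("caf\u00e9"))
```

A single match that spans the whole text is the equivalent of the one hit running from the first line to the end.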
This is on Notepad++ v8.6 (32-bit)
Build time : Nov 18 2023 - 00:41:46
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Command Line : “c:\tmp\tmp”
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
OS Name : Windows 11 Home (64-bit)
OS Version : 22H2
OS Build : 22621.2715
Current ANSI codepage : 1252
Plugins :
DSpellCheck (1.5)
mimeTools (2.9)
NppConverter (4.5)
NppExport (0.4)
NppTextFX (0.2.6)
PythonScript (2)