How to remove duplicate rows in the Find result?



  • Does anyone know how to remove duplicate rows in the Find result?
    With Linux grep, a line is displayed only once, even if several words match on that line.
    However, the Notepad++ Find result displays the line once for each matched word.



  • Hello Dongwan Shin,

    Quite easy ! Once you get your results, in the Find Result panel :

    • RIGHT-click on the green line, with the absolute path of your file, and the number of hits

    • Select the option Copy

    • Open a new tab ( Ctrl + N )

    • Click on this new file to get the focus

    • Paste the clipboard contents ( Ctrl + V )

    • Move back to the very beginning of this new file

    • Open the Replace dialog ( Ctrl + H )

    • Type (?-s)(^.*\R)\1+ in the Find what: zone

    • Type \1 in the Replace with: zone

    • Select the Regular expression search mode

    • Click on the Replace All button

    • Finally, save this list of NON-duplicate lines

    Et voilà !
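
    For anyone who would rather script this step, the same replacement can be sketched in Python. This is only an illustration and not something Notepad++ does itself: Python's re module does not understand \R or a bare (?-s), so the sketch assumes the pasted text uses \n line endings and relies on the dot not matching line breaks by default. The sample lines and the name collapse_repeated_lines are made up.

```python
import re

# Hypothetical sample: the same line reported once per hit.
pasted = (
    "Line 3: foo and foo again\n"
    "Line 3: foo and foo again\n"
    "Line 7: another foo\n"
)


def collapse_repeated_lines(text: str) -> str:
    """Keep one copy of each run of identical consecutive lines."""
    # (?m) lets ^ match at every line start; \n stands in for \R here.
    return re.sub(r"(?m)^(.*\n)\1+", r"\1", text)


result = collapse_repeated_lines(pasted)
print(result, end="")
```

    Like the regex above, this only merges identical lines that sit next to each other.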

    Best regards,

    guy038



  • @guy038 Could you explain how that regular expression works? In particular, the parts I’m curious about are in the first group, ?-s, and in the second group, the \R. Thanks.



  • Hello, Casey Crockett

    No problem ! So :

    1) :

    Casey, the first part, of the regex, (?-s), is NOT a group, at all ! It’s an in-line pattern modifier. Our N++ regex engine can understand four modifiers :

    • The (?-s) modifier means that the dot ( . ) meta-character matches standard characters, ONLY, ( except for the \f character ), even if you previously checked the . matches newline option. The opposite syntax, (?s), would force the regex engine to consider that the dot matches ANY character ( standard and EOL characters ), even if you did not check, previously, the . matches newline option.

    • The (?-i) modifier forces the regex engine to search, in a case-sensitive way, even if you did not check, previously, the Match case option. The opposite syntax (?i) would do a search, regardless of letter case, even if you, previously, checked the Match case option.

    Two other modifiers, less used, are, also, available :

    • The (?m) modifier, which means that the ^ and $ assertions represent, respectively, the beginning and end of any line ( default N++ behaviour ). The opposite syntax (?-m) means that the ^ and $ assertions represent, ONLY, the very beginning and the very end of the file

    • The (?-x) modifier, which means that any space character, in the regex, is significant ( default N++ behaviour ). The opposite syntax (?x) means that you may separate the different parts of a complex regex with space characters, which are ignored by the regex engine. Note that the two syntaxes [ ], with a space between the two square brackets, or \ , with a space after the backslash character, each stand for a true space character !

    • Note that you may group these modifiers, as, for instance, (?s-i) or (?xi-s) and that a modifier takes effect, till the opposite form is found, further on, in the regex or, of course, till the end of the regex
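
    To watch these modifiers at work outside Notepad++, here is a small Python sketch. It is an approximation under stated assumptions: Python's re module accepts (?i), (?s), (?m) and (?x) at the start of a pattern, but not the bare negative forms such as (?-s), and its defaults differ from N++ ( in Python, ^ matches only at the start of the text unless (?m) is given ). The sample text is invented.

```python
import re

text = "Apple\napple\n"

# (?i): ignore letter case; without it the search is case-sensitive.
insensitive = re.findall(r"(?i)apple", text)
sensitive = re.findall(r"apple", text)

# (?s): the dot also matches line breaks, so one match spans both lines.
dot_all = re.findall(r"(?s)A.*e", text)
dot_line = re.findall(r"A.*e", text)

# (?m): ^ matches at the start of every line, not only at the start of the text.
multi = re.findall(r"(?m)^a", text)
single = re.findall(r"^a", text)

# (?x): spaces in the pattern are ignored; [ ] still matches a real space.
free_spacing = re.findall(r"(?x) a [ ] b", "a b")

# Modifiers can be grouped, as in (?si).
combined = re.findall(r"(?si)a.*E", text)
```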

    2) :

    • Outside a character class ( [....] ), the escape sequence \R matches ANY Unicode New line sequence. The exact definition of the escape sequence \R is (?>\r\n|[\n\x0b\f\r\x85\x{2028}\x{2029}]).

    • Just notice the atomic group construction, which means that, for instance, once the \R form has matched the two End-of-Line characters \r\n, the regex engine will never backtrack into it ( to match the \r alone ), in order to match the possible remainder of the overall regex, located after the \R !

    • But, most of the time, we may, simply, consider that the \R syntax matches ANY kind of EOL character :

      • The sequence \r\n ( case of a Windows End of Line )

      • The \n character ( case of a Unix End of Line )

      • The \r character ( case of a Macintosh End of Line )
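
    Python's re module has no \R, so the definition quoted above can be spelled out by hand to see what it covers. The sketch below is an assumption-laden stand-in: it uses a plain alternation rather than the atomic group ( atomic groups only exist in Python 3.11 and later ), and the name NEWLINE and the sample string are made up.

```python
import re

# Hand-written equivalent of \R: \r\n is tried first so that a Windows
# line ending is consumed as one unit, not as \r followed by \n.
NEWLINE = re.compile(r"\r\n|[\n\x0b\f\r\x85\u2028\u2029]")

# Hypothetical text mixing Windows, Unix and old Macintosh line endings.
sample = "win\r\nunix\nmac\rlast"

lines = NEWLINE.split(sample)
endings = NEWLINE.findall(sample)

print(lines)
print(endings)
```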


    Therefore, the regex (?-s)(^.*\R)\1+ tries to find :

    • From beginning of line, ANY range of standard characters, even empty, followed by the current EOL character(s)

    • This complete current line is stored as group 1, due to the couple of parentheses, after the modifier

    • The regex engine, then, searches for any additional line(s), identical to the previous one ( The + quantifier is equivalent to {1,} )

    • In replacement, the block of identical lines ( the first and the additional ones ) is replaced by ONE occurrence, only, of that line ( \1 ) !
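
    The pieces described above can be inspected one by one with a short Python sketch. Again, this is only an approximation: \n replaces \R and (?m) is added so that ^ matches at each line start, which N++ does by default. The sample text is invented.

```python
import re

# Hypothetical text with one run of three identical lines.
text = "x\ndup\ndup\ndup\ny\n"

match = re.search(r"(?m)^(.*\n)\1+", text)

first_line = match.group(1)   # the line stored as group 1, EOL included
whole_block = match.group(0)  # the first line plus its identical repeats
start = match.start()         # offset where the block begins

# Replacing the whole block with group 1 keeps a single copy.
deduplicated = text[:start] + first_line + text[match.end():]
```

    One consequence worth noticing: because group 1 includes the EOL, a duplicate on the very last line of the file is only caught if that line also ends with a line break.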


    To go further on, refer, firstly, to the Notepad++ wiki article, dealing with regular expressions, :

    http://docs.notepad-plus-plus.org/index.php/Regular_Expressions


    In addition, you’ll find good documentation, about the new Boost C++ Regex library, v1.55.0 ( similar to the PERL Regular Common Expressions, v1.48.0 ), used by Notepad++, since its 6.0 version, at the TWO addresses below :

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    • The FIRST link explains the syntax, of regular expressions, in the SEARCH part

    • The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part


    You may, also, look for valuable information, on the sites, below :

    http://www.regular-expressions.info

    http://www.rexegg.com

    http://perldoc.perl.org/perlre.html

    Best Regards,

    guy038



  • @guy038 Thank You! You went above and beyond.

