<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Deleting lines that repeat the first 15 characters]]></title><description><![CDATA[<p dir="auto">I found very helpful the solution to eliminate duplicate lines found at</p>
<p dir="auto"><a href="https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines" rel="nofollow ugc">https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines</a></p>
<p dir="auto">How can the regex in the search field <strong>(?-s)(^.+\R)\1+</strong> be modified so that a line is deleted if the first 15 characters of a line match the first 15 characters of the preceding line?</p>
<p dir="auto">Thank you,<br />
Doug</p>
]]></description><link>https://community.notepad-plus-plus.org/topic/14729/deleting-lines-that-repeat-the-first-15-characters</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 23:25:28 GMT</lastBuildDate><atom:link href="https://community.notepad-plus-plus.org/topic/14729.rss" rel="self" type="application/rss+xml"/><pubDate>Fri, 03 Nov 2017 02:07:17 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Thu, 21 Oct 2021 00:27:53 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/23475">@Saya-Jujur</a> ,</p>
<p dir="auto">Untested, because I am on my phone, but maybe try</p>
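<pre><code>prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        # compare only the first 200 characters of consecutive lines
        if line[:200] == prev[:200]:
            print(n + 1)  # 1-based number of the repeated line
        prev = line[:200]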
</code></pre>
<p dir="auto">(You said you changed 15 to 200 already, but maybe you missed an instance, or maybe comparing just the first 200 characters of <code>prev</code>, as above, is enough.)</p>
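<p dir="auto">The snippet above only reports line numbers. To actually drop the repeated lines, something like the following might work. It is a minimal Python 3 sketch, not tested against the real data; the function name <code>drop_repeated_prefix</code> and the file names in the comments are made up for illustration, and it only compares each line with the one directly before it.</p>

```python
def drop_repeated_prefix(lines, width=200):
    """Yield the lines whose first `width` characters differ from those
    of the line directly before them (hypothetical helper)."""
    prev = None  # None, so the very first line is always kept
    for line in lines:
        key = line.rstrip("\r\n")[:width]
        if key != prev:
            yield line
        prev = key


# Small in-memory demonstration with a 3-character prefix
sample = ["aaa1\n", "aaa2\n", "bbb1\n", "aaa3\n"]
print(list(drop_repeated_prefix(sample, width=3)))

# Possible use on files (both names are assumptions):
# with open("data.txt") as src, open("data_dedup.txt", "w") as dst:
#     dst.writelines(drop_repeated_prefix(src, width=200))
```

<p dir="auto">Note that the last sample line is kept, because the line before it starts with different characters; non-adjacent repeats are not removed by this approach.</p>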
<p dir="auto">If that doesn’t work, then follow <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/12335">@Terry-R</a>’s advice</p>
]]></description><link>https://community.notepad-plus-plus.org/post/70741</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/70741</guid><dc:creator><![CDATA[PeterJones]]></dc:creator><pubDate>Thu, 21 Oct 2021 00:27:53 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Wed, 20 Oct 2021 23:02:43 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/23475">@Saya-Jujur</a> said in <a href="/post/70739">Deleting lines that repeat the first 15 characters</a>:</p>
<blockquote>
<p dir="auto">How can we delete duplicate lines if first 40 words (or lets say, first 200 characters including spaces)  are same? I have changed 15 to 200, I am afraid the code did not work.</p>
</blockquote>
<p dir="auto">It would have been better to have started a new thread since this one was last posted to 4 years ago. By all means reference it but a new one I think is warranted.</p>
<p dir="auto">You don’t give much detail on your need. Are the duplicate lines next to each other? That is what this thread was all about.</p>
<p dir="auto">So start a new post, outline your need, give examples. Read the post at the top (of the Help Wanted section) titled “Please read before posting” as it will help you provide examples in a format that we can trust haven’t been altered by the posting window and we can copy to help us in tests before we provide a solution to you.</p>
<p dir="auto">Terry</p>
<p dir="auto">PS your request to Scott Sumner directly will likely go unanswered (by him), he hasn’t been active on this forum for a long time.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/70740</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/70740</guid><dc:creator><![CDATA[Terry R]]></dc:creator><pubDate>Wed, 20 Oct 2021 23:02:43 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Wed, 20 Oct 2021 22:51:15 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/374">@Scott-Sumner</a> , about that python code:</p>
<pre><code>prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        if line[:15] == prev:
            print n+1
        prev = line[:15]
</code></pre>
<p dir="auto">How can we delete duplicate lines if first 40 words (or lets say, first 200 characters including spaces)  are same? I have changed 15 to 200, I am afraid the code did not work.</p>
<p dir="auto">Thank you</p>
]]></description><link>https://community.notepad-plus-plus.org/post/70739</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/70739</guid><dc:creator><![CDATA[Saya Jujur]]></dc:creator><pubDate>Wed, 20 Oct 2021 22:51:15 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Thu, 23 Nov 2017 15:40:51 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a></p>
<p dir="auto">Yea, wow, I totally didn’t see the missing <code>^</code> as well.  Of course, as our local regex guru I don’t normally question <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a>’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this.  Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all.  :-(</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28205</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28205</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Thu, 23 Nov 2017 15:40:51 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Tue, 21 Nov 2017 19:48:00 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a>, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/374">@scott-sumner</a> and <strong>All</strong>,</p>
<p dir="auto">I’m <strong>extremely</strong> embarrassed, indeed ! I made an <strong>important</strong> <strong>beginner’s</strong> mistake in my <strong>previous</strong> regex, which I was testing intensively :-(( My God, of course ! The <strong>RIGHT</strong> regex is <strong><code>(?-s)^(.{15}).*\R\K(?:\1.*\R)+</code></strong> and <strong>NOT</strong> the regex <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong> :-))</p>
<p dir="auto">Do you see the <strong>difference</strong> ? Well, it’s just the anchor <strong><code>^</code></strong>, after the modifier <strong><code>(?-s)</code></strong> !</p>
<p dir="auto">Indeed, let’s try again the <strong>wrong</strong> regex :</p>
<p dir="auto">Assuming the <strong>test</strong> list, below :</p>
<pre><code class="language-diff">91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
13,1000   000002   ,20.976,21.724,20.620,21.336,0
13,1000   000003   ,21.344,22.116,21.336,21.918,0
13,1000   000004   ,21.918,21.918,20.797,20.797,0
</code></pre>
<p dir="auto">So, first, the <strong>caret</strong> is right before the <strong>9</strong> digit, of the <strong>first</strong> line and the <strong>fifteen</strong> characters <strong><code>91,02,2013,1000</code></strong> <strong>cannot</strong> be found elsewhere. Then, as no anchor <strong><code>^</code></strong> ( <strong>beginning</strong> of line ) exists, the regex engine goes ahead <strong>one</strong> position between the digits <strong>9</strong> and <strong>1</strong> of the <strong>first</strong> line. Again, as the <strong>fifteen</strong> characters <strong><code>1,02,2013,1000b</code></strong> do <strong>not</strong> exist further on, the regex engine goes ahead <strong>one</strong> position, examining, now the string <strong><code>,02,2013,1000bb</code></strong> …</p>
<p dir="auto">… till the <strong>fifteen</strong> characters <strong><code>13,1000bbb00000</code></strong>, which can be found, this time, at <strong>beginning</strong> of lines <strong><code>2</code></strong>, <strong><code>3</code></strong> and <strong><code>4</code></strong> ! Just imagine the work to accomplish for <strong><code>458,404</code></strong> lines of the <strong>Data2.txt</strong> file :-((</p>
<p dir="auto">( Note : the <strong>lowercase</strong> letter <strong><code>b</code></strong>, above, stands for a <strong>space</strong> character )</p>
<p dir="auto">To <strong>easily</strong> see the problem, just get rid of the <strong><code>\K</code></strong> syntax, forming the regex <strong><code>(?-s)(.{15}).*\R(?:\1.*\R)+</code></strong>. If you click on the <strong>Find Next</strong> button, it selects, <strong>after</strong> testing positions <strong>1</strong>, <strong>2</strong>, … and <strong>8</strong>, from the <strong>last two</strong> digits of the year <strong>2013</strong> till the <strong>end</strong> of the text. But, if you’re using the regex <strong><code>(?-s)^(.{15}).*\R(?:\1.*\R)+</code></strong>, with the <strong>anchor</strong> <strong><code>^</code></strong>, it <strong>correctly</strong> gets the lines <strong><code>2</code></strong>, <strong><code>3</code></strong> and <strong><code>4</code></strong>, which are <strong>identical</strong> regarding their <strong>first</strong> <strong><code>15</code></strong> characters !</p>
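<p dir="auto">The effect of the missing anchor can also be reproduced outside Notepad++. The sketch below uses Python’s <code>re</code> module, which is an assumption on my side and not the engine Notepad++ uses: it knows neither <code>\R</code> nor <code>\K</code>, so <code>\n</code> and an extra capture group stand in for them, and <code>.</code> already excludes line breaks by default, like <code>(?-s)</code>. The names <code>UNANCHORED</code>, <code>ANCHORED</code> and <code>DEDUPE</code> are illustrative only.</p>

```python
import re

# The four-line test list from above
TEXT = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)

# Without ^ the engine may start the 15-character group anywhere in a line
UNANCHORED = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
# With ^ (multiline mode) the group can only start at the beginning of a line
ANCHORED = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

loose = UNANCHORED.search(TEXT)
strict = ANCHORED.search(TEXT)

# The unanchored match begins in the middle of line 1, at the "13" of "2013"
print(loose.start(), repr(loose.group(0)[:15]))
# The anchored match begins exactly at the start of line 2
print(strict.start() == TEXT.index("\n") + 1)

# Stand-in for \K: capture the first line of each run and put it back
DEDUPE = re.compile(r"^((.{15}).*\n)(?:\2.*\n)+", re.MULTILINE)
deduped = DEDUPE.sub(r"\1", TEXT)
print(deduped, end="")
```

<p dir="auto">With the anchored pattern, only lines <code>3</code> and <code>4</code> are removed, which is what the corrected Notepad++ regex is meant to do on this test list.</p>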
<hr />
<p dir="auto">So, <strong>Doug</strong>, to sum up, using the <strong>right</strong> regex <strong><code>(?-s)^(.{15}).*\R\K(?:\1.*\R)+</code></strong> against your <strong>Data2.txt</strong> file does <strong>not</strong> find any occurrence ( <strong><code>~5s</code></strong> ), which is the <strong>expected</strong> result, as we know, by construction, that the <strong><code>458,404</code></strong> lines of this file are <strong>all different</strong> :-)</p>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28156</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28156</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Tue, 21 Nov 2017 19:48:00 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Fri, 17 Nov 2017 14:51:09 GMT]]></title><description><![CDATA[<p dir="auto">So <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a>’s results and conclusions are interesting.  I decided to see what would happen if a Pythonscript-based search was conducted.  To that end I came up with:</p>
<pre><code>matches = []
def match_found(m): matches.append(m.span(0))
editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)
for (start, _) in matches: print(editor.lineFromPosition(start) + 1)
print('done')
</code></pre>
<p dir="auto">With that script and the DATA2.txt file, I found that with 67025 lines in the file I would see “done” printed in the PS console window, but with one more line, 67026, I would get this:</p>
<pre><code>Traceback:
    editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)
&lt;type 'exceptions.RuntimeError'&gt;:  The complexity of matching the regular expression exceeded predefined bounds.  Try refactoring the regular expression to make each choice made by the state machine unambiguous.  This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
</code></pre>
<p dir="auto">This seems consistent with <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a>’s findings that somewhere between 67000 and 67100 lines there is a “problem”.</p>
<p dir="auto">So I think the meaning of all this is that Notepad++ is not a great tool for the OP’s task.  :-(</p>
<p dir="auto">No one wants to be trying to solve one problem, only to encounter problems with the method they are using to solve that problem.  Thus, I’d advise, if this is a recurring need, to have a serious look at the short bit of standard Python (or rewrite in your language of choice) that I provided much earlier in this thread.  :-D</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28101</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28101</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Fri, 17 Nov 2017 14:51:09 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Sat, 19 Nov 2022 21:42:56 GMT]]></title><description><![CDATA[<p dir="auto">Hi, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a>, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/374">@scott-sumner</a> and <strong>All</strong>,</p>
<p dir="auto">To begin with, <strong>Doug</strong>, I was a bit surprised that both the <strong>numbers</strong> at column <strong><code>19</code></strong> and the <strong>first 15</strong> characters look equally <strong>sorted</strong> in your <strong>Data2.txt</strong> file ! So I hope that you understood that the <strong>first sort</strong> must be performed <strong>after</strong> the use of the <strong>Column Editor</strong>. Indeed, these <strong>numbers</strong> are only added in order to get the <strong>original</strong> order back, <strong>after</strong> the suppression of all the <strong>duplicate</strong> lines ! Just a remark :-))</p>
<p dir="auto">Now, <strong>mangoguy</strong> and others, keep in mind that, when a rather <strong>complicated</strong> regex is applied against a <strong>large</strong> file, a complete <strong>failure</strong> may occur, with only <strong><code>1 match</code></strong> which simply represents the <strong>selection</strong> of <strong>all</strong> the file contents :-((</p>
<p dir="auto">So, I began to investigate this problem more deeply ! First of all, I verified that the <strong>first 15</strong> characters of the lines of your <strong>Data2.txt</strong> file had absolutely <strong>no duplicate</strong>. And, like <strong>Scott</strong> and you, I noticed that the regex <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong> wrongly selects the <strong>whole</strong> file, after a while, instead of finding <strong>0 results</strong>.</p>
<hr />
<p dir="auto">At this point, I simply thought about <strong>reducing</strong> the file to find the <strong>upper</strong> value beyond which we get into trouble. It happened that, with my old <strong>Win XP</strong> laptop, the <strong>limit</strong> is about <strong><code>67,000</code></strong> lines. For this value, you get the correct result : <strong>no match</strong>. But, for instance, with <strong><code>67,100</code></strong> lines, we get the <strong>incorrect</strong> single match !</p>
<p dir="auto">Note that using the <strong>similar</strong> regex <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)</code></strong>, without the <strong><code>+</code></strong> sign at its <strong>end</strong>, this limit increases to about <strong><code>68,830</code></strong> lines !</p>
<hr />
<p dir="auto">So I was wondering : could it be that the <strong>lack</strong> of matches, combined with the need to scan a <strong>great</strong> amount of data, causes that <strong>false positive</strong> ? So, strangely, I decided to <strong>add false positives</strong> about every <strong><code>65,000</code></strong> lines, as below :</p>
<pre><code class="language-diff">---------------
---------------
</code></pre>
<p dir="auto">So, I added these <strong>two</strong> lines of <strong><code>15</code></strong> dashes, at lines <strong><code>65,000</code></strong>, <strong><code>130,000</code></strong>, <strong><code>195,000</code></strong>, <strong><code>260,000</code></strong>, <strong><code>325,000</code></strong>, <strong><code>390,000</code></strong> and <strong><code>455,000</code></strong>. In addition, I duplicated the <strong>first</strong> line as well as the <strong>last</strong> line of the file.</p>
<p dir="auto">If my <strong>intuition</strong> was correct, the regex would match, of course, all the <strong>second</strong> lines of <strong>dashes</strong> ( <em>false positives</em> ) but also, the <strong>first</strong> duplicate, in line <strong><code>2</code></strong> and the <strong>second</strong> duplicate, at <strong>end</strong> of file. This would prove that the search <strong>process</strong> can go on, normally, throughout an <strong>important</strong> file ! I ran a <strong>Find All in Current Document</strong> process and… <strong>Bingo</strong> ! I obtained the <strong>Find Result</strong> panel, below, with the <strong>expected</strong> results :</p>
<pre><code class="language-diff">Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+" (9 hits in 1 file)
  new 1 (9 hits)
	Line 2: 01,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
	Line 65003: ---------------
	Line 130002: ---------------
	Line 195002: ---------------
	Line 260002: ---------------
	Line 325002: ---------------
	Line 390002: ---------------
	Line 455002: ---------------
	Line 458420: 12,31,2015,2559   458404   ,3.270,3.270,3.538,3.527,0
</code></pre>
<p dir="auto">Therefore, it seems that <strong>too large a gap</strong> between <strong>two successive</strong> matches causes the <strong>complete failure</strong> of the regex search process !? I just hope that, for most users, this gap of about <strong>65000</strong> lines ( perhaps we’d better speak about <strong>bytes</strong> ! ), noted with my <strong>outdated</strong> laptop, can really be <strong>greater</strong> :-))</p>
<hr />
<p dir="auto">Instead of adding some <strong>false positives</strong> in <strong>huge</strong> files, we could also search for a <strong>string</strong> which would occur <strong><code>every x</code></strong> lines ! For instance, starting with the <strong>Data2.txt</strong> file, I built a file made of <strong><code>five</code></strong> copies of <strong>Data2.txt</strong> : I just changed the <strong>first</strong> character of each line, taking, successively, <strong><code>2</code></strong> and <strong><code>3</code></strong>, then <strong><code>4</code></strong> and <strong><code>5</code></strong>,… instead of <strong><code>0</code></strong> and <strong><code>1</code></strong>, in order to keep a list of lines <strong>without</strong> any <strong>duplicate</strong> :-)</p>
<p dir="auto">This file contained <strong><code>126,274,854</code></strong> bytes and <strong><code>2,292,022</code></strong> lines. So, I decided that, in addition to the detection of <strong>duplicates</strong>, with the regex <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong>, I would search for lines <strong><code>50,000</code></strong>, <strong><code>100,000</code></strong>, and so on…, with the regex <strong><code>(5|0)0000\x20</code></strong> To that purpose, I just used the list of numbers, at column <strong><code>19</code></strong>, copied <strong>five</strong> times !</p>
<p dir="auto">So the <strong>final</strong> regex is , simply, the two <strong>alternatives</strong> : <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20</code></strong>. Again, I clicked on the <strong>Find All in Current Document</strong> button and, …after <strong><code>6m 49s</code></strong>( Waoooou ! ) , the <strong>Find Result</strong> displayed, at last :</p>
<pre><code class="language-diff">Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20" (47 hits in 1 file)
  new 1 (47 hits)
	Line 2: 01,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
	Line 50001: 02,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
	Line 100001: 03,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
	Line 150001: 05,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
	Line 200001: 06,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
	Line 250001: 07,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
	Line 300001: 09,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
	Line 350001: 10,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
	Line 400001: 11,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
	Line 450001: 12,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
	Line 508405: 22,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
	Line 558405: 23,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
	Line 608405: 25,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
	Line 658405: 26,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
	Line 708405: 27,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
	Line 758405: 29,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
	Line 808405: 30,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
	Line 858405: 31,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
	Line 908405: 32,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
	Line 966809: 42,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
	Line 1016809: 43,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
	Line 1066809: 45,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
	Line 1116809: 46,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
	Line 1166809: 47,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
	Line 1216809: 49,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
	Line 1266809: 50,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
	Line 1316809: 51,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
	Line 1366809: 52,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
	Line 1425213: 62,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
	Line 1475213: 63,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
	Line 1525213: 65,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
	Line 1575213: 66,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
	Line 1625213: 67,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
	Line 1675213: 69,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
	Line 1725213: 70,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
	Line 1775213: 71,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
	Line 1825213: 72,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
	Line 1883617: 82,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
	Line 1933617: 83,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
	Line 1983617: 85,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
	Line 2033617: 86,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
	Line 2083617: 87,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
	Line 2133617: 89,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
	Line 2183617: 90,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
	Line 2233617: 91,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
	Line 2283617: 92,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
	Line 2292022: 92,31,2015,2559   458404   ,3.270,3.270,3.538,3.527,0
</code></pre>
<p dir="auto">As you can see, the <strong>duplicate</strong> line <strong><code>2</code></strong> and the second <strong>duplicate</strong>, at line <strong><code>2,292,022</code></strong>, were <strong>correctly</strong> found and reported !</p>
<hr />
<p dir="auto"><strong>Conclusion</strong> :</p>
<p dir="auto">Apparently, when <strong>too large</strong> an amount of text separates <strong>two consecutive</strong> occurrences of the regex search, it <strong>breaks</strong> the normal process, which wrongly returns a <strong>single</strong> selection of <strong>all</strong> the file contents !? So, Mangoguy, as <strong>no duplicate</strong> exists in your <strong>data2.txt</strong> file, it’s obvious that we run into trouble as soon as your file <strong>exceeds</strong> a certain size <strong>limit</strong> !</p>
<p dir="auto">In other words, if, in <strong>huge</strong> files, you get a <strong>lot</strong> of occurrences, throughout the file contents, this should <strong>help</strong> the search process to <strong>correctly</strong> finish the job :-))</p>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28091</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28091</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Sat, 19 Nov 2022 21:42:56 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Tue, 14 Nov 2017 13:25:59 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a> said:</p>
<blockquote>
<p dir="auto">Replace All:1 occurrence was replaced</p>
</blockquote>
<p dir="auto"><em><strong>Formatting note:</strong></em>  Your regular expression was stated as <code>(?-s)(.{15}).\R\K(?:\1.\R)+</code> but I think you really meant <code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code> as per one of <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a> 's regexes above.  In the future, wrap any exact text you want to post here in ` (backticks) to hopefully avoid any confusion.  For example, if you type in `hello` it should appear here as <code>hello</code> without any special characters having trouble.  You can also start a new line with four spaces and then your text to provide some data that won’t be specially interpreted.</p>
<p dir="auto">I see the same behavior as you when trying this regex replacement on your newest data file.  Note that the file is <em><strong>NOT</strong></em> modified by this replacement (disk icon on its tab remains blue after the “replacement” occurs…starting point was a freshly loaded DATA2.txt file).  I’m at a loss to explain this (why it is saying “1 replacement”).  This thread has brought out some really odd things!</p>
<p dir="auto">Note that it <strong>IS</strong> possible to see non-zero replacements listed and have a file NOT be modified (try a <strong>Find-what</strong> of <code>^</code> and a <strong>Replace-with</strong> of <code>$0</code>, also <strong>Reg exp</strong> search mode), but this is very different from your replacement action.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28027</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28027</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Tue, 14 Nov 2017 13:25:59 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 13 Nov 2017 22:11:14 GMT]]></title><description><![CDATA[<p dir="auto">Thank you for the clarification. It worked perfectly with the file exactly as instructed. Thank you!</p>
<p dir="auto">With another pre-sorted file which has no duplicates or blank lines, when I perform the main regex to find and remove duplicate lines<br />
(?-s)(.{15}).<em>\R\K(?:\1.</em>\R)+</p>
<p dir="auto">the replace box returns: “Replace All:1 occurrence was replaced” no matter how many times I repeat the replace. If there are no duplicates I would expect a report of 0 occurrences found.</p>
<p dir="auto">The file is found at<br />
<a href="https://mangoguy.sharefile.com/d-s7b2d2a8b3fb459cb" rel="nofollow ugc">https://mangoguy.sharefile.com/d-s7b2d2a8b3fb459cb</a></p>
<p dir="auto">Thank you,<br />
Doug</p>
]]></description><link>https://community.notepad-plus-plus.org/post/28017</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/28017</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Mon, 13 Nov 2017 22:11:14 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Wed, 08 Nov 2017 18:04:47 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a></p>
<p dir="auto">What do you mean by <code>^</code> for the caret?  I know that sometimes <code>^</code> is referred to as the caret character, but it is not in Notepad++ so I’m confused.</p>
<p dir="auto">Anyway, Here’s what I do and it works to insert at col 19:</p>
<ul>
<li>move caret to line 1 col 19</li>
<li>press <strong>Alt+c</strong> to get <strong>Column Editor</strong> window</li>
<li>tick <strong>Number to Insert</strong></li>
<li>specify <strong>Initial number</strong> of 0</li>
<li>specify <strong>Increase by</strong> of 1</li>
<li>specify an empty field for <strong>Repeat</strong></li>
<li>tick <strong>Leading zeros</strong></li>
<li>specify <strong>Dec</strong> for <strong>Format</strong></li>
<li>click <strong>OK</strong></li>
</ul>
<p dir="auto">Notepad++ places incrementing numbers in col 19 (and beyond) throughout the length of my document.</p>
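<p dir="auto">For comparison, the same numbering step can be sketched in plain Python. This is only an illustration of what the steps above produce, not how Notepad++ implements the Column Editor; the function name <code>insert_counter</code> is hypothetical, and padding lines shorter than the target column with spaces is my own assumption.</p>

```python
def insert_counter(lines, column=19, start=0, step=1):
    """Insert a zero-padded running number at 1-based `column` of every line
    (hypothetical helper; short lines are padded with spaces first)."""
    lines = list(lines)
    last = start + step * (len(lines) - 1) if lines else start
    width = len(str(last))   # "Leading zeros": pad to the widest number
    offset = column - 1      # convert to a 0-based string index
    numbered = []
    for i, line in enumerate(lines):
        number = str(start + i * step).zfill(width)
        padded = line.ljust(offset)
        numbered.append(padded[:offset] + number + padded[offset:])
    return numbered


demo = ["01,02,2013,1000   ,22.107", "01,02,2013,1001   ,20.976"]
for row in insert_counter(demo):
    print(row)
```

<p dir="auto">The lines passed in are expected without their trailing line breaks; the two sample rows are invented for the demonstration.</p>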
<p dir="auto">Are you doing something very different from this?</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27936</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27936</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Wed, 08 Nov 2017 18:04:47 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Wed, 08 Nov 2017 17:44:56 GMT]]></title><description><![CDATA[<p dir="auto">Thank you again.</p>
<p dir="auto">My bad for thinking a nonsorted file would not be a problem.</p>
<p dir="auto">Despite multiple attempts, every time I try to use the column editor, despite putting the caret “^” at column 19 on row 1, the incremental numerical column appears at position 1 and not position 19.</p>
<p dir="auto">Any thoughts as to what I am doing wrong?</p>
<p dir="auto">Thank you,<br />
Doug</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27935</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27935</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Wed, 08 Nov 2017 17:44:56 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Tue, 07 Nov 2017 22:10:54 GMT]]></title><description><![CDATA[<p dir="auto">Hi, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a>, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/374">@scott-sumner</a> and <strong>All</strong>,</p>
<p dir="auto">In my <strong>initial</strong> thread, below, which <strong>Doug</strong> referred to :</p>
<p dir="auto"><a href="https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines" rel="nofollow ugc">https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines</a></p>
<p dir="auto">I said :</p>
<blockquote>
<p dir="auto">The suppression of <strong>all</strong> the <strong>duplicate</strong> lines, in a <strong>pre-sorted</strong> file, can be easily obtained with a Search/Replacement, in <strong>Regular expression</strong> mode !</p>
<p dir="auto">Open your file, containing the <strong>sorted</strong> list of items</p>
<p dir="auto">Open the <strong>Replace</strong> dialog ( CTRL + H )<br />
…</p>
</blockquote>
<p dir="auto">In that text, it’s the word <strong>‘sorted’</strong> which is <strong>important</strong> ! Indeed, imagine this <strong>initial</strong> text, <strong>NOT</strong> sorted :</p>
<pre><code class="language-diff">pqrst
pqrst
pqrst
uvwxy
fghij
fghij
fghij
pqrst
pqrst
pqrst
abcde
abcde
abcde
abcde
fghij
klmno
klmno
klmno
fghij
fghij
</code></pre>
<p dir="auto">and  my <strong>initial</strong> regex :</p>
<p dir="auto">SEARCH <strong><code>(?-s)(^.+\R)\1+</code></strong></p>
<p dir="auto">REPLACE <strong><code>\1</code></strong></p>
<p dir="auto">Even with a <strong>step by step</strong> replacement, with the <strong>Replace</strong> button, it would give :</p>
<pre><code class="language-diff">pqrst
uvwxy
fghij
pqrst
abcde
fghij
klmno
fghij
</code></pre>
<p dir="auto">=&gt; You can see that <strong><code>3</code></strong> lines <strong>fghij</strong> <strong>still</strong> remain, because they were <strong>not adjacent</strong> in the text. In other words, a <strong>single</strong> line is left along with its <strong>two duplicates</strong> !</p>
<p dir="auto">Now, let’s sort the <strong>initial</strong> text, <strong>first</strong> :</p>
<pre><code class="language-diff">abcde
abcde
abcde
abcde
fghij
fghij
fghij
fghij
fghij
fghij
klmno
klmno
klmno
pqrst
pqrst
pqrst
pqrst
pqrst
pqrst
uvwxy
</code></pre>
<p dir="auto"><strong>After</strong> performing the <strong>same</strong> regex S/R, we, now, get the text :</p>
<pre><code class="language-diff">abcde
fghij
klmno
pqrst
uvwxy
</code></pre>
<p dir="auto">And, as expected, it does <strong>not</strong> contain any <strong>duplicate</strong> line. So, with an <strong>initial</strong> sort, this regex, and the <strong>derivative</strong> regexes, work just fine !</p>
<p dir="auto">However, the main <strong>drawback</strong> is that the original <strong>order</strong>, of the file, is <strong>lost</strong>, because of the <strong>sort</strong> process :-((</p>
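<p dir="auto">The two outcomes above can also be reproduced outside Notepad++. The sketch below is only an approximation: it assumes <code>\n</code> line endings and uses Python's <code>re</code> module instead of the Boost engine, so <code>\R</code> becomes <code>\n</code> and <code>(?-s)</code> is simply Python's default. The helper names are made up for the illustration.</p>

```python
import re

# Unsorted sample from this post, one item per line
unsorted_text = "\n".join([
    "pqrst", "pqrst", "pqrst", "uvwxy",
    "fghij", "fghij", "fghij",
    "pqrst", "pqrst", "pqrst",
    "abcde", "abcde", "abcde", "abcde",
    "fghij", "klmno", "klmno", "klmno",
    "fghij", "fghij",
]) + "\n"

# Python counterpart of (?-s)(^.+\R)\1+ : a line followed by identical copies
ADJACENT_DUPLICATES = re.compile(r"^(.+\n)\1+", re.MULTILINE)


def collapse_adjacent(text):
    """Keep one line of every run of identical consecutive lines."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


def sort_lines(text):
    """Sort lines in code-point order (assumed close to the N++ lexicographic sort)."""
    return "".join(sorted(text.splitlines(keepends=True)))


# Without sorting, fghij and pqrst survive more than once
print(collapse_adjacent(unsorted_text).split())
# After sorting, every item appears exactly once
print(collapse_adjacent(sort_lines(unsorted_text)).split())
```

<p dir="auto">The first call leaves eight lines, the second only five, which matches the two result blocks shown above.</p>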
<hr />
<p dir="auto">So, against your file, I tried <strong>other</strong> regexes, which involve a <strong>look-ahead</strong> and which do <strong>not</strong> break the file’s <strong>order</strong> :</p>
<ul>
<li>
<p dir="auto"><strong><code>(?-s)(?:^(.{15}).*\R)(?=.*\1)</code></strong> and <strong><code>EMPTY</code></strong> replacement =&gt; The <strong>Count</strong> process gives us <strong><code>1022</code></strong> occurrences, which is <strong>wrong</strong></p>
</li>
<li>
<p dir="auto"><strong><code>(?-s)(?:^(.{15}).*\R)(?=(?:.+\R)*\1)</code></strong> =&gt; <strong>Catastrophic</strong> break-down, with only <strong>one</strong> match ( the <strong>entire</strong> file contents ! )</p>
</li>
</ul>
<p dir="auto">So, what about this <strong>generic</strong> one :</p>
<ul>
<li><strong><code>(?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,n}\1)</code></strong></li>
</ul>
<p dir="auto">Well, but, for instance, <strong><code>4</code></strong> lines, in your file, <strong>begin</strong> with the string <strong>01,12,1215,1012</strong> ( Lines <strong><code>209988</code></strong>, <strong><code>208996</code></strong>, <strong><code>210928</code></strong> and <strong><code>210936</code></strong> ). And the <strong>gap</strong> between lines <strong><code>208996</code></strong> and <strong><code>210928</code></strong>, for instance, is <strong><code>1932</code></strong> !</p>
<p dir="auto">So, the number <strong><code>n</code></strong>, in that regex, should be, at least <strong><code>2000</code></strong>, hence the regex :</p>
<p dir="auto"><strong><code>(?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,2000}\1)</code></strong> =&gt; Again, a <strong>catastrophic</strong> break-down occurred :-((</p>
<p dir="auto">Obviously, it is <strong>not</strong> worth going on, in that direction ! So, <strong>Scott</strong>, I presume that when a <strong>large amount</strong> of lines and/or <strong>large capturing</strong> group areas are involved, we’re likely to get <strong>unpredictable</strong> results ! We have to change our approach altogether :-D</p>
<hr />
<p dir="auto">Finally, I found out the <strong>right</strong> way to get the job done :-)))</p>
<p dir="auto">Note that, in <strong>Doug</strong>’s file, the <strong>key</strong> string is the <strong>first 15</strong> characters of a line. So, here is the procedure :</p>
<ul>
<li>First, we get rid of possible <strong>blank</strong> lines, with the <strong>regex</strong> S/R, below :</li>
</ul>
<p dir="auto">SEARCH <strong><code>^\h*\R</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<p dir="auto">=&gt; <strong>2</strong> lines should be <strong>deleted</strong></p>
<ul>
<li>Then, we add a <strong>blank</strong> area, <strong>six</strong> characters long, right <strong>after</strong> the <strong><code>15th</code></strong> character, with the <strong>regex</strong> S/R, below ( <strong><code>~23s</code></strong> ) :</li>
</ul>
<p dir="auto">SEARCH <strong><code>(?-s)^.{15}\K</code></strong></p>
<p dir="auto">REPLACE <strong><code>\x20\x20\x20\x20\x20\x20</code></strong></p>
<ul>
<li>
<p dir="auto">Place the <strong>caret</strong> at column <strong><code>19</code></strong> of the line <strong><code>1</code></strong> (IMPORTANT )</p>
</li>
<li>
<p dir="auto">Now, choose the menu command <strong>Edit &gt; Column Editor…</strong> ( <strong><code>Alt + C</code></strong> )</p>
</li>
<li>
<p dir="auto">Select the <strong>Number to Insert</strong> option</p>
</li>
<li>
<p dir="auto">Type in <strong><code>1</code></strong>, as <strong>Initial</strong> number</p>
</li>
<li>
<p dir="auto">Type in <strong><code>1</code></strong>, in the <strong>Increase by :</strong> option</p>
</li>
<li>
<p dir="auto">Verify that the chosen <strong>format</strong> is <strong>Dec</strong></p>
</li>
<li>
<p dir="auto">Check the <strong>Leading zeros</strong> option ( IMPORTANT )</p>
</li>
<li>
<p dir="auto">Finally, click on the <strong>OK</strong> button ( <strong><code>~36s</code></strong> )</p>
</li>
</ul>
<p dir="auto">=&gt; After a while, you should get a <strong>six-digit</strong> column, at position <strong><code>19</code></strong>, from <strong><code>000001</code></strong> to <strong><code>460726</code></strong></p>
<ul>
<li>
<p dir="auto">Delete the last line <strong><code>460726</code></strong> and the <strong>EOL</strong> characters of the line <strong><code>460725</code></strong></p>
</li>
<li>
<p dir="auto"><strong>Sort</strong> the file contents, choosing the menu command <strong>Edit &gt; Line Operations &gt; Sort Lines Lexicographically Ascending</strong> ( <strong><code>~17s</code></strong> )</p>
</li>
<li>
<p dir="auto">Now, perform the <strong>main</strong> regex, below, clicking on the <strong>Replace All</strong> button, exclusively ( <strong><code>~1m 11s</code></strong> )</p>
</li>
</ul>
<p dir="auto">SEARCH <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<p dir="auto">=&gt; <strong>29363</strong> replacements occurred and the file, from now on, contains <strong><code>430346</code></strong> lines, only !</p>
<ul>
<li>Then, move the <strong>middle column</strong> number, at <strong>beginning</strong> of <strong>each</strong> line, with the <strong>regex</strong> S/R, below ( <strong><code>~31s</code></strong> ) :</li>
</ul>
<p dir="auto">SEARCH <strong><code>(?-s)^(.+?)\x20+(\d+)</code></strong></p>
<p dir="auto">REPLACE <strong><code>\2\x20\x20\x20\1</code></strong></p>
<ul>
<li>
<p dir="auto">Now, execute a <strong>last sort</strong> operation <strong>Edit &gt; Line Operations &gt; Sort Lines Lexicographically Ascending</strong> ( <strong><code>~15s</code></strong> )</p>
</li>
<li>
<p dir="auto">Finally <strong>get rid</strong> of the line numbers, at <strong>beginning</strong> of lines, along with the <strong>space</strong> characters, with the <strong>regex</strong> S/R, below ( <strong><code>~15s</code></strong> ) :</p>
</li>
</ul>
<p dir="auto">SEARCH <strong><code>^\d+\x20+</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<p dir="auto">Et voilà !</p>
<p dir="auto">To <strong>sum up</strong>, this procedure, while going <strong>downwards</strong>, throughout the file contents, keeps, only, each <strong>first</strong> occurrence of any <strong>duplicate</strong> line, with <strong>identical first 15</strong> characters, as well as any <strong>single</strong> line, of course !</p>
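<p dir="auto">For comparison, the same end result (the first line for each key, in file order) can be sketched in a few lines of Python. This is not the Notepad++ procedure above but a swapped-in technique, a set of already-seen keys; the function name and the <code>key_length</code> parameter are illustrative only. Note that lines shorter than the key length are compared as a whole.</p>

```python
def keep_first_per_key(lines, key_length=15):
    """Yield each line whose first key_length characters were not seen before."""
    seen = set()
    for line in lines:
        key = line[:key_length]
        if key not in seen:
            seen.add(key)
            yield line


# Short made-up sample with a 5-character key
sample = ["pqrstSmith", "pqrstJones", "uvwxyBrown",
          "fghijWilson", "pqrstDavies", "fghijClarke"]
print(list(keep_first_per_key(sample, key_length=5)))
# prints ['pqrstSmith', 'uvwxyBrown', 'fghijWilson']
```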
<p dir="auto">Cheers,</p>
<p dir="auto">guy038</p>
<p dir="auto">P.S. :</p>
<p dir="auto">To <strong>easily</strong> understand my <strong>method</strong>, above, <strong>copy/paste</strong> this <strong>short initial</strong> text, below, in a <strong>new</strong> tab :</p>
<pre><code class="language-diff">pqrstSmith
pqrstJones
pqrstTaylor
uvwxyBrown
fghijWilliams
fghijWilson
fghijJohnson
pqrstDavies
pqrstRobinson
pqrstWright
abcdeThompson
abcdeEvans
abcdeWalker
abcdeWhite
fghijRoberts
klmnoGreen
klmnoHall
klmnoWood
fghijJackson
fghijClarke
</code></pre>
<p dir="auto">Then, I decided that <strong>two</strong> lines are <strong>identical</strong> if their <strong>first 5</strong> characters are <strong>equal</strong> !</p>
<p dir="auto">So, we add, first, a <strong>six blank</strong> characters column, <strong>after</strong> the <strong><code>5th</code></strong> character of <strong>each</strong> line</p>
<p dir="auto">SEARCH <strong><code>(?-s)^.{5}\K</code></strong></p>
<p dir="auto">REPLACE <strong><code>\x20\x20\x20\x20\x20\x20</code></strong></p>
<pre><code class="language-diff">pqrst      Smith
pqrst      Jones
pqrst      Taylor
uvwxy      Brown
fghij      Williams
fghij      Wilson
fghij      Johnson
pqrst      Davies
pqrst      Robinson
pqrst      Wright
abcde      Thompson
abcde      Evans
abcde      Walker
abcde      White
fghij      Roberts
klmno      Green
klmno      Hall
klmno      Wood
fghij      Jackson
fghij      Clarke
</code></pre>
<p dir="auto">After the <strong>Column Editor</strong> operation, at column <strong><code>9</code></strong>, we get :</p>
<pre><code class="language-diff">pqrst   01   Smith
pqrst   02   Jones
pqrst   03   Taylor
uvwxy   04   Brown
fghij   05   Williams
fghij   06   Wilson
fghij   07   Johnson
pqrst   08   Davies
pqrst   09   Robinson
pqrst   10   Wright
abcde   11   Thompson
abcde   12   Evans
abcde   13   Walker
abcde   14   White
fghij   15   Roberts
klmno   16   Green
klmno   17   Hall
klmno   18   Wood
fghij   19   Jackson
fghij   20   Clarke
</code></pre>
<p dir="auto">And, after an ascending <strong>sort</strong> :</p>
<pre><code class="language-diff">abcde   11   Thompson
abcde   12   Evans
abcde   13   Walker
abcde   14   White
fghij   05   Williams
fghij   06   Wilson
fghij   07   Johnson
fghij   15   Roberts
fghij   19   Jackson
fghij   20   Clarke
klmno   16   Green
klmno   17   Hall
klmno   18   Wood
pqrst   01   Smith
pqrst   02   Jones
pqrst   03   Taylor
pqrst   08   Davies
pqrst   09   Robinson
pqrst   10   Wright
uvwxy   04   Brown
</code></pre>
<p dir="auto">Then, the <strong>regex</strong> S/R, where we just replace the number <strong><code>15</code></strong> with <strong><code>5</code></strong>, <strong>suppresses</strong> all the <strong>duplicate</strong> lines, but <strong>one</strong> :</p>
<p dir="auto">SEARCH <strong><code>(?-s)(.{5}).*\R\K(?:\1.*\R)+</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<p dir="auto">And gives us :</p>
<pre><code class="language-diff">abcde   11   Thompson
fghij   05   Williams
klmno   16   Green
pqrst   01   Smith
uvwxy   04   Brown
</code></pre>
<p dir="auto">Now, the following <strong>regex</strong> S/R <strong>swaps</strong> the <strong>line numbers</strong> area and the <strong>key</strong> area ( the <strong>first 5</strong> characters )</p>
<p dir="auto">SEARCH <strong><code>^(?-s)(.+?)\x20+(\d+)</code></strong></p>
<p dir="auto">REPLACE <strong><code>\2\x20\x20\x20\1</code></strong></p>
<pre><code class="language-diff">11   abcde   Thompson
05   fghij   Williams
16   klmno   Green
01   pqrst   Smith
04   uvwxy   Brown
</code></pre>
<p dir="auto">And, after a <strong>last</strong> ascending <strong>sort</strong> operation, we get :</p>
<pre><code class="language-diff">01   pqrst   Smith
04   uvwxy   Brown
05   fghij   Williams
11   abcde   Thompson
16   klmno   Green
</code></pre>
<p dir="auto">Finally, a last <strong>regex</strong> S/R, below, <strong>deletes</strong> all the <strong>line numbers</strong>, at <strong>beginning</strong> of lines :</p>
<p dir="auto">SEARCH <strong><code>^\d+\x20+</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<pre><code class="language-diff">pqrst   Smith
uvwxy   Brown
fghij   Williams
abcde   Thompson
klmno   Green
</code></pre>
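<p dir="auto">The whole walk-through (number the lines, sort, keep the first line of each block, sort back by number, drop the numbers) can be mirrored in Python. This is a minimal sketch of the same idea, not what Notepad++ does internally, and the function name is hypothetical. Unlike the text shown above, it returns the original lines without the three extra spaces after the key.</p>

```python
from itertools import groupby


def dedupe_keep_order(lines, key_length=5):
    """Number, sort by key, keep the first line per key, then restore file order."""
    # 1. Number every line (the Column Editor step)
    numbered = list(enumerate(lines, start=1))
    # 2. Sort by key, then by line number (the first lexicographic sort)
    numbered.sort(key=lambda pair: (pair[1][:key_length], pair[0]))
    # 3. Keep only the first line of each block sharing a key (the main regex)
    firsts = [next(block) for _, block in
              groupby(numbered, key=lambda pair: pair[1][:key_length])]
    # 4. Sort by line number again to restore file order (the last sort)
    firsts.sort(key=lambda pair: pair[0])
    # 5. Drop the numbers (the final regex)
    return [line for _, line in firsts]


sample_lines = [
    "pqrstSmith", "pqrstJones", "pqrstTaylor", "uvwxyBrown",
    "fghijWilliams", "fghijWilson", "fghijJohnson",
    "pqrstDavies", "pqrstRobinson", "pqrstWright",
    "abcdeThompson", "abcdeEvans", "abcdeWalker", "abcdeWhite",
    "fghijRoberts", "klmnoGreen", "klmnoHall", "klmnoWood",
    "fghijJackson", "fghijClarke",
]
print(dedupe_keep_order(sample_lines))
# prints ['pqrstSmith', 'uvwxyBrown', 'fghijWilliams', 'abcdeThompson', 'klmnoGreen']
```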
]]></description><link>https://community.notepad-plus-plus.org/post/27916</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27916</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Tue, 07 Nov 2017 22:10:54 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 06 Nov 2017 23:17:20 GMT]]></title><description><![CDATA[<p dir="auto">Thank you for the reply.</p>
<p dir="auto">The file is not intended to be sorted. Sorting the file would corrupt the necessary order of the data.</p>
<p dir="auto">Nevertheless, lines that duplicate the first 15 characters of the preceding line must be eliminated.</p>
<p dir="auto">Thank you again,</p>
<p dir="auto">Douglas</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27907</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27907</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Mon, 06 Nov 2017 23:17:20 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 06 Nov 2017 21:30:05 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a></p>
<p dir="auto">I guess I’m confused.  The OP asked for “a line is deleted if the first 15 characters of a line match the first 15 characters of the preceding line”–doesn’t that preclude doing a sort, because it removes the impact of the “preceding line” part?  Well, no matter…if it helps the OP out that is all that matters.  However, I’m still confused as to why the application of the regexes on the unsorted file caused the entire file to be selected for the OP and to be redmarked for me–can you help with the explanation of that?</p>
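<p dir="auto">To make the “preceding line” reading concrete, here is a small Python sketch that never sorts: it keeps the first line of every run of <em>consecutive</em> lines sharing the same first 15 characters. The function name is illustrative and the records are made up in the style of the data file.</p>

```python
from itertools import groupby


def drop_repeats_of_previous(lines, key_length=15):
    """Keep the first line of every run of consecutive lines sharing a key."""
    return [next(run) for _, run in
            groupby(lines, key=lambda line: line[:key_length])]


# Made-up records; the key is the first 15 characters, e.g. 08,28,1212,3959
records = [
    "08,28,1212,3959,0.458",
    "08,28,1212,3959,0.504",
    "08,28,1212,1000,0.492",
    "08,28,1212,3959,0.365",
]
for record in drop_repeats_of_previous(records):
    print(record)
```

<p dir="auto">Only the second record is dropped here. The last one is kept, although its key appeared earlier, because the line just before it has a different key.</p>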
]]></description><link>https://community.notepad-plus-plus.org/post/27903</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27903</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Mon, 06 Nov 2017 21:30:05 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 06 Nov 2017 20:48:55 GMT]]></title><description><![CDATA[<p dir="auto">Hi, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a>, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/374">@scott-sumner</a> and <strong>All</strong>,</p>
<p dir="auto">I understood what happened :-)) Just have a look, for instance, at lines <strong><code>27</code></strong> and <strong><code>28</code></strong> of your <strong>DATA.txt</strong> file, as described below :</p>
<pre><code class="language-diff">08,28,1212,3959,0.458,0.458,0.504,0.492,0
08,28,1212,1000,0.492,0.364,0.495,0.365,0
</code></pre>
<p dir="auto">Obviously, these <strong>two</strong> lines are <strong>NOT</strong> sorted as string “3959” is <strong>greater</strong> than string “1000” !</p>
<p dir="auto">So, I simply performed, <strong>FIRST</strong>, the classical <strong>sort</strong> : <strong>Edit &gt; Line Operations &gt; Sort Lines Lexicographically Ascending</strong> ( <strong><code>12s</code></strong> )</p>
<p dir="auto">And then…, everything went <strong>fine</strong> :-D I tried my <strong>two</strong> regexes, which, <strong>both</strong>, worked, as expected !</p>
<p dir="auto">BTW, My <strong>second</strong> regex, with the <strong><code>\K</code></strong> syntax is <strong>slightly quicker</strong> than the <strong>first</strong> one ! On my laptop, I got <strong><code>49s</code></strong>  instead of <strong><code>53s</code></strong>  :-)</p>
<hr />
<p dir="auto">Some <strong>statistics</strong> about your <strong>DATA.txt</strong> file and about the <strong>regex</strong> S/R :</p>
<pre><code class="language-diff">        0 line  with 4 DUPLICATES or MORE  =&gt;        0 REPLACEMENT   and       0 line DELETED

      489 lines with 3 DUPLICATES          =&gt;      489 REPLACEMENTS  and   1,467 lines DELETED  ( 3 x 489 )

       38 lines with 2 DUPLICATES          =&gt;       38 REPLACEMENTS  and      76 lines DELETED  ( 2 x  38 )

   28,836 lines with 1 DUPLICATE           =&gt;   28,836 REPLACEMENTS  and  28,836 lines DELETED

                                   TOTAL   :    29,363 REPLACEMENTS  and  30,379 lines DELETED

    ORIGINAL file NUMBER of lines      460,725
	
	After SUPPRESSION of DUPLICATES  : 430,346

                       Difference  :    30,379   
</code></pre>
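<p dir="auto">The totals in this table can be re-derived from the three non-empty rows. A quick check in Python, using only the figures quoted above (the variable names are illustrative):</p>

```python
# (number of kept lines in this category, duplicates deleted per kept line)
categories = [(489, 3), (38, 2), (28836, 1)]

# One replacement per block of duplicates
replacements = sum(count for count, _ in categories)
# Every duplicate of a kept line is one deleted line
deleted = sum(count * duplicates for count, duplicates in categories)
# Original number of lines minus the deleted ones
remaining = 460725 - deleted

print(replacements, deleted, remaining)
# prints 29363 30379 430346
```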
<p dir="auto">Cheers,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27900</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27900</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Mon, 06 Nov 2017 20:48:55 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 06 Nov 2017 19:33:17 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a> said:</p>
<blockquote>
<p dir="auto"><a href="https://mangoguy.sharefile.com/d-s55a1b6f522c41eea" rel="nofollow ugc">https://mangoguy.sharefile.com/d-s55a1b6f522c41eea</a></p>
</blockquote>
<p dir="auto">FWIW, I can confirm the findings of the entire file being selected (I used [red]marking) with either of <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a>’s regexes.  However, chopping up the file into 100,000 line subfiles allowed both regexes to work fine.  Note that the only duplicates were found in the LAST subfile chunk–which had only ~61,000 lines, not 100,000.</p>
<p dir="auto">I don’t have an explanation–“big” data can cause “big” problems–the definition of “big” being one thing where YMMV…</p>
<p dir="auto">In such a case I myself would turn to a scrap bit of Python to find these duplicate records:</p>
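<pre><code>prev = ''  # first 15 characters of the previous line
with open('data.txt') as f:
    # count from 1 so that n is the line number shown in the editor
    for n, line in enumerate(f, start=1):
        key = line[:15]
        if key == prev:
            print(n)  # this line repeats the key of the line before it
        prev = key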
</code></pre>
<p dir="auto">That doesn’t address the issue, but gets the job done.</p>
<h3>:(</h3>
]]></description><link>https://community.notepad-plus-plus.org/post/27893</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27893</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Mon, 06 Nov 2017 19:33:17 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Mon, 06 Nov 2017 18:38:35 GMT]]></title><description><![CDATA[<p dir="auto">The data file should be accessible here:</p>
<p dir="auto"><a href="https://mangoguy.sharefile.com/d-s55a1b6f522c41eea" rel="nofollow ugc">https://mangoguy.sharefile.com/d-s55a1b6f522c41eea</a></p>
<p dir="auto">Thank you,<br />
Douglas</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27892</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27892</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Mon, 06 Nov 2017 18:38:35 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Sat, 04 Nov 2017 12:16:32 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a> aka “Doug” :</p>
<p dir="auto">If your data is not sensitive and you could post it somewhere (example, <a href="http://textuploader.com" rel="nofollow ugc">textuploader.com</a> although not sure of its limits on size…), someone will try to duplicate your findings.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27836</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27836</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Sat, 04 Nov 2017 12:16:32 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Sat, 04 Nov 2017 03:28:15 GMT]]></title><description><![CDATA[<p dir="auto">Thank you.</p>
<p dir="auto">Stopping the periodic backup did not resolve the issue.</p>
<p dir="auto">It appears that the size of the file is not the problem but rather how many lines are searched between the initial location of the cursor and the first match.</p>
<p dir="auto">About 100,000 lines appears to be the limit before the bizarre behavior described recurs.</p>
<p dir="auto">Maybe a buffer - memory limitation of Notepad++ ? Just guessing. I can work around this limitation.</p>
<p dir="auto">I am very thankful and impressed with how well your regex syntax cleaned this data in Notepad++.</p>
<p dir="auto">Thank you again!</p>
<p dir="auto">Doug</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27833</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27833</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Sat, 04 Nov 2017 03:28:15 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Fri, 03 Nov 2017 17:55:40 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a>,</p>
<p dir="auto"><strong>Doug</strong>, maybe you’re using the <strong>periodic backup</strong> functionality ? If so, it would be sensible to <strong>stop</strong> it, unticking the option <strong>Settings &gt; Preferences… &gt; Backup &gt; Enable session snapshot and periodic backup</strong>, while performing the <strong>regex</strong> S/R on <strong>huge</strong> files !</p>
<p dir="auto">Secondly, the <strong>second</strong> regex syntax <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong>, of my <strong>previous</strong> post, may produce <strong>better</strong> results :-)</p>
<p dir="auto">Cheers,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27822</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27822</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Fri, 03 Nov 2017 17:55:40 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Fri, 03 Nov 2017 15:59:38 GMT]]></title><description><![CDATA[<p dir="auto">Works great! Thank you!</p>
<p dir="auto">However, bizarre behavior was noted with a file longer than 143,353 lines. The entire file was selected, 1 replacement was executed, and 15 characters remained in the file.</p>
<p dir="auto">Is this problem fixable?</p>
<p dir="auto">Thank you,<br />
Doug</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27817</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27817</guid><dc:creator><![CDATA[mangoguy]]></dc:creator><pubDate>Fri, 03 Nov 2017 15:59:38 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Fri, 03 Nov 2017 11:21:33 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a></p>
<p dir="auto">Probably worth pointing out (again) that if you use the regex with the <code>\K</code> syntax, interactive replaces don’t work correctly–you have to use <strong>Replace All</strong></p>
]]></description><link>https://community.notepad-plus-plus.org/post/27814</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27814</guid><dc:creator><![CDATA[Scott Sumner]]></dc:creator><pubDate>Fri, 03 Nov 2017 11:21:33 GMT</pubDate></item><item><title><![CDATA[Reply to Deleting lines that repeat the first 15 characters on Fri, 03 Nov 2017 17:46:16 GMT]]></title><description><![CDATA[<p dir="auto">Hello <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/10281">@mangoguy</a> and <strong>All</strong>,</p>
<p dir="auto">No problem ! You may use the following <strong>regex</strong> S/R :</p>
<p dir="auto">SEARCH <strong><code>(?-s)((.{15}).*\R)(?:\2.*\R)+</code></strong></p>
<p dir="auto">REPLACE <strong><code>\1</code></strong></p>
<p dir="auto">So, let’s imagine the <strong>sorted</strong> example text, below :</p>
<pre><code class="language-diff">--------------- a test
--------------- is
--------------- just
--------------- this
000001111122222 qrstu
12345 abcde
12345 abcde
123456789012345
123456789012345 abcde
123456789012345 fghij
123456789012345 klmop
123456789012345 qrstu
99999 abcde
abcde 12345
abcde 12345
abcdefghijklmno 01
abcdefghijklmno 11111
abcdefghijklmno 22222
abcdefghijklmno 33333
abcdefghijklmno 56789
end of the test
</code></pre>
<p dir="auto">After performing the <strong>regex</strong>, above, you should obtain :</p>
<pre><code class="language-diff">--------------- a test
000001111122222 qrstu
12345 abcde
12345 abcde
123456789012345
99999 abcde
abcde 12345
abcde 12345
abcdefghijklmno 01
end of the test
</code></pre>
<hr />
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">The <strong><code>(?-s)</code></strong> in-line <strong>modifier</strong> ensures that the special regex <strong>dot</strong> character will match <strong>standard</strong> characters, only, even if you, <strong>previously</strong>, ticked the <strong>. matches newline</strong> option !</p>
</li>
<li>
<p dir="auto">The part <strong><code>((.{15}).*\R)</code></strong> represents the first line of the current <strong>matched</strong> block of <strong>identical</strong> lines, stored as <strong>group 1</strong></p>
<ul>
<li>
<p dir="auto">The subpattern <strong><code>.{15}</code></strong> stands for the <strong>first fifteen</strong> characters of the line, stored as <strong>group 2</strong></p>
</li>
<li>
<p dir="auto">The part <strong><code>.*\R</code></strong> looks for the rest of the line, possibly <strong>empty</strong>, followed by its <strong>End of Line</strong> character(s)</p>
</li>
</ul>
</li>
<li>
<p dir="auto">Finally, the part <strong><code>(?:\2.*\R)+</code></strong> is a <strong>repeated non-capturing</strong> group of :</p>
<ul>
<li>
<p dir="auto"><strong><code>\2</code></strong>, which represents the <strong>first fifteen</strong> characters, of the <strong>first</strong> line</p>
</li>
<li>
<p dir="auto">followed by <strong>any</strong> range of character(s) and the <strong>EOL</strong> character(s) of the <strong>subsequent</strong> lines</p>
</li>
</ul>
</li>
<li>
<p dir="auto">In <strong>Replacement</strong> , the <strong>group 1</strong> ( <strong><code>\1</code></strong>) , <strong>first</strong> line of the block, is rewritten, <strong>only</strong></p>
</li>
<li>
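<p dir="auto">Note that the <strong>special</strong> symbol <strong><code>^</code></strong>, after <strong><code>(?-s)</code></strong>, is <strong>not</strong> needed for the back-reference, as <strong><code>\2</code></strong> can only be tried right after the <strong>EOL</strong> characters ( <strong><code>\R</code></strong> ) of the <strong>previous</strong> line of the block. However, without <strong><code>^</code></strong>, <strong>group 2</strong> itself may start in the <strong>middle</strong> of the first line, so adding <strong><code>^</code></strong> is the safer choice !</p>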
</li>
</ul>
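<p dir="auto">The notes above can be tried out with Python's <code>re</code> module. This is only an approximation of the Notepad++ regex: it assumes <code>\n</code> line endings in place of <code>\R</code>, relies on Python's default dot behaviour in place of <code>(?-s)</code>, and adds an explicit <code>^</code> anchor so that a block can only start at the beginning of a line. The sample is a shortened version of the example text.</p>

```python
import re

# Python counterpart of (?-s)((.{15}).*\R)(?:\2.*\R)+ with an explicit ^ anchor
BLOCK = re.compile(r"^((.{15}).*\n)(?:\2.*\n)+", re.MULTILINE)


def keep_first_of_block(text):
    """Rewrite each block of lines sharing their first 15 characters as its first line."""
    return BLOCK.sub(r"\1", text)


sample = (
    "--------------- a test\n"
    "--------------- is\n"
    "000001111122222 qrstu\n"
    "12345 abcde\n"
    "12345 abcde\n"
    "123456789012345\n"
    "123456789012345 abcde\n"
    "end of the test\n"
)
print(keep_first_of_block(sample), end="")
```

<p dir="auto">The two <code>12345 abcde</code> lines both survive, as in the expected result above, because they are shorter than 15 characters and <code>.{15}</code> can never match inside them.</p>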
<hr />
<p dir="auto"><strong>Another</strong> syntax could be :</p>
<p dir="auto">SEARCH <strong><code>(?-s)(.{15}).*\R\K(?:\1.*\R)+</code></strong></p>
<p dir="auto">REPLACE <strong><code>EMPTY</code></strong></p>
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">This time, after matching the <strong>first</strong> line of a block of <strong>identical lines</strong> ( regarding the <strong>first 15</strong> characters, only ! ), the <strong><code>\K</code></strong> syntax <strong>resets</strong> the regex engine <strong>position</strong> and everything <strong>already</strong> matched is <strong>forgotten</strong> !</p>
</li>
<li>
<p dir="auto">So, the <strong>final</strong> match is the range of  all the <strong>duplicate</strong> lines, <strong>AFTER</strong> the <strong>first</strong> one, which are, simply, <strong>deleted</strong>, because of the <strong>empty</strong> replacement zone :-)</p>
</li>
<li>
<p dir="auto">IMPORTANT : see <strong>Scott</strong>’s advice, in the <strong>next</strong> post !</p>
</li>
</ul>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/27811</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/27811</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Fri, 03 Nov 2017 17:46:16 GMT</pubDate></item></channel></rss>