Delete near duplicate lines



  • And one last piece of advice for your actual problem: instead of delving deep into the intricacies of regex to get the non-adjacent case to work, I would recommend using Edit > Column Editor to auto-number the beginning of each line, then sorting by the quoted-string section, removing the adjacent near-duplicates as I’d shown already, re-sorting by the first column, and finally deleting that initial column. It’s often less work to use a multi-step process where each step makes sense to you than to try to craft the perfect one-step solution.



  • Thanks for all the advice. I understand all of your points; I thought my example was enough for you to understand. Sorry about that.

    Since this is something very new to me, and sometimes very difficult to understand, I think I should focus on near-duplicates next to each other only.

    Your first find/replace advice works perfectly on the lines of my example.
    Since the real lines are different, and your advice was meant for something different, would you mind sharing how I can get the same result on the lines below?

    In the two pairs of near-duplicate lines, your first code deletes 3 lines rather than 2.
    This is the text inside the first quotes:

    • The North Face - Triple C - Parka nero
    • The North Face - Triple C - Parka nero
    • The North Face - Triple C - Parka verde
    • The North Face - Triple C - Parka verde
    https://www.awin1.com/pclick.php?p=27555807329&a=357849&m=9606,"The North Face - Triple C - Parka nero",27555807329,10645488,https://images.asos-media.com/products/the-north-face-triple-c-parka-nero/21562504-1-tnfblack?$XXL$,"Parka di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo foderato in pile Chiusura con zip a due direzioni Tasche laterali Logo ricamato sul davanti e sul retro Tasche laterali Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-nero%2F21562504-1-tnfblack%3F%24XXL%24&feedId=36675&k=94e5936958a0ebc83cd2a30f382350b5f649d4fc,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-nero/prd/21562504?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27555807331&a=357849&m=9606,"The North Face - Triple C - Parka nero",27555807331,10645490,https://images.asos-media.com/products/the-north-face-triple-c-parka-nero/21562504-1-tnfblack?$XXL$,"Parka di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo foderato in pile Chiusura con zip a due direzioni Tasche laterali Logo ricamato sul davanti e sul retro Tasche laterali Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-nero%2F21562504-1-tnfblack%3F%24XXL%24&feedId=36675&k=94e5936958a0ebc83cd2a30f382350b5f649d4fc,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-nero/prd/21562504?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27583651913&a=357849&m=9606,"The North Face - Triple C - Parka verde",27583651913,10645480,https://images.asos-media.com/products/the-north-face-triple-c-parka-verde/21562497-1-newtaupegreen?$XXL$,"Giacca di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo alto Chiusura con zip a due direzioni Tasca interna per dispositivi elettronici Tasche laterali Bordo con elastico Fondo asimmetrico più lungo sul retro Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-verde%2F21562497-1-newtaupegreen%3F%24XXL%24&feedId=36675&k=47ee36d903c628a7bd5bb0184718e1175767df4b,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-verde/prd/21562497?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27583651915&a=357849&m=9606,"The North Face - Triple C - Parka verde",27583651915,10645481,https://images.asos-media.com/products/the-north-face-triple-c-parka-verde/21562497-1-newtaupegreen?$XXL$,"Giacca di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo alto Chiusura con zip a due direzioni Tasca interna per dispositivi elettronici Tasche laterali Bordo con elastico Fondo asimmetrico più lungo sul retro Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-verde%2F21562497-1-newtaupegreen%3F%24XXL%24&feedId=36675&k=47ee36d903c628a7bd5bb0184718e1175767df4b,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-verde/prd/21562497?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    

    Thank you very much.



  • @fred-zept said in Delete near duplicate lines:

    Since this is something very new to me, and sometimes very difficult to understand,

    I can infer from your “english” that perhaps it isn’t your primary language so I can excuse your choice of words, hence the issues we had in understanding exactly your request. It’s good that you did eventually provide examples within the black box, as we need that to better help you. You will notice that in your original examples the " characters had been altered, which is why we need to see examples within the black box.

    @PeterJones had the answer which I think is the best way for you to proceed. By breaking it down in a number of steps it becomes more manageable and almost certainly easier to understand.

    So by adding line numbering we can store the original position in the file of each line even if we sort them in a different order. But of course to sort in a different order we need the primary sort key which is currently in the middle of each line.
    So the steps (as I see it) are:

    1. Add line numbering:
      To do this you need to place the cursor in the very first position on the first line, right against the left margin. Then using the column editor we add a delimiter, I used the @@ as “text to insert”. If completed correctly there will be the @@ characters at the start of EVERY line. Next we leave the cursor in that very first position and using column editor again we use it to insert numbers, starting with 1, increasing by 1 and with “leading zeroes” ticked. So now every line has a number (with leading zeroes) followed by @@ and then the original line.

    2. Add primary sort key to front of line:
      This is the string you use to define when a duplicate appears and thus the line should be removed. Using the “Replace” function (with search mode set as regular expression)
      Find What:(?-s)^(\d+@@https[^"]+)("[^"]+")
      Replace With:\2\1\2
      Hit Replace All. Now every line (excluding blank lines) starts with the string you need to sort on and will put ALL duplicates together on consecutive lines. I included the https portion so that the blank lines would not get any string added. All 4 example lines you provided have this and as long as that is correct for the remainder of the file these steps will work.

    3. Sort lines:
      Sort using “sort lines lexicographically descending” (reverses the lines). This will place a “duplicate” line before the “original” line; the reason becomes clear in the next step.

    4. Remove Duplicate Lines:
      Again using the “Replace” function.
      Find What:(?-s)^("[^"]+").+\R(?=\1)
      Replace With: empty field here
      By sorting lines descending (step #3), effectively the line numbers are in reverse and we can easily remove the first line of a duplicate “pair”. The first line will have a higher line number than the second line of the pair AND the “pairs” will ALWAYS be on consecutive lines. This WILL work even if 3 (or more) lines are duplicates.

    5. Remove the primary sort key.
      “Replace” Function again.
      Find What:(?-s)^.+?(?=\d+@@)
      Replace With: empty field here

    6. Re-sort to put lines back in original order.
      Sort Lines as Integers Ascending puts remaining lines back in original order

    7. Remove the line numbers and delimiter.
      “Replace” function again.
      Find What:(?-s)^\d+@@
      Replace With: empty field here

    At this point you should be back to the original look, with “duplicate” lines removed.
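
    For readers who want to see the seven steps in one place, here is a minimal Python sketch that emulates them. It is an illustration, not part of the Notepad++ procedure: it assumes every line starts with https and contains a quoted string, assumes there are no blank lines, and writes \R as \n because Python's re module does not support \R. The function name dedupe_near_duplicates is made up for this example.

```python
import re

def dedupe_near_duplicates(lines):
    """Keep the earliest line for each first quoted string, in original order."""
    if not lines:
        return []
    width = len(str(len(lines)))
    # Step 1: prefix each line with a zero-padded number and the @@ delimiter.
    numbered = [f"{i:0{width}d}@@{line}" for i, line in enumerate(lines, start=1)]
    # Step 2: copy the first quoted string to the front as the primary sort key.
    keyed = [re.sub(r'^(\d+@@https[^"]+)("[^"]+")', r'\2\1\2', s) for s in numbered]
    # Step 3: sort descending so that, within a key, later lines come first.
    keyed.sort(reverse=True)
    # Step 4: delete a line whenever the next line starts with the same key.
    text = "\n".join(keyed) + "\n"
    text = re.sub(r'^("[^"]+").+\n(?=\1)', "", text, flags=re.M)
    # Step 5: remove the sort key that was added in step 2.
    text = re.sub(r'^.+?(?=\d+@@)', "", text, flags=re.M)
    # Step 6: sort by the leading integer to restore the original order.
    rows = sorted(text.splitlines(), key=lambda s: int(s.split("@@", 1)[0]))
    # Step 7: remove the line numbers and the delimiter.
    return [re.sub(r'^\d+@@', "", s) for s in rows]
```

    Because the key is compared before the line number, duplicates end up adjacent even when they were far apart in the original file, which is the point of the sort in step 3.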

    Try these steps, hopefully if your examples are indicative of the file all should work. If not, then you will need to explain further and please also include additional examples where it did NOT work so we may see what changes in the steps may be necessary to overcome the problems.

    Terry



  • @Terry-R said in Delete near duplicate lines:

    I can infer from your “english” that perhaps it isn’t your primary language

    Hi @Terry-R. Exactly, that’s the reason why some things are very hard for me to understand.

    I don’t know how to thank you for this, it worked perfectly.
    In future requests I’ll try to be as clear as possible.

    Thank you all for your time.



  • @fred-zept said in Delete near duplicate lines:

    your first code deletes 3 lines rather than 2.

    I know @Terry-R’s solution worked for you. But to reply to the specific complaint:

    Yes, in circumstances like the ones in your new data, where more than one CSV cell has quotes around it, my regex could pick a different quoted column as the “check for same string” cell. My regex could be changed to FIND = (?-s)^(.*?(,".*?",).*$\R)(^.*\2.*$\R*)+ , adding the ? after the first .* to make it non-greedy, so that it finds the first matching quoted string rather than grabbing the last matching quoted string. Other restrictions could be added, like not allowing , or " in the first match: (?-s)^([^,"]*?(,".*?",).*$\R)(^.*\2.*$\R*)+
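
    The effect of that extra ? can be seen in isolation with a small Python sketch. The line below is shortened, made-up data in the shape of the real rows, and only the front part of the regex is used; Python's re module is assumed to treat greedy and non-greedy quantifiers the same way as Notepad++ here. In the real data every row ends with the same quoted price cell, which would explain why the greedy version treated all four lines as duplicates.

```python
import re

# Shortened, hypothetical CSV row with several quoted cells.
LINE = 'https://example.invalid/p=1,"Parka nero",111,"Donna > Giacche",380.00,"380.00 EUR",'

def picked_cell(prefix, line):
    """Return the quoted cell captured after the given prefix pattern."""
    return re.search(prefix + r'(,".*?",)', line).group(1)

greedy = picked_cell(r'^.*', LINE)   # .* runs to the end, then backtracks
lazy = picked_cell(r'^.*?', LINE)    # .*? stops at the first quoted cell

print("greedy picks:", greedy)
print("lazy picks:  ", lazy)
```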

    Really, regex in a text editor is not the right tool for CSV; a spreadsheet program is, especially one with a macro/automation language like Excel+VBA. You can make it work in Notepad++ (as Terry showed), but it takes effort.
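
    As an illustration of the "use a CSV-aware tool" point, here is a minimal sketch that swaps in Python's standard csv module instead of Excel+VBA. It assumes the near-duplicate key is the second column (index 1) and that no field contains an embedded line break; the function name keep_first_per_title is made up for this example. Blank lines and rows that are too short pass through unchanged.

```python
import csv

def keep_first_per_title(lines, key_column=1):
    """Keep the first line seen for each value of the key column."""
    seen = set()
    kept = []
    for line in lines:
        # Parse one line at a time so the original text can be kept as-is.
        row = next(csv.reader([line]), [])
        if len(row) <= key_column:
            kept.append(line)  # blank or short line: leave it alone
            continue
        key = row[key_column]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

    Because the parser understands quoting, a comma inside a quoted title does not shift the columns, and duplicates do not need to be adjacent, so no sorting or line numbering is required.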





  • @ArkadiuszMichalski

    https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8965 << does not describe this problem?

    I don’t think having that would help.

    OP’s first posting said:

    Basically I need to keep only one line with TEXT1, one with TEXT4 etc. and delete the near duplicates.

    Note the use of “near duplicates”, not “exact duplicates”





  • @ArkadiuszMichalski said in Delete near duplicate lines:

    https://community.notepad-plus-plus.org/topic/20173/delete-near-duplicate-lines/4?_=1603208619351

    I don’t know what this means.
    I mean, I know it is a link but I don’t know what it is supposed to tell me, even looking at where that link points.
    The brevity of your posts often seems to produce something that isn’t really meaningful.
    We often get people that don’t want to bother typing a lot in their posts here (kind of an anti-PeterJones-type poster), and that’s a shame because people often have good thoughts – I guess the effort in expressing them is too difficult. Pity.



  • @PeterJones Thank you very much!

