Delete near duplicate lines



  • Hello everyone.

    I wish to delete near duplicate lines on a csv file. Lines look like.

    https://url,“TEXT0”,22991315467,036807401_157,…
    https://url,“TEXT1”,22991313777,036807401_157,…
    https://url,“TEXT1”,22991318777,036807485_157,…
    https://url,“TEXT1”,22991318777,036807444_157,…
    https://url,“TEXT3”,22991311247,036808721_157,…
    https://url,“TEXT4”,22991313777,036807401_157,…
    https://url,“TEXT4”,22991318777,036807665_157,…
    https://url,“TEXT5”,22991318777,036807857_157,…

    Basically I need to keep only one line with TEXT1, one with TEXT4, etc., and delete the near duplicates.

    Any help is much appreciated.



  • @fred-zept ,

    Is that URL always the same, or is it different line-to-line? Because someone might craft a regex that matched your example, but then expects the opposite of what you actually have, so it won’t work for you.

    Also, are those quotes real ASCII quotes, ", or are they the “smart quotes” “” that the forum shows? Because that changes the regex.

    Also, are the near-duplicates always next to each other, like you’ve shown?

    Following the advice at the end of my post, including the formatting and before/after requests, will help you clarify this question and get better answers.

    Making one possible set of assumptions about your data (different URLs, ASCII quotes, always adjacent lines), I can craft the following. If it doesn’t work for you, you’ll need to clarify things for us.

    • FIND = (?-s)^(.*(,".*?",).*$\R)(^.*\2.*$\R*)+
    • REPLACE = $1
    • SEARCH MODE = regular expression

    converts

    https://url1,"TEXT0",22991315467,036807401_157,…
    https://url2,"TEXT1",22991313777,036807401_157,…
    https://url3,"TEXT1",22991318777,036807485_157,…
    https://url4,"TEXT1",22991318777,036807444_157,…
    https://url5,"TEXT3",22991311247,036808721_157,…
    https://url6,"TEXT4",22991313777,036807401_157,…
    https://url7,"TEXT4",22991318777,036807665_157,…
    https://url8,"TEXT5",22991318777,036807857_157,…
    

    into

    https://url1,"TEXT0",22991315467,036807401_157,…
    https://url2,"TEXT1",22991313777,036807401_157,…
    https://url5,"TEXT3",22991311247,036808721_157,…
    https://url6,"TEXT4",22991313777,036807401_157,…
    https://url8,"TEXT5",22991318777,036807857_157,…
    

    For more on how each piece of the regex works, see here, which if you click on EXPLAIN in the lower half will give the details of the individual elements of the regex used; see the links below for finding more on Notepad++'s Boost regular expressions, as well as a FAQ with other regex links.
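
    For readers who would rather script this than run it in the editor, here is a minimal Python sketch of the same idea, under the same assumptions (ASCII quotes, near-duplicates always adjacent, no blank lines that matter). The function names and the `key_col` parameter are made up for illustration, and the sample rows are the example lines above with the elided trailing fields left off.

```python
import csv
from itertools import groupby

def key_of(line: str, key_col: int = 1) -> str:
    # Parse one CSV line and return the cell used for comparison
    # (column 1 is the first quoted text in the example data).
    return next(csv.reader([line]))[key_col]

def drop_adjacent_near_duplicates(lines, key_col=1):
    # Assumption: blank lines are simply dropped before comparing.
    lines = [ln for ln in lines if ln.strip()]
    kept = []
    for _, run in groupby(lines, key=lambda ln: key_of(ln, key_col)):
        kept.append(next(run))  # the first line of each run survives
    return kept

sample = [
    'https://url1,"TEXT0",22991315467,036807401_157',
    'https://url2,"TEXT1",22991313777,036807401_157',
    'https://url3,"TEXT1",22991318777,036807485_157',
    'https://url4,"TEXT1",22991318777,036807444_157',
    'https://url5,"TEXT3",22991311247,036808721_157',
    'https://url6,"TEXT4",22991313777,036807401_157',
    'https://url7,"TEXT4",22991318777,036807665_157',
    'https://url8,"TEXT5",22991318777,036807857_157',
]

for line in drop_adjacent_near_duplicates(sample):
    print(line)
```

    Because the lines are compared by a parsed CSV cell rather than by a regex match, commas inside the quoted text do not confuse the comparison. Like the regex, it only merges runs of consecutive lines.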

    ----

    Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as plain text using the </> toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get… Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.



  • Good day Peter, thanks for your reply.

    The near-duplicates are not always next to each other. Often they look like this:
    -near-duplicate-one
    -near-duplicate-one
    -line1
    -line2
    -near-duplicate-one

    Quotes are ASCII and the url is always different.

    Below is the exact structure of two lines. The text between the first pair of quotes is what matters.
    If it is a duplicate, keep the first line and delete the others.

    Thanks again.

    https://url1,“Acid Wash Ofcl Studio Oversized Jogger, Grigio”,27546972557,FZZ00679-115-30,http://i1.adis.ws/i/boohooamplience/fzz00679_charcoal_xl.jpg,"Acid Wash Ofcl Studio Oversized Jogger - Acid Wash Ofcl Studio Oversized Jogger",“Women’s Sportswear”,19.80,“Boohoo.com IT”,9577,url.jpg,EUR,url4,EUR19.80,36761,boohoo,

    https://url2,“Acid Wash Ofcl Studio Oversized Jogger, Grigio”,27546972559,FZZ00679-115-34,http://i1.adis.ws/i/boohooamplience/fzz00679_charcoal_xl.jpg,"Acid Wash Ofcl Studio Oversized Jogger - Acid Wash Ofcl Studio Oversized Jogger",“Women’s Sportswear”,19.80,“Boohoo.com IT”,9577,url.jpg,EUR,url5,EUR19.80,36761,boohoo,



  • @fred-zept said in Delete near duplicate lines:

    the near-duplicates are not always next to each other

    Then don’t make your example that way, otherwise you will mislead people and waste their time. If you want help, your goal is to make it as easy as possible for us to help you. Note, your “exact structure of two lines” still enforces the near-duplicates-are-adjacent mis-truth.

    Also, you ignored my advice of how to format the text so it would be marked as text and the forum wouldn’t mangle it, because your data still shows smart quotes, not ASCII quotes.

    An example of the intricacies of searching for duplicates (or near-duplicates) when they aren’t adjacent is in the monster discussion in this other thread. Read it, and apply the lessons learned to the regex I’ve already given you. Give it a go. If you have trouble, come back here, show us what you tried (in combining mine with the options there), and why you thought it would work.

    Once you’ve shown some more effort and followed our advice, you will likely either find the answer on your own, or be able to express yourself in a way that we don’t have to do the heavy lifting of guessing what you actually mean. Good luck.



  • And one last piece of advice for your actual problem: instead of delving deep into the intricacies of regex to get the non-adjacent to work, I might recommend using the Edit > Column Editor to auto-number the beginning of each line, then sort by the quoted-string section, remove the adjacent near-duplicates as I’d shown already, and then re-sort by the first column again, then delete the initial column. It’s often less work to do the multi-step where each step makes sense to you, rather than trying to craft the perfect one-step solution.



    Thanks for all the advice. I understand all of your points; I thought my example was enough to make the problem clear. Sorry about that.

    Since this is something very new to me, and sometimes very difficult to understand, I think I should focus on near-duplicates next to each other only.

    Your first find/replace advice works perfectly on the lines of my example.
    Since the real lines are different, and your advice was written for a different structure, would you mind sharing how I can get the same result on the lines below?

    In the two pairs of near-duplicate lines, your first code deletes 3 lines rather than 2.
    This is the text inside the first quotes.

    • The North Face - Triple C - Parka nero
    • The North Face - Triple C - Parka nero
    • The North Face - Triple C - Parka verde
    • The North Face - Triple C - Parka verde
    https://www.awin1.com/pclick.php?p=27555807329&a=357849&m=9606,"The North Face - Triple C - Parka nero",27555807329,10645488,https://images.asos-media.com/products/the-north-face-triple-c-parka-nero/21562504-1-tnfblack?$XXL$,"Parka di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo foderato in pile Chiusura con zip a due direzioni Tasche laterali Logo ricamato sul davanti e sul retro Tasche laterali Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-nero%2F21562504-1-tnfblack%3F%24XXL%24&feedId=36675&k=94e5936958a0ebc83cd2a30f382350b5f649d4fc,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-nero/prd/21562504?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27555807331&a=357849&m=9606,"The North Face - Triple C - Parka nero",27555807331,10645490,https://images.asos-media.com/products/the-north-face-triple-c-parka-nero/21562504-1-tnfblack?$XXL$,"Parka di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo foderato in pile Chiusura con zip a due direzioni Tasche laterali Logo ricamato sul davanti e sul retro Tasche laterali Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-nero%2F21562504-1-tnfblack%3F%24XXL%24&feedId=36675&k=94e5936958a0ebc83cd2a30f382350b5f649d4fc,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-nero/prd/21562504?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27583651913&a=357849&m=9606,"The North Face - Triple C - Parka verde",27583651913,10645480,https://images.asos-media.com/products/the-north-face-triple-c-parka-verde/21562497-1-newtaupegreen?$XXL$,"Giacca di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo alto Chiusura con zip a due direzioni Tasca interna per dispositivi elettronici Tasche laterali Bordo con elastico Fondo asimmetrico più lungo sul retro Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-verde%2F21562497-1-newtaupegreen%3F%24XXL%24&feedId=36675&k=47ee36d903c628a7bd5bb0184718e1175767df4b,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-verde/prd/21562497?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    https://www.awin1.com/pclick.php?p=27583651915&a=357849&m=9606,"The North Face - Triple C - Parka verde",27583651915,10645481,https://images.asos-media.com/products/the-north-face-triple-c-parka-verde/21562497-1-newtaupegreen?$XXL$,"Giacca di The North Face Questo articolo non fa parte della promozione Modello imbottito Cappuccio removibile Collo alto Chiusura con zip a due direzioni Tasca interna per dispositivi elettronici Tasche laterali Bordo con elastico Fondo asimmetrico più lungo sul retro Vestibilità classica Veste perfettamente la taglia indicata","Donna > Cappotti e Giacche > Giacche",380.00,9606,"Women's Outerwear",206,https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Aimages.asos-media.com%2Fproducts%2Fthe-north-face-triple-c-parka-verde%2F21562497-1-newtaupegreen%3F%24XXL%24&feedId=36675&k=47ee36d903c628a7bd5bb0184718e1175767df4b,EUR,,https://www.asos.com/it/the-north-face/the-north-face-triple-c-parka-verde/prd/21562497?browseCountry=IT&browseCurrency=EUR,EUR380.00,"The North Face",,,"380.00 €",
    
    

    Thank you very much.



  • @fred-zept said in Delete near duplicate lines:

    Since this is something very new to me, and sometimes very difficult to understand,

    I can infer from your “english” that perhaps it isn’t your primary language so I can excuse your choice of words, hence the issues we had in understanding exactly your request. It’s good that you did eventually provide examples within the black box as we need that to better help you. Your original examples you will notice had the " altered, hence our need to see examples within the black box.

    @PeterJones had the answer which I think is the best way for you to proceed. By breaking it down in a number of steps it becomes more manageable and almost certainly easier to understand.

    So by adding line numbering we can store the original position in the file of each line even if we sort them in a different order. But of course to sort in a different order we need the primary sort key which is currently in the middle of each line.
    So the steps (as I see it) are:

    1. Add line numbering:
      To do this you need to place the cursor in the very first position on the first line, right against the left margin. Then using the column editor we add a delimiter, I used the @@ as “text to insert”. If completed correctly there will be the @@ characters at the start of EVERY line. Next we leave the cursor in that very first position and using column editor again we use it to insert numbers, starting with 1, increasing by 1 and with “leading zeroes” ticked. So now every line has a number (with leading zeroes) followed by @@ and then the original line.

    2. Add primary sort key to front of line:
      This is the string you use to define when a duplicate appears and thus the line should be removed. Using the “Replace” function (with search mode set as regular expression)
      Find What:(?-s)^(\d+@@https[^"]+)("[^"]+")
      Replace With:\2\1\2
      Hit Replace All. Now every line (excluding blank lines) starts with the string you need to sort on and will put ALL duplicates together on consecutive lines. I included the https portion so that the blank lines would not get any string added. All 4 example lines you provided have this and as long as that is correct for the remainder of the file these steps will work.

    3. Sort lines:
      Sort using “sort lines lexicographically descending” (reverses the lines). This will place a “duplicate” line before the “original” line, there is a reason for this.

    4. Remove Duplicate Lines:
      Again using the “Replace” function.
      Find What:(?-s)^("[^"]+").+\R(?=\1)
      Replace With: empty field here
      By sorting lines descending (step #3), effectively the line numbers are in reverse and we can easily remove the first line of a duplicate “pair”. The first line will have a higher line number than the second line of the pair AND the “pairs” will ALWAYS be on consecutive lines. This WILL work even if 3 (or more) lines are duplicates.

    5. Remove the primary sort key.
      “Replace” Function again.
      Find What:(?-s)^.+?(?=\d+@@)
      Replace With: empty field here

    6. Re-sort to put lines back in original order.
      Sort Lines as Integers Ascending puts remaining lines back in original order

    7. Remove the line numbers and delimiter.
      “Replace” function again.
      Find What:(?-s)^\d+@@
      Replace With: empty field here

    At this point you should be back to the original look, with “duplicate” lines removed.
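
    The seven steps above can also be mirrored in a short script. This is only a sketch of the same number / sort / remove / re-sort idea in Python, not something from Notepad++ itself; `dedupe_keep_first` and `key_col` are hypothetical names, and the key is assumed to be the second CSV cell (the first quoted text).

```python
import csv

def dedupe_keep_first(lines, key_col=1):
    # Steps 1-2: remember each line's original number and put the sort key in front.
    decorated = [
        (next(csv.reader([ln]))[key_col], n, ln)
        for n, ln in enumerate(lines, start=1)
        if ln.strip()  # blank lines get no key and are dropped here
    ]
    # Step 3: sort descending, so within one key the highest line number comes first.
    decorated.sort(key=lambda t: (t[0], t[1]), reverse=True)
    # Step 4: drop a line whenever the following line carries the same key.
    survivors = [
        cur
        for cur, nxt in zip(decorated, decorated[1:] + [None])
        if nxt is None or nxt[0] != cur[0]
    ]
    # Steps 5-7: discard the key, restore the original order, discard the numbers.
    survivors.sort(key=lambda t: t[1])
    return [ln for _, _, ln in survivors]

lines = [
    'https://url1,"near-duplicate-one",1',
    'https://url2,"near-duplicate-one",2',
    'https://url3,"line1",3',
    'https://url4,"line2",4',
    'https://url5,"near-duplicate-one",5',
]
for line in dedupe_keep_first(lines):
    print(line)
```

    In a script the sort is not strictly needed (a set of already-seen keys would give the same result), but this version follows the editor steps one-to-one, which makes it easier to compare against what each Replace or Sort did.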

    Try these steps, hopefully if your examples are indicative of the file all should work. If not, then you will need to explain further and please also include additional examples where it did NOT work so we may see what changes in the steps may be necessary to overcome the problems.

    Terry



  • @Terry-R said in Delete near duplicate lines:

    I can infer from your “english” that perhaps it isn’t your primary language

    Hi @Terry-R. Exactly, that’s the reason why something is very hard to understand.

    I don’t know how to thank you for this, it worked perfectly.
    In my next requests I’ll try to be as clear as possible.

    Thank you all for your time.



  • @fred-zept said in Delete near duplicate lines:

    your first code deletes 3 lines rather than 2.

    I know @Terry-R’s solution worked for you. But to reply to the specific complaint:

    Yes, in circumstances like the ones in your new data, where more than one CSV cell has quotes around it, my regex could pick a different quoted column as the “check for same string” cell. My regex could be changed to FIND = (?-s)^(.*?(,".*?",).*$\R)(^.*\2.*$\R*)+ , adding the ? after the first .* to make it non-greedy, so that it finds the first matching quoted string rather than grabbing the last matching quoted string. (Other restrictions could be added, such as not allowing , or " in the first match: (?-s)^([^,"]*?(,".*?",).*$\R)(^.*\2.*$\R*)+ )
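
    To see the difference between the variants outside Notepad++, here is a small Python sketch. It is an adaptation, not the exact Boost syntax: Python's `re` has no `\R`, so `\n` stands in for it, and `(?-s)` is omitted because dot-matches-newline is already off by default. The rows are made up, with a second quoted cell that is identical on every line.

```python
import re

# Adapted from the three FIND expressions in this thread (\R replaced by \n).
GREEDY = r'^(.*(,".*?",).*$\n)(^.*\2.*$\n*)+'
LAZY = r'^(.*?(,".*?",).*$\n)(^.*\2.*$\n*)+'
STRICT = r'^([^,"]*?(,".*?",).*$\n)(^.*\2.*$\n*)+'

def dedupe(pattern: str, text: str) -> str:
    # Replace each matched run with its first line, like REPLACE = $1.
    return re.sub(pattern, r'\1', text, flags=re.MULTILINE)

rows = (
    'u1,"nero",1,"Outerwear",\n'
    'u2,"nero",2,"Outerwear",\n'
    'u3,"verde",3,"Outerwear",\n'
    'u4,"verde",4,"Outerwear",\n'
)
# Two different first cells that still share the later quoted cell.
mixed = (
    'u1,"nero",1,"Outerwear",\n'
    'u2,"verde",2,"Outerwear",\n'
)

print(dedupe(GREEDY, rows))    # only u1 is left: the shared last cell was used as the key
print(dedupe(LAZY, rows))      # u1 and u3 are left: the first quoted cell was used
print(dedupe(LAZY, mixed))     # only u1 is left: backtracking fell back to the shared cell
print(dedupe(STRICT, mixed))   # both lines are left: the key must be the first quoted cell
```

    In this sketch the non-greedy version fixes the two-pairs case, but it can still fall back to a later quoted cell when the first cells differ. The stricter form with `[^,"]*?` avoids that, as long as the text before the first quoted cell contains no commas or quotes.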

    Really, regex in a text editor is not the right tool for CSV; a spreadsheet program is, especially one with a macro/automation language like Excel+VBA. You can make it work in Notepad++ (as Terry showed), but it takes effort.





  • @ArkadiuszMichalski

    https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8965 << does not describe this problem?

    I don’t think having that would help.

    OP’s first posting said:

    Basically I need to keep only one line with TEXT1, one with TEXT4, etc., and delete the near duplicates.

    Note the use of “near duplicates”, not “exact duplicates”





  • @ArkadiuszMichalski said in Delete near duplicate lines:

    https://community.notepad-plus-plus.org/topic/20173/delete-near-duplicate-lines/4?_=1603208619351

    I don’t know what this means.
    I mean, I know it is a link but I don’t know what it is supposed to tell me, even looking at where that link points.
    The brevity of your posts often seems to produce something that isn’t really meaningful.
    We often get people that don’t want to bother typing a lot in their posts here (kind of an anti-PeterJones-type poster), and that’s a shame because people often have good thoughts – I guess the effort in expressing them is too difficult. Pity.



  • @PeterJones Thank you very much!

