Merge 2 Files - Lines Containing Same



  • Hello, I need to merge two different text files so that lines containing the same hash end up on one line.

    Example

    Text File Nr 1

    Test0 000027e23c421aeec283c0b491adb97a
    Test1 0000660f57cad07bc2d56a752ab1b051
    Test2 0000f78a8b2c0a5d71c1f651f7959bab
    Test3 0001034369cf3f54df7b537b9459d978

    Text File Nr 2
    4424 000027e23c421aeec283c0b491adb97a
    5678 0000660f57cad07bc2d56a752ab1b051
    9101 0000f78a8b2c0a5d71c1f651f7959bab
    3442 0001034369cf3f54df7b537b9459d978

    Result Should be


    Test0 000027e23c421aeec283c0b491adb97a 4424 (if there are more matches, all the other text goes here)
    Test1 0000660f57cad07bc2d56a752ab1b051 5678
    Test2 0000f78a8b2c0a5d71c1f651f7959bab 9101
    Test3 0001034369cf3f54df7b537b9459d978 3442


    It should match on the hash, which is always 32 characters long.

    Thank you for your help



  • @Shesh-Nioice said in Merge 2 Files - Lines Containing Same:

    merge 2 different textfiles and lines containing the same should be in one line

    If you were able to change the order of each line such that the data was first followed by the “Test0” and “4424” portions, like this:
    000027e23c421aeec283c0b491adb97a Test0
    then the data could be merged and then sorted. The same data would then appear together on consecutive lines, and it is a simple matter to process them into the format you wish. However, if the lines need to remain in the same order (Test0, Test1, Test2, etc.), it is still possible but will require more steps.

    Terry



  • @Terry-R said in Merge 2 Files - Lines Containing Same:

    However if the lines need to remain in the same order

    Actually, thinking about it a bit more: if the processed lines must end up in the same order as the original “Text File Nr 1”, and that file is already in alphabetical order (as represented by Test0, Test1, Test2, etc.), then no extra steps are needed. The final formatting step (putting “Test0”, “Test1”, etc. back at the start of the line) allows one last sort that returns the lines to the correct order.

    So:

    1. Is the final order of the lines important or not?
    2. Is Text File Nr 1 already sorted (alphabetically or numerically)?

    Terry



  • @Terry-R Hey Terry, thank you for your reply.

    The format can be changed to:
    000027e23c421aeec283c0b491adb97a Test0
    for both files; this shouldn’t be a problem.

    The final order of the lines is not important, and the text in the files is not sorted.

    What matters to me is that the text containing the same hash is merged in the end result.
    Sorting by hash doesn’t help me because file Nr 1 has fewer lines than file Nr 2.

    So sorting would not help: if I sorted both files and put the lines together side by side, they wouldn’t match.
    Thank you.



  • @Shesh-Nioice said in Merge 2 Files - Lines Containing Same:

    So sorting would not help because if I would sort them and put the lines together it wouldn’t match

    True, if you are referring to the original line makeup (Test0 at the front). If, however, the makeup is changed (you just said it is OK to put the Test0 and 4424 at the end), then a sort WILL bring lines with the same hash together.

    So if you like I can mock up some steps and regular expressions to do most of the work for you. You just need to press some buttons and do some minor key entry work.

    Just as a matter of interest, how many lines does each of the files contain?

    Terry



  • @Terry-R This would be a huge help for me, if you have some free time. Thank you.
    File 1 contains 28k lines.
    File 2 contains 900k lines.

    I only need the 28k matched lines; the rest can be deleted.
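
    For comparison, the same join can be sketched outside the editor. This is not the method worked out in this thread; it is a minimal Python sketch using a dictionary keyed on the hash, and it assumes each line is exactly “label hash” and that every hash occurs at most once in file 1. The function name `merge_by_hash` and the unmatched sample line are made up for illustration.

```python
def merge_by_hash(file1_lines, file2_lines):
    """Join "label hash" lines from two sources on the hash column."""
    # Index the smaller file by hash (assumes each hash occurs once in file 1).
    rows = {}
    for line in file1_lines:
        parts = line.split()
        if len(parts) == 2:
            label, digest = parts
            rows[digest] = [label, digest]

    # Walk the larger file and append its label wherever the hash is known.
    for line in file2_lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] in rows:
            rows[parts[1]].append(parts[0])

    # Keep only the hashes that found at least one partner in file 2.
    return [" ".join(row) for row in rows.values() if len(row) > 2]


file1 = [
    "Test0 000027e23c421aeec283c0b491adb97a",
    "Test1 0000660f57cad07bc2d56a752ab1b051",
]
file2 = [
    "4424 000027e23c421aeec283c0b491adb97a",
    "9999 ffffffffffffffffffffffffffffffff",  # hypothetical line with no partner
    "5678 0000660f57cad07bc2d56a752ab1b051",
]
result = merge_by_hash(file1, file2)
print("\n".join(result))
```

    Only the smaller file is held in the dictionary, so the 900k-line file can be read line by line.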



  • @Shesh-Nioice said in Merge 2 Files - Lines Containing Same:

    This would be a huge help for me.

    First off, combine the 2 files: copy the contents of one file into the other; it doesn’t matter where. I’m assuming “Test0” and “4424” are real representations of the “names” at the start of the lines.

    Every regex (regular expression) below requires that the “search mode” is set to regular expression, VERY IMPORTANT!

    Second, we need to reformat the lines so the “Test0” etc. is at the end of the line. Use the following regex in the “Replace” function.
    Find What: (?-s)^(\w+)(\s)(\w+)$
    Replace With: \3\2\1
    Hit the “Replace All” button.
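
    To see what this replace does without opening the editor, here is a small Python sketch. It is only an approximation: Python’s `re` module is a different engine from the one in Notepad++, so the `(?-s)` prefix is dropped (Python’s `.` already stops at line ends by default). The sample lines come from the example data.

```python
import re

# Two sample lines in the original "label hash" layout.
lines = [
    "Test0 000027e23c421aeec283c0b491adb97a",
    "4424 000027e23c421aeec283c0b491adb97a",
]

# Same pattern as the "Find What" field, minus the (?-s) prefix.
swap = re.compile(r"^(\w+)(\s)(\w+)$")

# \3\2\1 swaps the two words around the single whitespace character.
swapped = [swap.sub(r"\3\2\1", line) for line in lines]
print(swapped)
```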

    Now we need to sort the lines so that identical hashes end up on adjacent lines. Use the built-in function under the “Edit” menu, then “Line Operations”, then “sort lexicographically descending” (this puts Test0, Test1, etc. first within any run of duplicates).
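
    Why descending? A letter such as “T” sorts after any digit, so in a descending sort the “Test…” line comes before the numeric line that shares its hash. A minimal Python sketch, assuming that Python’s plain code-point comparison is close enough to the editor’s lexicographic sort for this data:

```python
# Hash-first lines from both files, in arbitrary order.
lines = [
    "000027e23c421aeec283c0b491adb97a 4424",
    "0000660f57cad07bc2d56a752ab1b051 Test1",
    "000027e23c421aeec283c0b491adb97a Test0",
    "0000660f57cad07bc2d56a752ab1b051 5678",
]

# Descending sort: equal hashes become neighbours, "Test..." line first.
ordered = sorted(lines, reverse=True)
for line in ordered:
    print(line)
```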

    Next, we combine lines when the hash is the same. Again use the “Replace” function.
    Find What: (?-s)^(\w+)(\s.+)(\R)\1(\s.+)(\R|\z)
    Replace With: \1\2\4\3
    Hit the “Replace All” button.
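
    The combining step can be tried in Python as well, with two adaptations that are assumptions of this sketch: Python’s `re` has no `\R`, so `\n` stands in for it, and `\Z` is used for the end of the text. The sketch also repeats the substitution until nothing changes, on the assumption that a single pass joins lines only in pairs, so a hash with three lines needs a second pass.

```python
import re

# Sorted, hash-first sample text; the first hash occurs three times.
text = (
    "0001034369cf3f54df7b537b9459d978 Test3\n"
    "0001034369cf3f54df7b537b9459d978 3678\n"
    "0001034369cf3f54df7b537b9459d978 3442\n"
    "0000f78a8b2c0a5d71c1f651f7959bab Test2\n"
    "0000f78a8b2c0a5d71c1f651f7959bab 9101\n"
)

# Group 1 is the hash; the backreference \1 demands the same hash on the
# next line. Group 3 (the line break) is re-emitted after the joined line.
combine = re.compile(r"^(\w+)(\s.+)(\n)\1(\s.+)(\n|\Z)", re.MULTILINE)

merged, count = text, 1
while count:  # repeat until a pass makes no replacement
    merged, count = combine.subn(r"\1\2\4\3", merged)

print(merged, end="")
```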

    Now that the lines are combined, we need to bring the “Test0”, “Test1”, etc. back to the front. Again use the “Replace” function:
    Find What: (?-s)(\w+)(\s)(\w+)(.+)*$
    Replace With: \3\2\1\4
    Hit the “Replace All” button.
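
    A Python sketch of this last replace follows. As an assumption of the sketch, the trailing `(.+)*` is written as `(.*)` and a leading `^` is added; for lines like these both forms should capture the same text, namely everything after the second word.

```python
import re

# Combined, hash-first lines as produced by the previous step.
lines = [
    "0001034369cf3f54df7b537b9459d978 Test3 3442 3678",
    "0000f78a8b2c0a5d71c1f651f7959bab Test2 9101",
]

# Group 1 = hash, group 3 = label, group 4 = the rest of the line.
front = re.compile(r"^(\w+)(\s)(\w+)(.*)$")

# \3\2\1\4 moves the label in front of the hash and keeps the rest as is.
restored = [front.sub(r"\3\2\1\4", line) for line in lines]
print(restored)
```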

    At this point I would have (given your example data along with 1 additional line to produce a three-way match):

    Test3 0001034369cf3f54df7b537b9459d978 3442 3678
    Test2 0000f78a8b2c0a5d71c1f651f7959bab 9101
    Test1 0000660f57cad07bc2d56a752ab1b051 5678
    Test0 000027e23c421aeec283c0b491adb97a 4424
    

    I see your late additional step about removing any lines which were NOT duplicated. I can supply an additional step shortly but wanted to give you what I was working on while waiting for your reply.

    I hope this works for you. It does rely heavily on your example data being a “real” representation of the data you are working with. If it is not then you need to portray the real data, or at least identify why my processes failed. We can then work on any changes to help get you the answer you seek.

    Terry



  • @Terry-R said in Merge 2 Files - Lines Containing Same:

    I see your late additional step about removing any lines which were NOT duplicated. I can supply an additional step shortly

    So to remove the NON duplicated data lines I’ve used the “Mark” function this time. It can also be done with a “Replace” function but this will give you more insight into the NPP functions and how they can help you with various tasks.

    The Mark function is under the “Search” menu, below the “Replace” option. Don’t select the “Mark All” menu entry; that is a different option.
    We insert the regex:
    Find What: (?-s)^(\w+)\s(\w+)$
    Make sure “Bookmark Lines” is ticked and the search mode is set to “regular expression”, then click the “Mark All” button in the dialog. Now close this window and you will see that some of the lines are marked with a blue dot at the start (the default icon). These are the lines that found no partner, so remove them with “Remove Bookmarked Lines”, which is under “Search”, then “Bookmark”.
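
    The idea behind the pattern is that a line which still consists of exactly two words was never combined with anything. A minimal Python sketch of the same filter (the `(?-s)` prefix is dropped for Python, and the last sample line is a made-up unmatched entry):

```python
import re

lines = [
    "Test3 0001034369cf3f54df7b537b9459d978 3442 3678",
    "Test2 0000f78a8b2c0a5d71c1f651f7959bab 9101",
    "9999 ffffffffffffffffffffffffffffffff",  # hypothetical line with no partner
]

# Exactly two words on the line means no match was appended to it.
unmatched = re.compile(r"^(\w+)\s(\w+)$")

# Equivalent of bookmarking the matching lines and removing them.
kept = [line for line in lines if not unmatched.match(line)]
print(kept)
```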

    Terry



  • Hey Terry, again I really have no words; everything worked as expected.
    It saved me a lot of time. Big thanks, great community forum. I should start learning regular expressions; very interesting and helpful.

    Thanks a lot for taking the time to help me. :)
    Great work; nice to see that you take time for noobs like me.
    How can I give you reputation or positive feedback? I’m new here.



  • @Shesh-Nioice said in Merge 2 Files - Lines Containing Same:

    How can I give you reputation or positive feedback? I’m new here.

    Well, I see you’ve already up-voted my solution and that’s all we need to see. It is nice when posters do come back and give us feedback (of any kind). I’m glad it worked out so well on the first go. We do sometimes have to adjust if the examples provided did not give a true representation and I was concerned your “Test0”, “Test1” etc might have fallen into that.

    Thanks
    Terry



  • @Terry-R Yes, I was surprised myself. I first tested whether the regex matched my case at https://regex101.com/
    and it did. Everything worked on the first try. Big thanks again.



  • @Shesh-Nioice said in Merge 2 Files - Lines Containing Same:

    I should start learning regular expressions; very interesting and helpful.

    To learn about regexes (regular expressions) start with our FAQ section. There are a number of helpful links to sites (I see you already found one, regex101.com). There is the manual for NPP which is on the NPP homepage. Other references are also linked within the FAQ posts.

    I was where you are now not too long ago, but I too found the regulars here were very helpful to me and now I pass that forward. As always we strive not only to help users, but to guide them so they can gain more knowledge. Regexes are brilliant but don’t forget to start small at first as it can be quite daunting to attempt to read some of the more complex regexes that are provided on this forum. The beauty of regex101.com is that it does explain what the regex is doing so that will give you some insight.

    Cheers
    Terry

