search and replace for a newbie



  • I have about 1000 .txt files that need some of their data stripped out. Each one contains similar lines of data, such as:
    "A0000058100000001 000000012016100420161004NLEE 4 110M00102 "

    the only constant in each file would be the leading “A00000” and the trailing "4 110M00102 "

    Is there an easy way to do this in the “find in files dialog” using the location of the files and then “replace in files” for the 1000 files or does it have to get a little more complicated by adding code?



  • @Barry-Payne

    Not sure why this deserved a downvote…?

    Okay, so Yes, there is an easy way… You didn’t say what, if anything, you want to replace the “stripped out” portion with, so let’s just assume it is purely a strip out…

    Find what zone: ^A00000.*?4 110M00102$
    Replace with zone: A000004 110M00102
    Search mode: Regular expression

    So this is just very basic, and chances are good when you see what it does you will have an “Ah ha” moment, and then refine what you are really trying to do…

    But anyway, this looks for the beginning of a line (the ^), then your leading text (A00000), then a minimal run of any characters (the .*?), then your trailing text (…you get the idea…) at the end of the line (the $ specifies the end of line).
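
    To try the same find/replace outside of Notepad++, here is a minimal Python sketch. It assumes Python's re module behaves like the Notepad++ engine for this simple pattern, with re.MULTILINE set so that ^ and $ work per line; the variable names are made up for illustration. It also checks one thing worth knowing: the sample line quoted in the first post appears to end with a space, and if the real lines do too, the $ anchor cannot match.

```python
import re

# Sample line from the first post, with the trailing space removed.
sample = "A0000058100000001 000000012016100420161004NLEE 4 110M00102"

# Same text as the "Find what" zone; MULTILINE makes ^ and $ apply to each line.
pattern = re.compile(r"^A00000.*?4 110M00102$", re.MULTILINE)

# Same text as the "Replace with" zone.
stripped = pattern.sub("A000004 110M00102", sample)
print(stripped)  # A000004 110M00102

# Assumption: a line that really ends with a space, as the quoted sample seems to.
with_space = sample + " "
print(pattern.search(with_space))  # None, because $ is not reached right after ...102
```

    If the second case applies to the real data, dropping the $ or allowing optional trailing spaces before it would be the thing to try.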



  • @Scott-Sumner I can’t thank you enough. I’m not familiar enough with Notepad++ and its operators; I was just trying to use the * wildcard and was not having much luck. And yes, it’s purely a strip-out to remove the unwanted data from the file extract generation. I appreciate it very much and will let you know how it goes!



  • ok still not doing something right…



  • @Barry-Payne said:

    still not doing something right…

    If you want additional help, you’ll have to provide more specifics than that…



  • @Scott-Sumner well, I was trying to figure out a way to put a pic in here … I’ll try to explain it

    so I did the find in files
    find what: ^A00000.*?4 110M00102$
    replace with:
    filter defaulted to *.*
    put the directory in as to where the files are
    regular expression

    Search “^A00000.*?4 110M00102$” (0 hits in 0 files)



  • @Barry-Payne

    I would “start simpler”…try a find (not replace yet) operation on a single file that you have loaded into an editor tab window. Save the Replace-in-Files for when you have totally debugged the operation on a single file…

    If you want to embed an image in a posting on this site, then this thread gives an example of how to do so.



  • @Scott-Sumner so that’s the thing: when I search for a specific line, such as:

    A0000062700000001 000000012016102820161028NLEE 4 110M00104

    I get the following results

    Search “A0000062700000001 000000012016102820161028NLEE 4 110M00104” (68 hits in 1 file)

    It’s the generic search I can’t seem to master. Otherwise I would have to go into each file and copy and paste, and since I’d already be in the file I might as well do it there, except that it would take me two days given the number of files.



  • First, an aside: your most recent example could not have matched the previous pattern, because in the previous example, it was ending in 102, now it’s ending in 104. Changing things midstream makes it more difficult for us to help you, especially when you don’t even point out you changed it. For all we know, the reason that ^A00000.*?4 110M00102$ didn’t find anything is because your lines all ended in 104, so should have been ^A00000.*?4 110M00104$.

    I’m not sure what you’re trying to say with the “i would have to go into each file and copy and paste”; if you’re complaining that one file at a time will take too long, you didn’t understand Scott. The reason Scott rightly suggested you start with just a find is because if you cannot get the find to work, you are not going to be able to find and replace across multiple files. So he suggested simplifying. Once the simple is working, then you can try to make it more complicated.

    Now that we know you can successfully search for one of the long strings, let’s go the next step. We’ll start with a known file that we can easily replicate (to help you). If I take the two strings you’ve given us,

    A0000062700000001 000000012016102820161028NLEE 4 110M00104
    A0000058100000001 000000012016100420161004NLEE 4 110M00102
    

    and make a file with 20 copies of each, I get a 40-line dummy file to test with.

    Now, in the FIND dialog, search for the same string you just successfully searched for. If you do the COUNT, it should find 20. Then change the string to ^A00000.*?4 110M00104$. COUNT should still give you 20. Change the string to ^A00000.*?4 110M00102$ – this should find the other 20 instances in this dummy file. At this time I took a screenshot (below). (I realized while typing this up, I should have made the dummy file with different numbers of each line, so the counts would change, but I don’t want to regenerate the file at this point)

    Screenshot (image embedded as ![](https://i.imgur.com/7h05zUe.png)):

    Now, if that’s working, you can try the “Replace” tab in one file. Then, once that’s working for you, you can use the “Find in Files” to do multiple files at once.
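
    The dummy-file test above can also be reproduced in a few lines of Python, as a rough sketch only. The count helper is a made-up stand-in for the Count button, and it assumes the lines have no trailing spaces.

```python
import re

line_104 = "A0000062700000001 000000012016102820161028NLEE 4 110M00104"
line_102 = "A0000058100000001 000000012016100420161004NLEE 4 110M00102"

# 20 copies of each line, one per line, like the dummy file described above.
dummy = "\n".join([line_104] * 20 + [line_102] * 20) + "\n"

def count(pattern, text):
    """Rough equivalent of the Count button in regular-expression mode."""
    return len(re.findall(pattern, text, flags=re.MULTILINE))

literal_104 = count(re.escape(line_104), dummy)        # the full literal line
generic_104 = count(r"^A00000.*?4 110M00104$", dummy)  # generic pattern, ...104
generic_102 = count(r"^A00000.*?4 110M00102$", dummy)  # generic pattern, ...102
print(literal_104, generic_104, generic_102)  # 20 20 20
```

    If the literal count comes out right but the generic ones come out as 0 on real data, the lines probably differ from the samples in some way not shown here (extra text or spaces at either end, for instance).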



  • ok so I’ll try to be more clear, some will end in 2 and some will end in 4, I thought if the original post with a 2 worked then I could change it to 4 and it would work as well…

    There are about 1000 files downloaded as .txt. They were dumped by a program that used ASCII to generate the files in batch mode. Could the encoding be an issue?

    this is an example of lines from inside one file

    A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 SAFETY FIRST- USE LOCK OUT/TAG OUT PRIOR
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 TO DOING THIS JOB
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 Determine what energy sources will be
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 locked out. (Electrical, Gas, Pneumatic,
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 Hydraulic, Steam, Etc#)
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 IF YOU ARE NOT SURE WITH THE ABOVE- SEE
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 SUPERVISOR IMMEDIATELLY

    so the find/count option you listed above works great with the long file like
    A0000059700000001 000000012016101020161010NLEE 4 110M00104

    Again, it’s the generic one I can’t get to work no matter the combination I use, and I’ve played with quite a few combinations and versions of ^A00000.*?4 110M00104$

    Sorry I’m not catching on fast enough



  • Don’t worry about the 1000 yet; ignore multiple files; we haven’t gotten that far, as I said before. Until you give us good enough data to make a valid search string, the 1000 files and the replace field don’t matter.

    Until you can get the generic match working, nothing else matters.

    Given your new data set, when you tell the regex that the line must end right after the 110M00104 (by ending the pattern with a $), of course it’s not going to find any of the lines that contain information after the 110M00104.

    Let’s take a step back to regular expressions.

    First, in my image, you will see that ☑ Regular Expression is selected. Is that what you have? If not, you will need it, otherwise the special characters we are using will not do the wildcard matching we desire.

    Next, do you understand what the ^ and $ and .*? do in the regular expression? I am thinking not, given your terminology, and the sudden surprise of data after the 110M00104.

    • ^ matches beginning of line. So the line must start with whatever follows the ^, or it will not match. No spaces, no tabs, just that A00000 sequence. Do you have spaces at the beginning of the line, or is the A really the first character?
    • $ matches end of line. So the line must end with whatever comes before the $… in this case, 110M00104. If the line doesn’t end with that, it cannot match.
    • . is a wildcard that matches any single character
    • * makes the previous character (or wildcard) match 0 or more instances. So .* means “match 0 or more instances of any character”
    • ? placed after a quantifier such as * makes that quantifier non-greedy, so .*? means “match 0 or more instances of any character, but as few as possible”.
    • wrapping characters in [] will allow any of those characters to match. For example, if we want to match 4 110M00104 or 4 110M00102, we could write that as 4 110M0010[24]. I will be using that notation below, so that’s why I explained it, even though we hadn’t seen it yet.
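
    To make those pieces concrete, here is a small Python sketch run against lines like the ones posted. It is only an illustration: Python's re module is assumed to treat ^, $, ., *, ? and [] the same way for these patterns, and the fourth test line (with leading spaces) is hypothetical, added just to show what ^ does.

```python
import re

lines = [
    "A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM",
    "A0000059700000001 000000012016101020161010NLEE 4 110M00104",
    "A0000058100000001 000000012016100420161004NLEE 4 110M00102",
    "  A0000058100000001 000000012016100420161004NLEE 4 110M00102",  # hypothetical leading spaces
]

patterns = {
    "anchored": r"^A00000.*?4 110M00104$",       # line must end right after ...104
    "either_end": r"^A00000.*?4 110M0010[24]$",  # ...102 or ...104, still anchored by $
    "prefix_only": r"^A00000.*?4 110M0010[24]",  # no $, so text may follow the marker
}

# For each pattern, collect the 1-based numbers of the lines it matches.
results = {
    name: [i for i, line in enumerate(lines, start=1) if re.search(rx, line)]
    for name, rx in patterns.items()
}

for name, hits in results.items():
    print(name, hits)
# anchored [2]
# either_end [2, 3]
# prefix_only [1, 2, 3]
```

    Line 1 only matches once the $ is dropped, and line 4 never matches because the A is not the first character on the line.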

    Now, let’s answer some questions:

    1. Will there ever be spaces or tabs or other characters before the A of the A00000 which you said was a constant?
    2. Will there ever be anything different (B00000, or A12345), or is A00000 really a constant.
    3. You have mentioned 4 110M00104 and 4 110M00102 as constant “end-markers”. Are there any others? Are there any other restrictions (will there always be a space before the 4? or a space-or-tab? anything else we should know?)
    4. Can anything occur after that? You have shown some sort of text after, so I am now thinking the answer will be “yes”, but maybe this new data isn’t representative, either, so I cannot really assume anything. If something can come after, are there any restrictions as to what comes after? Will there always be a space between the 4 110M0010[24] and the extra text, or not?
    5. Are there any restrictions as to what comes between the A00000 and the 4 110M0010[24]? Can it contain the string 4 110M0010[24] or not? You always seem to show the same number of characters; will that always be true, or can it be longer or shorter? Will there ever be more or fewer spaces in the intervening text or not?
    6. Will any of these matches ever span two or more lines (ie, will the stuff between the A00000 and the 4 110M0010[24], or the stuff after the 4 110M0010[24], ever contain newlines?)
    7. Is there any other characteristic or feature or restriction or extra format of the data that you’ve forgotten to mention that I haven’t asked about?

    Once we get the right data matching format, and can get it consistently matching what you think are the right number of lines in your data, then we will start working on the replace requirements. Only after we’ve got the search and replace working in one file will we even consider moving on to multiple files. I would love to give it to you all at once, but these surprise-requirements and differences in data rows are making that impossible.



  • So after playing around and searching the web I ended up with this

    ^A.*110M00104(?!.*110M00104$)

    the count works
    then I went to find in file and it worked as well

    going to copy the folder with the files and test

    Thanks for the help…
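
    For anyone who would rather script that last step than use Find in Files, here is a rough Python sketch of the same strip-out over a folder of .txt files. Everything except the regex is an assumption: the function name, the ASCII encoding (going by the description of the export earlier in the thread) and the empty replacement. It runs against a temporary folder here, so no real data is touched.

```python
import re
import tempfile
from pathlib import Path

# The regex from this post; the replacement is empty (pure strip-out).
PATTERN = re.compile(r"^A.*110M00104(?!.*110M00104$)", re.MULTILINE)

def strip_in_folder(folder, encoding="ascii"):
    """Apply the strip-out to every .txt file in folder; return the number of replacements."""
    total = 0
    for path in sorted(Path(folder).glob("*.txt")):
        # newline="" leaves the existing line endings untouched.
        with open(path, "r", encoding=encoding, newline="") as fh:
            text = fh.read()
        new_text, n = PATTERN.subn("", text)
        if n:
            with open(path, "w", encoding=encoding, newline="") as fh:
                fh.write(new_text)
        total += n
    return total

# Demo on a throwaway folder with two of the sample lines.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text(
        "A0000059700000001 000000012016101020161010NLEE 4 110M00104 TO DOING THIS JOB\n"
        "A0000059700000001 000000012016101020161010NLEE 4 110M00104\n",
        encoding="ascii",
    )
    changed = strip_in_folder(tmp)
    result = demo.read_text(encoding="ascii")

print(changed)       # 2
print(repr(result))  # ' TO DOING THIS JOB\n\n'
```

    As with the copied folder, it is safest to point something like this at a copy of the data first.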



  • Glad it’s working for you.


    For extra regex information, here are some good links, compiled and often quoted by @guy038, the forum’s regex guru (I am probably going to start quoting/paraphrasing Guy’s frequent postscript, because it’s an awesome summary of regex information and links.)

    Here is a good starting point for NPP users unfamiliar with regular expression concepts and syntax:

    Modern Notepad++ (since v6.0) uses the Boost C++ Regex library, v1.55.0 (similar to Perl regular expressions, v5.8):

    Other sites with valuable regular expression information include:



  • @PeterJones and @ScottSumner I appreciate the information about the expressions. You’re right, I wasn’t familiar with them, and they helped me finally get the count working. After that I moved forward with Find in Files. I really appreciate you guys taking time out to help me understand the software and how much more useful it can be.

    is there some way to reward points or something?



  • @Barry-Payne said:

    reward points or something?

    Yes, we prefer bitcoin … :-D
    …or just “upvote”: See the little ^ 0 v to the right of every post? It might say ^ 1 v instead if someone else “liked” the posting previously. PRESS THE ^ ! Help me catch @Claudia-Frank…just kidding, I could never surpass CF…

    Which leads back to my original question on why someone decided your original posting in this thread was worth a “downvote”…?

    @PeterJones thanks for picking up this thread and bringing it to an apparently successful conclusion…I got busy and couldn’t get back to it til now…



  • Hello, @barry-payne and All,

    Sorry to revisit an already solved problem, but I do think that your regex ^A.*110M00104(?!.*110M00104$) could be written more simply :

    ^A.*110M00104

    Indeed, given your text, these two regexes give the same results : all lines from A0000 till 110M00104 are matched !

    A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 SAFETY FIRST- USE LOCK OUT/TAG OUT PRIOR
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 TO DOING THIS JOB
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 Determine what energy sources will be
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 locked out. (Electrical, Gas, Pneumatic,
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 Hydraulic, Steam, Etc#)
    A0000059700000001 000000012016101020161010NLEE 4 110M00104
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 IF YOU ARE NOT SURE WITH THE ABOVE- SEE
    A0000059700000001 000000012016101020161010NLEE 4 110M00104 SUPERVISOR IMMEDIATELLY
    

    In your text, the string 110M00104 occurs once per line. So, once the first part ^A.*110M00104 has been matched, the negative look-ahead (?!.*110M00104$) is always true, because no more string 110M00104 occurs at the end of line !

    Even if we change the last line, in such a way :

    A0000059700000001 000000012016101020161010NLEE 4 110M00104 110M00104 SUPERVISOR IMMEDIATELLY
    

    Both the regex ^A.*110M00104 and your regex ^A.*110M00104(?!.*110M00104$) would match the string A0000059700000001 000000012016101020161010NLEE 4 110M00104 110M00104
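
    This can be checked quickly with a small Python sketch. It is only an illustration and assumes that Python's re module handles the greedy .* and the negative look-ahead the same way for these two patterns; the three test lines are taken from the text above.

```python
import re

lines = [
    "A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM",
    "A0000059700000001 000000012016101020161010NLEE 4 110M00104",
    "A0000059700000001 000000012016101020161010NLEE 4 110M00104 110M00104 SUPERVISOR IMMEDIATELLY",
]

simple = re.compile(r"^A.*110M00104")
with_lookahead = re.compile(r"^A.*110M00104(?!.*110M00104$)")

# Every test line matches both patterns, so .group(0) is safe here.
matches_simple = [simple.search(line).group(0) for line in lines]
matches_lookahead = [with_lookahead.search(line).group(0) for line in lines]

print(matches_simple == matches_lookahead)                # True
print(matches_simple[2].endswith("110M00104 110M00104"))  # True
```

    The greedy .* already runs up to the last 110M00104 on the line, so nothing is left for the look-ahead to reject.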


    If you, really, want to match a specific string ( for instance the string ABCD ) with the condition that no other string ABCD occurs, further on, in the same line, you should use the negative look-ahead (?!.*ABCD.*ABCD), evaluated, at beginning of the current line !

    So, given the simple example text below :

    A000 Line 1 12345 ABCD Some text after
    A000 Line 2 12345 ABCD       ABCD
    A000 Line 3 12345 ABCD
    A000 Line 4 12345 ABCD       ABCD   Test
    

    The 3 regexes, below, do not match the lines 2 and 4, which contain the string ABCD, twice :-)

    • ^(?!.*ABCD.*ABCD)A.*ABCD matches between A000 and the unique occurrence of ABCD ( Line 1 and 3 )

    • ^(?!.*ABCD.*ABCD)A.*ABCD(?=.) matches between A000 and the unique occurrence of ABCD, which does not end the line ( Line 1, only )

    • ^(?!.*ABCD.*ABCD)A.*ABCD(?=\R) matches between A000 and the unique occurrence of ABCD, which ends the line ( Line 3, only )


    Of course, the four regexes, below, without the negative look-ahead at the beginning, consider all the lines and, moreover :

    • ^A.*ABCD matches between A000 and the last occurrence of the string ABCD ( Line 1 to 4 )

    • ^A.*?ABCD matches between A000 and the first occurrence of the string ABCD ( Line 1 to 4 )

    • ^A.*ABCD(?=.) matches between A000 and an occurrence of ABCD, which does not end the line ( Lines 1, 2 and 4 )

    • ^A.*ABCD(?=\R) matches between A000 and an occurrence of ABCD, which ends the line ( Lines 2 and 3 )
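
    The seven regexes above can be checked against the four-line example with a short Python sketch. One substitution is needed and is an assumption of this sketch: Python's re module has no \R, so (?=\n) stands in for (?=\R), and every line of the test text ends with \n. The helper name is made up.

```python
import re

text = (
    "A000 Line 1 12345 ABCD Some text after\n"
    "A000 Line 2 12345 ABCD       ABCD\n"
    "A000 Line 3 12345 ABCD\n"
    "A000 Line 4 12345 ABCD       ABCD   Test\n"
)

patterns = [
    r"^(?!.*ABCD.*ABCD)A.*ABCD",
    r"^(?!.*ABCD.*ABCD)A.*ABCD(?=.)",
    r"^(?!.*ABCD.*ABCD)A.*ABCD(?=\n)",  # (?=\n) replaces (?=\R)
    r"^A.*ABCD",
    r"^A.*?ABCD",
    r"^A.*ABCD(?=.)",
    r"^A.*ABCD(?=\n)",                  # (?=\n) replaces (?=\R)
]

def matched_lines(pattern):
    """Return the 1-based numbers of the lines in which the pattern matches."""
    # MULTILINE only, no DOTALL: "." must not match a newline.
    rx = re.compile(pattern, re.MULTILINE)
    return [text.count("\n", 0, m.start()) + 1 for m in rx.finditer(text)]

results = [matched_lines(p) for p in patterns]
for p, hits in zip(patterns, results):
    print(hits, p)
# [1, 3], [1], [3], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 4], [2, 3]
```

    Leaving out re.DOTALL is the Python counterpart of keeping the ". matches newline" option unticked.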

    Best Regards,

    guy038

    P.S. :

    I forgot to mention that the . matches newline option must be UNTICKED, of course !

