search and replace for a newbie
-
I have about 1000 .txt files that need some of their data stripped out. Each file contains similar lines of data, such as:
"A0000058100000001 000000012016100420161004NLEE 4 110M00102 "
The only constants in each file would be the leading “A00000” and the trailing "4 110M00102 ".
Is there an easy way to do this in the “Find in Files” dialog, using the location of the files and then “Replace in Files” for the 1000 files, or does it have to get a little more complicated by adding code?
-
Not sure why this deserved a downvote…?
Okay, so Yes, there is an easy way… You didn’t say what, if anything, you want to replace the “stripped out” portion with, so let’s just assume it is purely a strip out…
Find what zone: ^A00000.*?4 110M00102$
Replace with zone: A000004 110M00102
Search mode: Regular expression
So this is just very basic, and chances are good that when you see what it does you will have an “Ah ha” moment, and then refine what you are really trying to do…
But anyway, this looks for the beginning of a line (the ^), then your leading text (A00000), then a minimal run of any characters (the .*?), then your trailing text (…you get the idea…) at the end of the line (the $ specifies the end of line).
-
@Scott-Sumner I can’t thank you enough. I’m not familiar enough with Notepad++ and the operators; I was just trying to use the * wildcard and was not having much luck. And yes, it’s purely a strip out, to remove the unwanted data from the file extract generation. I appreciate it very much and will let you know how it goes!
-
ok still not doing something right…
-
@Barry-Payne said:
still not doing something right…
If you want additional help, you’ll have to provide more specifics than that…
-
@Scott-Sumner well, I was trying to figure out a way to put a pic in here… I’ll try to explain it.
So I did the Find in Files:
find what: ^A00000.*?4 110M00102$
replace with:
filter defaulted to *.*
put the directory in as to where the files are
regular expression
Search “^A00000.*?4 110M00102$” (0 hits in 0 files)
-
I would “start simpler”…try a find (not replace yet) operation on a single file that you have loaded into an editor tab window. Save the Replace-in-Files for when you have totally debugged the operation on a single file…
If you want to embed an image in a posting on this site, then this thread gives an example of how to do so.
-
@Scott-Sumner so that’s the thing: when I search for a specific string such as:
A0000062700000001 000000012016102820161028NLEE 4 110M00104
I get the following results:
Search “A0000062700000001 000000012016102820161028NLEE 4 110M00104” (68 hits in 1 file)
It’s the generic search I can’t seem to master. I would have to go into each file and copy and paste, but since I’m already in the file, why not just do it there, other than that it would take me two days for the amount of files.
-
First, an aside: your most recent example could not have matched the previous pattern, because the previous example ended in 102, and now it ends in 104. Changing things midstream makes it more difficult for us to help you, especially when you don’t even point out that you changed it. For all we know, the reason that ^A00000.*?4 110M00102$ didn’t find anything is that your lines all ended in 104, so it should have been ^A00000.*?4 110M00104$.
I’m not sure what you’re trying to say with the “i would have to go into each file and copy and paste”; if you’re complaining that one file at a time will take too long, you didn’t understand Scott. The reason Scott rightly suggested you start with just a find is that if you cannot get the find to work, you are not going to be able to find and replace across multiple files. So he suggested simplifying. Once the simple is working, then you can try to make it more complicated.
Now that we know you can successfully search for one of the long strings, let’s go the next step. We’ll start with a known file that we can easily replicate (to help you). If I take the two strings you’ve given us,
A0000062700000001 000000012016102820161028NLEE 4 110M00104
A0000058100000001 000000012016100420161004NLEE 4 110M00102
and make a file with 20 copies of each, I get a 40-line dummy file.
Now, in the FIND dialog, search for the same string you just successfully searched for. If you do the COUNT, it should find 20. Then change the string to ^A00000.*?4 110M00104$. COUNT should still give you 20. Change the string to ^A00000.*?4 110M00102$; this should find the other 20 instances in this dummy file. At this time I took a screenshot (below). (I realized while typing this up that I should have made the dummy file with different numbers of each line, so the counts would change, but I don’t want to regenerate the file at this point.)
Screenshot (image embedded as ![](https://i.imgur.com/7h05zUe.png)):
Now, if that’s working, you can try the “Replace” tab in one file. Then, once that’s working for you, you can use the “Find in Files” to do multiple files at once.
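For anyone who wants to sanity-check the same counts outside Notepad++, here is a minimal Python sketch of the dummy-file test described above. It assumes that Python's re module behaves like the Notepad++ (Boost) engine for these simple patterns, which is an assumption and not something verified in this thread. The names LINE_104, LINE_102 and count are made up for the illustration.

```python
import re

# The two sample lines quoted earlier in the thread.
LINE_104 = "A0000062700000001 000000012016102820161028NLEE 4 110M00104"
LINE_102 = "A0000058100000001 000000012016100420161004NLEE 4 110M00102"

# Dummy file contents: 20 copies of each line, one per line.
dummy = "\n".join([LINE_104] * 20 + [LINE_102] * 20) + "\n"


def count(pattern, text):
    """Count matches, roughly what the Count button in the Find dialog reports."""
    # re.MULTILINE makes ^ and $ work per line instead of per whole text.
    return len(re.findall(pattern, text, flags=re.MULTILINE))


literal_hits = count(re.escape(LINE_104), dummy)     # the exact long string
hits_104 = count(r"^A00000.*?4 110M00104$", dummy)   # generic pattern, ending 104
hits_102 = count(r"^A00000.*?4 110M00102$", dummy)   # generic pattern, ending 102

print(literal_hits, hits_104, hits_102)
```

If the generic patterns behave as expected, all three counts come out as 20 for this dummy text.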
-
OK, so I’ll try to be more clear: some will end in 2 and some will end in 4. I thought that if the original post with a 2 worked, then I could change it to 4 and it would work as well…
There are about 1000 files downloaded as .txt. They were dumped by a program that used ASCII to generate the files in batch mode; could the encoding be an issue?
this is an example of lines from inside one file
A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 SAFETY FIRST- USE LOCK OUT/TAG OUT PRIOR
A0000059700000001 000000012016101020161010NLEE 4 110M00104 TO DOING THIS JOB
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 Determine what energy sources will be
A0000059700000001 000000012016101020161010NLEE 4 110M00104 locked out. (Electrical, Gas, Pneumatic,
A0000059700000001 000000012016101020161010NLEE 4 110M00104 Hydraulic, Steam, Etc#)
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 IF YOU ARE NOT SURE WITH THE ABOVE- SEE
A0000059700000001 000000012016101020161010NLEE 4 110M00104 SUPERVISOR IMMEDIATELLY
So the find/count option you listed above works great with the long string, like
A0000059700000001 000000012016101020161010NLEE 4 110M00104
Again, it’s the generic one I can’t get to work, no matter the combination I use, and I’ve played with quite a few combinations and versions of ^A00000.*?4 110M00104$
Sorry I’m not catching on fast enough
-
Don’t worry about the 1000 yet; ignore multiple files; we haven’t gotten that far, as I said before. Until you give us good enough data to make a valid search string, neither the 1000 files nor the replace field matters.
Until you can get the generic match working, nothing else matters.
Given your new data set, when you tell the match that the line must end after the 110M00104 by ending it with a $, of course it’s not going to find any of the lines that contain information after the 110M00104.
Let’s take a step back to regular expressions.
First, in my image, you will see that ☑ Regular Expression is selected. Is that what you have? If not, you will need it; otherwise the special characters we are using will not do the wildcard matching we desire.
Next, do you understand what the ^ and $ and .*? do in the regular expression? I am thinking not, given your terminology and the sudden surprise of data after the 110M00104.
- ^ matches beginning of line. So the line must start with whatever follows the ^, or it will not match. No spaces, no tabs, just that A00000 sequence. Do you have spaces at the beginning of the line, or is the A really the first character?
- $ matches end of line. So the line must end with whatever comes before the $… in this case, 110M00104. If the line doesn’t end with that, it cannot match.
- . is a wildcard that matches any single character
- * makes the previous character (or wildcard) match 0 or more instances. So .* means “match 0 or more instances of any character”
- ? placed after a quantifier such as * makes it non-greedy, so .*? means “match 0 or more instances of any character, but don’t be greedy about it” (stop at the first place where the rest of the pattern can match).
- wrapping characters in [] will allow any of those characters to match. For example, if we want to match 4 110M00104 or 4 110M00102, we could write that as 4 110M0010[24]. I will be using that notation below, so that’s why I explained it, even though we hadn’t seen it yet.
Now, let’s answer some questions:
- Will there ever be spaces or tabs or other characters before the A of the A00000 which you said was a constant?
- Will there ever be anything different (B00000, or A12345), or is A00000 really a constant?
- You have mentioned 4 110M00104 and 4 110M00102 as constant “end-markers”. Are there any others? Are there any other restrictions (will there always be a space before the 4? or a space-or-tab? anything else we should know?)
- Can anything occur after that? You have shown some sort of text after, so I am now thinking the answer will be “yes”, but maybe this new data isn’t representative, either, so I cannot really assume anything. If something can come after, are there any restrictions as to what comes after? Will there always be a space between the 4 110M0010[24] and the extra text, or not?
- Are there any restrictions as to what comes between the A00000 and the 4 110M0010[24]? Can it contain the string 4 110M0010[24] or not? You always seem to show the same number of characters; will that always be true, or can it be longer or shorter? Will there ever be more or fewer spaces in the intervening text or not?
- Will any of these matches ever span two or more lines (i.e., will the stuff between the A00000 and the 4 110M0010[24], or the stuff after the 4 110M0010[24], ever contain newlines)?
- Is there any other characteristic or feature or restriction or extra format of the data that you’ve forgotten to mention that I haven’t asked about?
Once we get the right data matching format, and can get it consistently matching what you think are the right number of lines in your data, then we will start working on the replace requirements. Only after we’ve got the search and replace working in one file will we even consider moving on to multiple files. I would love to give it to you all at once, but these surprise-requirements and differences in data rows are making that impossible.
-
So after playing around and searching the web I ended up with this
^A.*110M00104(?!.*110M00104$)
the count works
then I went to Find in Files and it worked as well
going to copy the folder with the files and test
Thanks for the help…
-
Glad it’s working for you.
For extra regex information, here are some good links, compiled and often quoted by @guy038, the forum’s regex guru (I am probably going to start quoting/paraphrasing Guy’s frequent postscript, because it’s an awesome summary of regex information and links.)
Here is a good starting point for NPP users unfamiliar with regular expression concepts and syntax:
Modern Notepad++ (since v6.0) uses the Boost C++ Regex library, v1.55.0 (similar to the Perl Compatible Regular Expressions (PCRE), v5.8):
- search syntax: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
- replace syntax: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html
Other sites with valuable regular expression information include:
-
@PeterJones and @ScottSumner I appreciate the information about the expressions. You’re right, I wasn’t familiar with them. They helped me to finally get the count working, and after that I moved forward with the Find in Files. I really appreciate you guys taking time out to help me understand the software and how much more useful it can be.
is there some way to reward points or something?
-
@Barry-Payne said:
reward points or something?
Yes, we prefer bitcoin … :-D
…or just “upvote”: See the little ^ 0 v to the right of every post? It might say ^ 1 v instead if someone else “liked” the posting previously. PRESS THE ^! Help me catch @Claudia-Frank…just kidding, I could never surpass CF…
Which leads back to my original question on why someone decided your original posting in this thread was worth a “downvote”…?
@PeterJones thanks for picking up this thread and bringing it to an apparently successful conclusion…I got busy and couldn’t get back to it til now…
-
Hello, @barry-payne and All,
Sorry to discuss an already solved problem, but I do think that your regex ^A.*110M00104(?!.*110M00104$) could be simply written: ^A.*110M00104
Indeed, given your text, these two regexes give the same results: all lines are matched, from A0000 till 110M00104!
A0000059700000001 000000012016101020161010NLEE 4 110M00104 DL2 Cross Roller Monthly PM
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 SAFETY FIRST- USE LOCK OUT/TAG OUT PRIOR
A0000059700000001 000000012016101020161010NLEE 4 110M00104 TO DOING THIS JOB
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 Determine what energy sources will be
A0000059700000001 000000012016101020161010NLEE 4 110M00104 locked out. (Electrical, Gas, Pneumatic,
A0000059700000001 000000012016101020161010NLEE 4 110M00104 Hydraulic, Steam, Etc#)
A0000059700000001 000000012016101020161010NLEE 4 110M00104
A0000059700000001 000000012016101020161010NLEE 4 110M00104 IF YOU ARE NOT SURE WITH THE ABOVE- SEE
A0000059700000001 000000012016101020161010NLEE 4 110M00104 SUPERVISOR IMMEDIATELLY
In your text, the string 110M00104 occurs once per line. So, once the first part ^A.*110M00104 has been matched, the negative look-ahead (?!.*110M00104$) is always true, because no further string 110M00104 occurs at the end of the line!
Even if we change the last line in this way:
A0000059700000001 000000012016101020161010NLEE 4 110M00104 110M00104 SUPERVISOR IMMEDIATELLY
either the regex ^A.*110M00104 or your regex ^A.*110M00104(?!.*110M00104$) would match the string A0000059700000001 000000012016101020161010NLEE 4 110M00104 110M00104
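The equivalence of the two regexes above can be checked with a short script. This is a rough sketch in Python, assuming its re module handles greedy .* and the negative look-ahead the same way the Notepad++ engine does here; the names SIMPLE, WITH_LOOKAHEAD and PREFIX are invented for the example, and only a few of the sample lines are used, plus the modified last line. The final substitution with an empty string is also an assumption about what "purely a strip out" means.

```python
import re

SIMPLE = re.compile(r"^A.*110M00104", re.MULTILINE)
WITH_LOOKAHEAD = re.compile(r"^A.*110M00104(?!.*110M00104$)", re.MULTILINE)

PREFIX = "A0000059700000001 000000012016101020161010NLEE 4 110M00104"
lines = [
    PREFIX + " DL2 Cross Roller Monthly PM",
    PREFIX,
    PREFIX + " SAFETY FIRST- USE LOCK OUT/TAG OUT PRIOR",
    PREFIX + " 110M00104 SUPERVISOR IMMEDIATELLY",  # the modified last line
]
text = "\n".join(lines)

# Both regexes should return exactly the same matched strings, line by line.
simple_hits = SIMPLE.findall(text)
lookahead_hits = WITH_LOOKAHEAD.findall(text)
same_results = simple_hits == lookahead_hits

# On the modified line, the greedy .* runs up to the LAST 110M00104.
last_hit = simple_hits[-1]

# Assumed strip-out: replace each match with nothing, keeping the trailing text.
stripped = SIMPLE.sub("", text).splitlines()

print(same_results)
print(last_hit)
print(stripped)
```

Because the greedy .* always lands on the last 110M00104 of a line, nothing is left for the look-ahead to reject, which is why the shorter regex gives the same matches.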
If you really want to match a specific string (for instance, the string ABCD) with the condition that no other string ABCD occurs further on in the same line, you should use the negative look-ahead (?!.*ABCD.*ABCD), evaluated at the beginning of the current line!
So, given the simple example text below:
A000 Line 1 12345 ABCD Some text after
A000 Line 2 12345 ABCD ABCD
A000 Line 3 12345 ABCD
A000 Line 4 12345 ABCD ABCD Test
The 3 regexes below do not match lines 2 and 4, which contain the string ABCD twice :-)
- ^(?!.*ABCD.*ABCD)A.*ABCD matches between A000 and the unique occurrence of ABCD (Lines 1 and 3)
- ^(?!.*ABCD.*ABCD)A.*ABCD(?=.) matches between A000 and the unique occurrence of ABCD, which does not end the line (Line 1 only)
- ^(?!.*ABCD.*ABCD)A.*ABCD(?=\R) matches between A000 and the unique occurrence of ABCD, which ends the line (Line 3 only)
Of course, the four regexes below, without the look-ahead, consider all the lines and, moreover:
- ^A.*ABCD matches between A000 and the last occurrence of the string ABCD (Lines 1 to 4)
- ^A.*?ABCD matches between A000 and the first occurrence of the string ABCD (Lines 1 to 4)
- ^A.*ABCD(?=.) matches between A000 and an occurrence of ABCD which does not end the line (Lines 1, 2 and 4)
- ^A.*ABCD(?=\R) matches between A000 and an occurrence of ABCD which ends the line (Lines 2 and 3)
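The line numbers quoted for these seven regexes can be reproduced with a small script. This is only a sketch in Python, not a Notepad++ feature: Python's re module has no \R, so (?=\n) is swapped in for (?=\R) on the assumption that every line of the sample ends with a plain \n, and the default behaviour of . (not matching a newline) plays the role of the unticked ". matches newline" option. The names TEXT and matching_lines are invented for the example.

```python
import re

# The four-line sample text, each line ending with "\n".
TEXT = (
    "A000 Line 1 12345 ABCD Some text after\n"
    "A000 Line 2 12345 ABCD ABCD\n"
    "A000 Line 3 12345 ABCD\n"
    "A000 Line 4 12345 ABCD ABCD Test\n"
)


def matching_lines(pattern):
    """Return the 1-based numbers of the lines on which the pattern matches."""
    regex = re.compile(pattern, re.MULTILINE)
    # The line number is one more than the count of newlines before the match.
    return [TEXT.count("\n", 0, m.start()) + 1 for m in regex.finditer(TEXT)]


# With the look-ahead at the start: lines containing ABCD twice are skipped.
unique_any = matching_lines(r"^(?!.*ABCD.*ABCD)A.*ABCD")
unique_mid = matching_lines(r"^(?!.*ABCD.*ABCD)A.*ABCD(?=.)")
unique_end = matching_lines(r"^(?!.*ABCD.*ABCD)A.*ABCD(?=\n)")

# Without the look-ahead: every line is considered.
greedy_any = matching_lines(r"^A.*ABCD")
lazy_any = matching_lines(r"^A.*?ABCD")
any_mid = matching_lines(r"^A.*ABCD(?=.)")
any_end = matching_lines(r"^A.*ABCD(?=\n)")

# Greedy and lazy match the same lines, but stop at different ABCDs on line 2.
greedy_line2 = re.search(r"^A.*ABCD", TEXT.splitlines()[1]).group(0)
lazy_line2 = re.search(r"^A.*?ABCD", TEXT.splitlines()[1]).group(0)

print(unique_any, unique_mid, unique_end)
print(greedy_any, lazy_any, any_mid, any_end)
print(repr(greedy_line2), repr(lazy_line2))
```

The greedy and lazy forms agree on which lines match and differ only in where the match stops, which is why both are listed as matching lines 1 to 4.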
Best Regards,
guy038
P.S.:
I forgot to mention that the . matches newline option must be UNTICKED, of course!
-