Advance Replace including right trim (repost with example)
-
Hi,
I have read many posts on the subject but cannot find the desired solution. I posted this question before but am reposting it now with an example and something I forgot to mention (I couldn’t edit the original post).
For those who replied to my original deleted post, thanks so far but the solutions didn’t work.
The issue is as follows:
I have a CSV file that I want to use as an external table in Oracle.
It is formatted like this:
<values Field01>;<values Field02>;<values Field03>; <values Field04>
In fact there are more fields, but my issue is with the last field.
Fields 1 to 3 are defined as Char (255) fields in the Oracle external table.
The last field, Field04, is a memo field that should be max. 4000 characters in the interface CSV file, but, probably due to the character set, the last field is way over 4000 characters (according to Excel).
In this last field LineFeed characters might be present.
End of record/line is CarriageReturn + LineFeed.
I need to find the pattern "n-characters field01"Semicolon"n-characters field02"Semicolon"n-characters field03"Semicolon (or last delimiter found)
and trim the rest after the last semicolon to 4000 characters to let it fit in the Oracle table.
So: fields 1 to 3 with variable length + 4000 additional characters for field04.
For my example, instead of max 4000 characters, let's say I want max 10 characters in the last field until EOL.
I want this original file:
xxxxxxxxxxxx;yyyy;zzzzzzzzzz;12\n34\n5678901234567890\r\n
xxx;yyyyyyyy;zzzzzz;12345\r\n
xxxxxx;yyy;zzzzzzzzzzzzzzzzzzz;1\n2\n3456789012345678901234567890\r\n
xxxxxx;yyyyy;zzzzzzzzzzzzzzzzzzzzz;1234567890\r\n
changed/replaced to:
xxxxxxxxxxxx;yyyy;zzzzzzzzzz;12\n34\n5678\r\n
xxx;yyyyyyyy;zzzzzz;12345\r\n
xxxxxx;yyy;zzzzzzzzzzzzzzzzzzz;1\n2\n345678\r\n
xxxxxx;yyyyy;zzzzzzzzzzzzzzzzzzzzz;1234567890\r\n
Hope that there’s a Notepad++ wizard here who can solve this.
Thanks in advance!
Mike
-
My immediate thought was to “undelete” your deleted post. I am a (recent new) moderator and I consider you deleting that post after others have posted replies to it to be “bad form”. You probably had the best of intentions but anyone coming along some time in the future will attempt to read that thread and it won’t make sense.
That was your first mistake, the second was to start a new thread. It would have been preferable to just append this post to the original thread. We often have first time posters adding additional relevant information after prodding by the other members.
Since this is just my opinion I won’t “undelete” nor “move” this to the original thread. I will leave that to a more senior moderator to consider.
While I’m telling you what you should have done, here’s another one. When showing examples, please include them in a code box. That helps to prevent the posting engine from mangling the data.
All of this information is in the FAQ and also pinned at the start of each category. Unfortunately you, like many new posters, don’t read them. Yet you post here in the hope to get answers, so I guess my question is why didn’t you read the FAQ and pinned posts?
Terry
-
I think the expression you want is this:
^((?:(?:[^";\r\n]*+|\h*+"(?:[^"]|"")*+"\h*+);){3})(?:([^";\r\n]{0,4000}+)[^";\r\n]*+|(\h*+"(?:[^"]|""){0,4000}+)(?:[^"]|"")*+("\h*+))$
with replacement:
$1$2$3$4
In your example, you have values containing new line characters that are not quoted; that’s normally invalid in a CSV. If you really have that in your file, change the [^";\r\n] sequences to [^";\r].
-
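A minimal sketch of the difference between the [^";\r\n] and [^";\r] character classes suggested above, written in Python purely for illustration (the thread itself is about Notepad++). The helper name `unquoted_prefix` is hypothetical, and the possessive `*+` of the original expression is simplified here to a plain `*`:

```python
import re

STRICT = re.compile(r'[^";\r\n]*')  # stops at any line-break character
RELAXED = re.compile(r'[^";\r]*')   # lets a bare \n stay inside the field

def unquoted_prefix(text, allow_lf):
    """Hypothetical helper: the unquoted run matched at the start of text."""
    pattern = RELAXED if allow_lf else STRICT
    return pattern.match(text).group(0)

sample = "12\n34\n5678\r\n"
strict_result = unquoted_prefix(sample, allow_lf=False)   # stops before the first \n
relaxed_result = unquoted_prefix(sample, allow_lf=True)   # runs up to the \r
```

With the relaxed class, only the carriage return ends the unquoted value, which is what a last field containing bare LineFeed characters needs.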
Hi, @mike-albers, @mark-olson, @terry-r, @coises and All,
Ah, I now understand that \n may occur in the first 4,000 characters of the last field ! A completely different goal to reach !
If I still assume that no line-break occurs in the first 3 fields
And given the INPUT file :
xxxxxxxxxxxx;yyyy;zzzzzzzzzz;12
34
5678901234567890
xxx;yyyyyyyy;zzzzzz;12345
xxx;yyyyyyyy;zzzzzz;
xxxxxx;yyy;zzzzzzzzzzzzzzzzzzz;1
2
3456789012345678901234567890
xxxxxx;yyyyy;zzzzzzzzzzzzzzzzzzzzz;1234567890
IMPORTANT : After pasting the INPUT text above in a new tab, you must replace the \r\n line-break at the end of lines 1, 2, 6 and 7 with \n, BEFORE you apply the regex S/R !
The following regex S/R should work :
- FIND (?s)^(?:[^\r\n;]+;){3}.{0,10}\r\n|^((?:[^\r\n;]+;){3}.{10}).+?\r\n
- REPLACE ?1$1\r\n:$0
And produce this OUTPUT text :
xxxxxxxxxxxx;yyyy;zzzzzzzzzz;12
34
5678
xxx;yyyyyyyy;zzzzzz;12345
xxx;yyyyyyyy;zzzzzz;
xxxxxx;yyy;zzzzzzzzzzzzzzzzzzz;1
2
345678
xxxxxx;yyyyy;zzzzzzzzzzzzzzzzzzzzz;1234567890
As you can see, after the last ; of each record :
- The string 12\n34\n5678, in lines 1, 2 and 3, correctly contains 10 characters and the final \r\n
- The line 4 contains the string 12345 and the final \r\n
- The line 5 contains an empty string and the final \r\n
- The string 1\n2\n345678, in lines 6, 7 and 8, correctly contains 10 characters and the final \r\n
- The line 9 contains the string 1234567890 and the final \r\n
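For readers who want to check the behaviour outside Notepad++, here is a rough Python sketch of the same search and replace on the sample data. It is only an approximation under stated assumptions: Python's `re` needs `re.MULTILINE` for `^` to match at line starts, and it has no `?1...:...` conditional replacement, so a callback stands in for it. The names `PATTERN`, `_replace` and `truncate_sample` are made up for this illustration.

```python
import re

LIMIT = 10  # stand-in for 4000, as in the example

# Same expression as the FIND field, with the limit filled in.
PATTERN = re.compile(
    r"(?s)^(?:[^\r\n;]+;){3}.{0,%d}\r\n|^((?:[^\r\n;]+;){3}.{%d}).+?\r\n"
    % (LIMIT, LIMIT),
    re.MULTILINE,
)

def _replace(match):
    # Stand-in for the Notepad++ replacement  ?1$1\r\n:$0
    if match.group(1) is not None:
        return match.group(1) + "\r\n"   # over-long record: keep the capped part
    return match.group(0)                # short record: leave untouched

def truncate_sample(text):
    return PATTERN.sub(_replace, text)

SAMPLE = (
    "xxxxxxxxxxxx;yyyy;zzzzzzzzzz;12\n34\n5678901234567890\r\n"
    "xxx;yyyyyyyy;zzzzzz;12345\r\n"
    "xxx;yyyyyyyy;zzzzzz;\r\n"
    "xxxxxx;yyy;zzzzzzzzzzzzzzzzzzz;1\n2\n3456789012345678901234567890\r\n"
    "xxxxxx;yyyyy;zzzzzzzzzzzzzzzzzzzzz;1234567890\r\n"
)
result = truncate_sample(SAMPLE)
```

When reading a real file in Python, opening it with `open(path, newline="")` keeps the \r\n and \n characters as they are, which this pattern relies on.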
Now, as you said :
In fact there are more fields, but my issue is with the last field.
I suppose that you must replace the number 3 in the regex with the exact number of fields before the last one, whose size is over 4,000 characters
Thus, the general regex S/R is :
- FIND (?s)^(?:[^\r\n;]+;){N}.{0,4000}\r\n|^((?:[^\r\n;]+;){N}.{4000}).+?\r\n
- REPLACE ?1$1\r\n:$0
Where N is the number of fields before the last one !
If, in addition, the number of fields is variable, you could replace the two {N} syntaxes with the {x,y} syntax, where x and y represent integers
Best regards,
guy038
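The general form can be sketched as a small Python helper that fills in N, the length limit and, optionally, an {x,y} range. This is an illustration only, not part of the Notepad++ solution: `build_pattern` and `truncate_last_field` are hypothetical names, `re.MULTILINE` is added because Python's `^` otherwise matches only at the start of the text, and a callback replaces the `?1$1\r\n:$0` conditional.

```python
import re

def build_pattern(n_fields, max_len, max_fields=None):
    """Hypothetical helper: build the general FIND expression.

    n_fields   -> N, the number of fields before the last one
    max_len    -> maximum length kept in the last field (4000 in the thread)
    max_fields -> if given, use the {x,y} form with x=n_fields, y=max_fields
    """
    if max_fields is None:
        count = "{%d}" % n_fields
    else:
        count = "{%d,%d}" % (n_fields, max_fields)
    head = r"(?:[^\r\n;]+;)" + count
    return re.compile(
        r"(?s)^%s.{0,%d}\r\n|^(%s.{%d}).+?\r\n" % (head, max_len, head, max_len),
        re.MULTILINE,
    )

def truncate_last_field(text, n_fields, max_len, max_fields=None):
    pattern = build_pattern(n_fields, max_len, max_fields)

    def _replace(match):
        # Group 1 is only set when the record was too long.
        if match.group(1) is not None:
            return match.group(1) + "\r\n"
        return match.group(0)

    return pattern.sub(_replace, text)
```

For instance, `truncate_last_field(text, 3, 4000)` would correspond to the case of three leading fields described in the question.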
-
-
@Terry-R Sorry Terry, I did not read all of it. I was confused because it wasn’t possible to edit my original post more than 4 hours after first posting. Since there were just a few respondents, it seemed better to have the complete issue at the top of the post. Otherwise new replies would probably be based on the old information. I guess that people will not go through the whole discussion first.
Sorry for the inconvenience.
It would be nice if, after the 4 hours, a direct timestamped addendum to the original post were allowed instead of a reply.
Will not make the same mistake again.
Keep up the good work!
-
@guy038 Hi Guy, I think your solution is working after all.
In the tool something strange happens, but in Notepad it seems to work properly.
I tried it out on the test file with the \n characters in the 4th field.
Now I will try it on my real-life CSV file to see what happens there.
So far so good.
Thanks!
-
@Mike-Albers said:
I tried out your solution with the online regex tool at regex101 site but it is not working.
Some of these regexes are quite “involved”. The more involved they are, the less likely they are to work in both regex101 and Notepad++; the reason for this is that they use different regular expression engines and all engines have nuanced processing when the regexes are not simple. It may not be the case here, but you should try all advice provided in Notepad++'s replace before coming to a conclusion.
-
This post has 7 revisions
As I typed my reply, I kept seeing screen flashes, so I investigated.
It appears that Guy is uber-editing his earlier response.
Hopefully, he’s not changing history, and is only making harmless edits.
Otherwise, how is @Mike-Albers to “keep up” with the advice being provided?
EDIT: Now:
This post has 8 revisions
-
Hello, @alan-kilborn and All,
I agree that I edited my previous post a lot of times.
But it’s just because if you just paste the INPUT text in a new tab, you get all the sentences with a final line-break = \r\n
And, of course, the regex S/R would not work in this case :-((
BR
guy038
-
@Alan-Kilborn you are right. I jumped to conclusions.
Tried in Notepad++ and the solution from Guy seems to work after all. Changed the reply asap. :-)
-
@Alan-Kilborn said,
It appears that Guy is uber-editing his earlier response
@Mike-Albers said,
Changed the reply asap
In general, my preference is that posts not get edited after there’s a reply, because that breaks the flow of the conversation. In extreme circumstances, if there is an edit after a reply, I highly encourage marking it like “edit: xyz” or similar, or, if there’s a bunch of information that turns out to be wrong, using the ~~~ to strikethrough, like, “old incorrect information [edited: see my reply below]”. This allows people to see what was being responded to in the immediate replies, but informs them that something has been updated.
It should be noted that even changing a post before there are any replies is dangerous, because someone may have read your original, and may even be replying while quoting your original text; having your text now be different makes it look like the person is misquoting you, which they are not actually doing.
(This discussion has a case in point: Alan quoted the regex101 line, and now it’s been edited away.)
As said in another forum where I spend a lot of time, “It is uncool to update a [post] in a way that renders replies confusing or meaningless”.
-
Amen. Don’t change posts after posting them, unless you are 100% sure you aren’t changing any meaning. That is, only change an obvious typo (but NOT one in an “expression”). Otherwise, follow Peter’s excellent advice.
-
@guy038 Hi Guy, I studied your solution and the regex itself, and it is starting to dawn on me.
I changed my test file and also tried to handle empty fields. For that I changed your search string a tiny bit and also added an extra OR clause.
It seems to work properly now.
My latest testfile was like this:
My search pattern is now:
(?s)^(?:[^\r\n;]*;){3}.{0,24}\r\n|^((?:[^\r\n;]*;){3}.{0,24}).*?\r\n|^(?:[^\r\n;]*?)\r\n
The replace statement is still yours:
?1$1\r\n:$0
Result was:
I tried to figure out the replace string, but I don’t get it.
(I tried self-study on it with the regex101 tool bit by bit, but since it is not 100% compatible I couldn’t figure it out myself.) Really no laziness on my part here when I ask my question.
So I hope you can explain it step by step for me.
Thanks!
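One way to read the replace string `?1$1\r\n:$0` is as a conditional: if group 1 took part in the match, write group 1 followed by \r\n, otherwise write the whole match ($0) back unchanged. The Python sketch below spells that out step by step. It is an illustration under assumptions rather than Notepad++ itself: it uses guy038's original FIND expression with the limit of 10, adds `re.MULTILINE` because Python needs it for `^`, and the function name `explain` and the labels "truncated" and "kept" are invented for this example.

```python
import re

FIND = re.compile(
    r"(?s)^(?:[^\r\n;]+;){3}.{0,10}\r\n|^((?:[^\r\n;]+;){3}.{10}).+?\r\n",
    re.MULTILINE,
)

def explain(text):
    """Return one (decision, replacement) pair per match, in order."""
    steps = []
    for m in FIND.finditer(text):
        if m.group(1) is not None:
            # Second alternative matched: the record is too long.
            # "?1" is true, so the part before ":" is used: $1 plus \r\n
            steps.append(("truncated", m.group(1) + "\r\n"))
        else:
            # First alternative matched: group 1 never took part.
            # "?1" is false, so the part after ":" is used: $0
            steps.append(("kept", m.group(0)))
    return steps

steps = explain("a;b;c;12345\r\na;b;c;123456789012345\r\n")
```

Here the first record is short enough for the first alternative, so it is written back as is; the second record only matches the alternative containing the parentheses, so only the captured part plus a fresh \r\n is written.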