Delete ";" in a description column
-
Hi,
I need to delete the characters “;” that are in a column named “Description”, but the problem is that I have more than 3 columns separated by “;” as well (Code, Quantity, Price).
Example:
Code;Description;Quantity;Price
0001;Chocol;te;12.00;1.50I need as follow:
Code;Description;Quantity;Price
0001;Chocol te;12.00;1.50The code always have 4 characters, the problem is that the description has no fixed number of characters.
The file has more than a million rows.
Thanks for your help.
-
In general, it’s better to fix poorly formatted CSV (where “c” means “character”, not “comma”, so “semicolon-separated values” is also a form of CSV) at the source, rather than after the fact.
If you cannot, and assuming that the rule that I am about to state will work for your data (it won’t work for every data; if there’s ever quotes or escapes that change the rules, my regex will not work), then here’s a go.
Rule: first semicolon is always valid; last two semicolons are always valid (the penultimate and ultimate); any semicolon between the first semicolon and the penultimate semicolon is assumed invalid, and will be converted into a space character.
- FIND =
^[^;]*;[^;]*\K;(?=.*;[^;]*;[^;]*$)
- REPLACE = space character
- SEARCH MODE = Regular Expression
- hit Replace All until they are all gone
This finds the beginning of the line, any number of non-semicolons, then the first semicolon, then any number of non-semicolons, then resets the match. This effectively moves the cursor to just at the second semicolon in your line.
Then it matches the literal second semicolon in your line.
The stuff in the parentheses is a lookahead saying, effectively, if there are at least two semicolons later in this line, then the semicolon we just found is not the first, penultimate, or ultimate, so we want to replace it with a space.
With this version, you have to hit replace all multiple times in some circumstances.
0001;Chocol;te;12.00;1.50 0001;C;o;o;a;e;12.00;1.50 0001;;;;;;;;;;;12.00;1.50 0001;Chocolate;12.00;1.50
becomes
0001;Chocol te;12.00;1.50 0001;C o o a e;12.00;1.50 0001; ;12.00;1.50 0001;Chocolate;12.00;1.50
after 9 Replace Alls
If there are N+3 semicolons on the “worst” line, it will only require N hits of Replace All
Ooh, an idea. Using the
\G
to mean “start from the previous match”, and an alternation,- FIND =
(^[^;]*;|\G)[^;]*\K;(?=.*;[^;]*;[^;]*$)
- other parameters the same
it will now match “after the first semicolon, or after the previous match”, as long as there are still at least two semicolons after the current semicolon. This means that the same data will only require one Replace All to work.
There are probably edge cases that are not handled in the expressions given. Oh, like what if there are lines that don’t have at least three semicolons… In certain circumstances, you might need to change
[^;]*
(which says “0 or more non-semicolons”) into[^;\r\n]*
(which says “0 or more characters that are neither semicolons nor CR nor LF (EOL sequences)”) – there might be some way that the simpler form would match across multiple lines; with that correction, it wouldn’t. But my existing expressions work with my test data.If that’s not sufficient, study the information below, try modifying it yourself, and if you’re having difficulty understanding where it’s going wrong, ask with more details (including what you tried; make sure to include that you tried my exact expression(s) with my sample dummy data set)
----
Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as plain text using the
</>
toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard to your post usingCtrl+V
to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure you examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get… Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries. - FIND =
-
Thank you very much. It helps me a lot.
Now, I just have received another file with a Column added at the beginning and need to replace de “;” in the Description column. (As previous problem)
Example:
New line:
STORE;CODE;DESCRIPTION;QUANTITY;PRICE
0036;359116;CHOCOLA;E;0.000;4.993028
Must be:
STORE;CODE;DESCRIPTION;QUANTITY;PRICE
0036;359116;CHOCOLA E;0.000;4.993028What do I have to modify on the regex that you sent before ?
Thanks for your help.
-
This post is deleted! -
@PeterJones
Hi Peter,
After a lot of tries, it works with the following expression:(^[^;];[^;];|\G)[^;]\K;(?=.;[^;];[^;]$)
Thank you very much.