Notepad++ Searching by columns for non alpha characters
-
I have text files that can be thousands of lines. I am hoping that there is a way to search specific columns and replace non alpha characters with spaces. The files have 5 different types of data and I can’t have any non alpha characters in the first and last names columns or else the file would be rejected. I need to be able to search only the second and third columns. Is there a way to highlight those 2 columns and then do a search and replace? If that is possible, can a parameter be set to replace all non alpha characters at once?
200068897 LLC1 BLUEONI 6085264 20240209
201499188 LLC_ LAKISHM 6085269 20240209
720253950 CUR EASTERN 6085270 20240209 -
Presumably the Columns++ plugin will help you.
-
@Ray-Naylor, your column separator or delimiter seems to be a single space and so I would question your choice of replacing non-alpha characters with spaces. I would replace non-alpha characters with nothing, meaning the values will be shorter, or with a character such as
_
,-
, or~
which keeps the value widths the same.You can deal your project it via a regular expression but using the Columns++ plugin that @Alan-Kilborn suggested will simplify the project for you as Columns++ will take care of parsing the data into columns meaning you can focus on whatever it is you want to do within a column.
-
@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:
I have text files that can be thousands of lines. I am hoping that there is a way to search specific columns and replace non alpha characters with spaces. The files have 5 different types of data and I can’t have any non alpha characters in the first and last names columns or else the file would be rejected. I need to be able to search only the second and third columns. Is there a way to highlight those 2 columns and then do a search and replace? If that is possible, can a parameter be set to replace all non alpha characters at once?
First off… it’s difficult to be sure what your data is, because you didn’t enclose it in a code block. The way to do that is to put ```text on a line by itself before the block and ``` on a line by itself after. When I look at the text that shows up when I do a quoted response to your post, it looks like your example is this:
200068897 LLC1 BLUEONI 6085264 20240209 201499188 LLC_ LAKISHM 6085269 20240209 720253950 CUR EASTERN 6085270 20240209
but I suspect you typed it in rather that copied and pasted, and the proportional font in the input window led you to mess up your spacing; so that your data is more like this:
200068897 LLC1 BLUEONI 6085264 20240209 201499188 LLC_ LAKISHM 6085269 20240209 720253950 CUR EASTERN 6085270 20240209
If the columns do line up like that, the Columns++ plugin might be able to help. However, it only works on one file at a time. My guess from your post is that you might have a large number of files and were hoping to do them all at once rather than one by one.
If you have many files, then we’re back to regular expressions. The trick will be to understand exactly what the rule is for dividing into columns — is it that any string of one or more blanks separates columns? Is it that each column has specific character positions within the line (e.g., first column is positions 1-9, second columns is 11-18, etc.) regardless of the data?
It is probably possible to write a regular expression, but to help you, we need to know a little more. Also, tell us if you are at all familiar with regular expressions for search and replace, so we have an idea where to start explaining.
-
You could use regular expressions to edit your tab-separated variables files, but I would recommend learning Python’s pandas or something like that and making your life easier. The CSVLint plugin can auto-generate Python scripts that you might find helpful if you’re new to pandas.
And yes, I know that people have all sorts of reasons for not following this advice. If you are one of those people, I respect your decision and don’t think less of you, and you don’t need to reply to this comment to explain yourself.
-
Thanks for the response. Columns++ helps a lot. Do you know what parameters I can put into the Find what field that I could use for replacing all non alpha characters? I tried [^a-zA-Z0-9\s]+ but that didn’t find anything. Searching by individual characters does work (1, 2, 3, #, @ etc.) works though
-
@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:
Thanks for the response. Columns++ helps a lot. Do you know what parameters I can put into the Find what field that I could use for replacing all non alpha characters? I tried [^a-zA-Z0-9\s]+ but that didn’t find anything. Searching by individual characters does work (1, 2, 3, #, @ etc.) works though
You probably want
[^\d\l\u\s]
— match any single character that is not a digit, a lowercase letter, an uppercase letter, or a space.The problem with A-Z and a-z is that ranges are based on the numeric byte values assigned to the characters. The alphabetic characters are not contiguous in ASCII, so various other characters get caught up in those ranges.Edit: As @mkupper noted below, the last paragraph above is incorrect. I’m not sure why your expression didn’t work; it should have matched the underscore in your example data, as it does when I try it.
-
Thanks, that replaced everything that wasn’t a letter except it didn’t replace the numbers. I would have thought by your description that the “d” would have replaced them.
-
@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:
Thanks, that replaced everything that wasn’t a letter except it didn’t replace the numbers. I would have thought by your description that the “d” would have replaced them.
The
[^
…]
construction is listing what is excluded. If you don’t want to exclude digits, leave out the\d
. -
@Coises said in Notepad++ Searching by columns for non alpha characters:
The problem with A-Z and a-z is that ranges are based on the numeric byte values assigned to the characters. The alphabetic characters are not contiguous in ASCII, so various other characters get caught up in those ranges.
The A-Z, a-z alphabetic, and 0-9 characters are all contiguous within themselves in ASCII, ANSI, and Unicode. Maybe you are thinking of EBCDIC?
There was nothing wrong with @Ray-Naylor’s
[^a-zA-Z0-9\s]+
other than it included\s
which is spaces, tabs, newlines, and much more.@Ray-Naylor - you are on the right track. I personally would continue to experiment. For example, put
200068897 LLC1 BLUEONI 6085264 20240209 201499188 LLC_ LAKISHM 6085269 20240209 720253950 CUR EASTERN 6085270 20240209
in a Notepad++ tab.
Hint 1 - Putting
(?-i)
in a regular expression turn off the case-sensitive checkbox in the Find/Search panel. Typically we put the(?-i)
at the front of the expression as it sets the flag state from that point on to the right when reading the expression. You are allowed to flip the ignore-case flag on(?i)
and off(?-i)
as desired within an expression.Hint 2 - When experimenting I often use Mark (
Ctrl+M
) which adds a red background to the stuff that matches. Remember to clickClear all marks
to retest and confirm stuff.Hint 3 - When experimenting I often do a Find (
Ctrl+F
), close the find box, and then tap theF3
key to do the find again and see what it selects.Hint 4 - I will use
\x20
at times below as that matches a space. It’s hard to accurately see spaces on the forums and within regular expressions and so I use\x20
when the use of a space is not obvious. You can use either\x20
or(?-i)[A-Z]+
will match one or more upper-case alphabetic characters.(?-i)[^A-Z]+
will match one or more characters that are not upper-case alphabetic characters. This also matches the end of line characters.
Experiment with both
(?-i)[A-Z]+
and(?-i)[^A-Z]+
of those and see what it finds or does not find in your data.I personally almost never use
\s
as it matches spaces, tabs, plus the end of line characters, and much more. The full list of what\s
matches seems to be:HT
\x{0009} HT - CHARACTER TABULATION (\t
)LF
\x{000A} LF - LINE FEED (\n
)VT
\x{000B} VT - LINE TABULATION (there is\v
but that also matches end of line characters and more)FF
\x{000C} FF - FORM FEED (\f
)CR
\x{000D} CR - CARRIAGE RETURN (\r
)SP
\x{0020} SPACENBSP
\x{00A0} NO-BREAK SPACEOSPM
\x{1680} OGHAM SPACE MARKMVS
\x{180E} MONGOLIAN VOWEL SEPARATORNQSP
\x{2000} EN QUADMQSP
\x{2001} EM QUAD (mutton quad; nominally, the height of the font)ENSP
\x{2002} EN SPACE (nut; half an em)EMSP
\x{2003} EM SPACE (mutton; nominally, a space equal to the type size in points; may scale by the condensation factor of a font)3/MSP
\x{2004} THREE-PER-EM SPACE (thick space)4/MSP
\x{2005} FOUR-PER-EM SPACE (mid space)6/MSP
\x{2006} SIX-PER-EM SPACE (in computer typography sometimes equated to thin space)FSP
\x{2007} FIGURE SPACE (space equal to tabular width of a font; this is equivalent to the digit width of fonts with fixed-width digits)PSP
\x{2008} PUNCTUATION SPACE (space equal to narrow punctuation of a font)THSP
\x{2009} THIN SPACE (a fifth of an em (or sometimes a sixth))HSP
\x{200A} HAIR SPACE (thinner than a thin space; in traditional typography, the thinnest space available)LS
\x{2028} LINE SEPARATORPS
\x{2029} PARAGRAPH SEPARATORNNBSP
\x{202F} NARROW NO-BREAK SPACE (commonly abbreviated NNBSP; a narrow form of a no-break space, typically the width of a thin space or a mid space)MMSP
\x{205F} MEDIUM MATHEMATICAL SPACE (abbreviated MMSP; four-eighteenths of an em)IDSP
\x{3000} IDEOGRAPHIC SPACE
as
\s
is so inclusive in what it matches I tend to use[\x20\t]
to match either space or tab and[^\x20\t]
to match anything that is not a space or tab.Using Marking or the Find and
F3
thing experiment with\s+
,[^\s]+
,[\x20\t]
, and[^\x20\t]
to see what happens with your data.\d
matches the decimal digits[0-9]
. It’s not quite as deadly as\s
which includes the end of line marks as\d
only matches about 100 other characters:- ²³
- ¹
- ٠١٢٣
- ٤٥٦٧٨٩
- ۰۱۲۳۴۵۶۷۸۹
- ߀߁߂߃߄߅߆߇߈߉
- ०१२३४५६७८९
- ০১২৩৪৫৬৭৮৯
- ੦੧੨੩੪੫੬੭੮੯
- ૦૧૨૩૪૫૬૭૮૯
- ୦
- ୧୨୩୪୫୬୭୮୯
- 0௧௨௩௪௫௬௭௮௯
- ౦౧౨౩౪౫౬౭౮౯
- ೦೧೨೩೪೫೬೭೮೯
- ൦൧൨൩൪൫൬൭൮൯
- ๐๑๒๓๔๕๖๗๘๙
- ໐໑໒໓໔໕໖໗໘໙
- ༠༡༢༣༤༥༦༧༨
- ༩
- ၀၁၂၃၄၅၆၇၈၉
- ႐႑႒႓႔႕႖႗႘႙
- ០១២៣៤៥៦៧៨៩
- ᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙
- ᥆᥇᥈᥉᥊᥋᥌᥍
- ᥎᥏
- ᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙
- ᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙
- ᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹
- ᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉
- ᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙
- ꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩
- ꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙
- ꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉
- ꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙
- 0123456789
For your data you can safely experiment with
[0-9]+
,[^0-9]+
,\d+
, and[^\d]+
though should discover that[^0-9]+
and[^\d]+
also match end of line marks.To make
[^0-9]+
and[^\d]+
less inclusive you could experiment with[^0-9\r\n]+
and[^\d\r\n]+
or even[^0-9\s]+
and[^\d\s]+
.While
\R
matches end of line marks you can’t use it within[ ]
blocks. Experiment with[\R]
or[^\R]
to discover that it matches or or does not match just the letterR
. The reason for this exception is that\R
is handled like(\r\n|\r|\n)
. While\R
matches\r\n
it does not match\n\r
in that order though it will match the\n
and then matches the\r
. That little business of matching the two character\r\n
but not\n\r
is why\R
can’t be used within[ ]
blocks. -
@mkupper said in Notepad++ Searching by columns for non alpha characters:
The A-Z, a-z alphabetic, and 0-9 characters are all contiguous within themselves in ASCII, ANSI, and Unicode. Maybe you are thinking of EBCDIC?
I’m not sure what was on my mind. I haven’t used EBCDIC since 1991, but… I should have checked my memory, in any case. Thank you for catching that.
-
@Coises
Thank you soo much! Works perfectly now.