Notepad++ Searching by columns for non alpha characters

Ray Naylor

I have text files that can be thousands of lines. I am hoping that there is a way to search specific columns and replace non alpha characters with spaces. The files have 5 different types of data and I can’t have any non alpha characters in the first and last names columns or else the file would be rejected. I need to be able to search only the second and third columns. Is there a way to highlight those 2 columns and then do a search and replace? If that is possible, can a parameter be set to replace all non alpha characters at once?

200068897 LLC1 BLUEONI 6085264 20240209
201499188 LLC_ LAKISHM 6085269 20240209
720253950 CUR EASTERN 6085270 20240209

Alan Kilborn

@Ray-Naylor

Presumably the Columns++ plugin will help you.

mkupper

@Ray-Naylor, your column separator or delimiter seems to be a single space and so I would question your choice of replacing non-alpha characters with spaces. I would replace non-alpha characters with nothing, meaning the values will be shorter, or with a character such as _, -, or ~ which keeps the value widths the same.

You can deal your project it via a regular expression but using the Columns++ plugin that @Alan-Kilborn suggested will simplify the project for you as Columns++ will take care of parsing the data into columns meaning you can focus on whatever it is you want to do within a column.

Coises

@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:

I have text files that can be thousands of lines. I am hoping that there is a way to search specific columns and replace non alpha characters with spaces. The files have 5 different types of data and I can’t have any non alpha characters in the first and last names columns or else the file would be rejected. I need to be able to search only the second and third columns. Is there a way to highlight those 2 columns and then do a search and replace? If that is possible, can a parameter be set to replace all non alpha characters at once?

First off… it’s difficult to be sure what your data is, because you didn’t enclose it in a code block. The way to do that is to put ```text on a line by itself before the block and ``` on a line by itself after. When I look at the text that shows up when I do a quoted response to your post, it looks like your example is this:

200068897        LLC1                  BLUEONI           6085264                     20240209                    
201499188        LLC_                   LAKISHM             6085269                     20240209                    
720253950        CUR                  EASTERN              6085270                     20240209

but I suspect you typed it in rather that copied and pasted, and the proportional font in the input window led you to mess up your spacing; so that your data is more like this:

200068897 LLC1     BLUEONI     6085264 20240209
201499188 LLC_     LAKISHM     6085269 20240209
720253950 CUR      EASTERN     6085270 20240209

If the columns do line up like that, the Columns++ plugin might be able to help. However, it only works on one file at a time. My guess from your post is that you might have a large number of files and were hoping to do them all at once rather than one by one.

If you have many files, then we’re back to regular expressions. The trick will be to understand exactly what the rule is for dividing into columns — is it that any string of one or more blanks separates columns? Is it that each column has specific character positions within the line (e.g., first column is positions 1-9, second columns is 11-18, etc.) regardless of the data?

It is probably possible to write a regular expression, but to help you, we need to know a little more. Also, tell us if you are at all familiar with regular expressions for search and replace, so we have an idea where to start explaining.

Mark Olson

You could use regular expressions to edit your tab-separated variables files, but I would recommend learning Python’s pandas or something like that and making your life easier. The CSVLint plugin can auto-generate Python scripts that you might find helpful if you’re new to pandas.

And yes, I know that people have all sorts of reasons for not following this advice. If you are one of those people, I respect your decision and don’t think less of you, and you don’t need to reply to this comment to explain yourself.

Ray Naylor

@mkupper

Thanks for the response. Columns++ helps a lot. Do you know what parameters I can put into the Find what field that I could use for replacing all non alpha characters? I tried [^a-zA-Z0-9\s]+ but that didn’t find anything. Searching by individual characters does work (1, 2, 3, #, @ etc.) works though

Coises

@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:

@mkupper

Thanks for the response. Columns++ helps a lot. Do you know what parameters I can put into the Find what field that I could use for replacing all non alpha characters? I tried [^a-zA-Z0-9\s]+ but that didn’t find anything. Searching by individual characters does work (1, 2, 3, #, @ etc.) works though

You probably want [^\d\l\u\s] — match any single character that is not a digit, a lowercase letter, an uppercase letter, or a space.

The problem with A-Z and a-z is that ranges are based on the numeric byte values assigned to the characters. The alphabetic characters are not contiguous in ASCII, so various other characters get caught up in those ranges.

Edit: As @mkupper noted below, the last paragraph above is incorrect. I’m not sure why your expression didn’t work; it should have matched the underscore in your example data, as it does when I try it.

Ray Naylor

@Coises

Thanks, that replaced everything that wasn’t a letter except it didn’t replace the numbers. I would have thought by your description that the “d” would have replaced them.

Coises

@Ray-Naylor said in Notepad++ Searching by columns for non alpha characters:

@Coises

Thanks, that replaced everything that wasn’t a letter except it didn’t replace the numbers. I would have thought by your description that the “d” would have replaced them.

The [^…] construction is listing what is excluded. If you don’t want to exclude digits, leave out the \d.

mkupper

@Coises said in Notepad++ Searching by columns for non alpha characters:

The problem with A-Z and a-z is that ranges are based on the numeric byte values assigned to the characters. The alphabetic characters are not contiguous in ASCII, so various other characters get caught up in those ranges.

The A-Z, a-z alphabetic, and 0-9 characters are all contiguous within themselves in ASCII, ANSI, and Unicode. Maybe you are thinking of EBCDIC?

There was nothing wrong with @Ray-Naylor’s [^a-zA-Z0-9\s]+ other than it included \s which is spaces, tabs, newlines, and much more.

@Ray-Naylor - you are on the right track. I personally would continue to experiment. For example, put

200068897        LLC1                  BLUEONI           6085264                     20240209
201499188        LLC_                   LAKISHM             6085269                     20240209
720253950        CUR                  EASTERN              6085270                     20240209

in a Notepad++ tab.

Hint 1 - Putting (?-i) in a regular expression turn off the case-sensitive checkbox in the Find/Search panel. Typically we put the (?-i) at the front of the expression as it sets the flag state from that point on to the right when reading the expression. You are allowed to flip the ignore-case flag on (?i) and off (?-i) as desired within an expression.

Hint 2 - When experimenting I often use Mark (Ctrl+M) which adds a red background to the stuff that matches. Remember to click Clear all marks to retest and confirm stuff.

Hint 3 - When experimenting I often do a Find (Ctrl+F), close the find box, and then tap the F3 key to do the find again and see what it selects.

Hint 4 - I will use \x20 at times below as that matches a space. It’s hard to accurately see spaces on the forums and within regular expressions and so I use \x20 when the use of a space is not obvious. You can use either \x20 or (yes, there is a red space hidden there!) in your expressions.

(?-i)[A-Z]+ will match one or more upper-case alphabetic characters.
(?-i)[^A-Z]+ will match one or more characters that are not upper-case alphabetic characters. This also matches the end of line characters.

Experiment with both (?-i)[A-Z]+ and (?-i)[^A-Z]+ of those and see what it finds or does not find in your data.

I personally almost never use \s as it matches spaces, tabs, plus the end of line characters, and much more. The full list of what \s matches seems to be:

HT \x{0009} HT - CHARACTER TABULATION (\t)
LF \x{000A} LF - LINE FEED (\n)
VT \x{000B} VT - LINE TABULATION (there is \v but that also matches end of line characters and more)
FF \x{000C} FF - FORM FEED (\f)
CR \x{000D} CR - CARRIAGE RETURN (\r)
SP \x{0020} SPACE
NBSP \x{00A0} NO-BREAK SPACE
OSPM \x{1680} OGHAM SPACE MARK
MVS \x{180E} MONGOLIAN VOWEL SEPARATOR
NQSP \x{2000} EN QUAD
MQSP \x{2001} EM QUAD (mutton quad; nominally, the height of the font)
ENSP \x{2002} EN SPACE (nut; half an em)
EMSP \x{2003} EM SPACE (mutton; nominally, a space equal to the type size in points; may scale by the condensation factor of a font)
3/MSP \x{2004} THREE-PER-EM SPACE (thick space)
4/MSP \x{2005} FOUR-PER-EM SPACE (mid space)
6/MSP \x{2006} SIX-PER-EM SPACE (in computer typography sometimes equated to thin space)
FSP \x{2007} FIGURE SPACE (space equal to tabular width of a font; this is equivalent to the digit width of fonts with fixed-width digits)
PSP \x{2008} PUNCTUATION SPACE (space equal to narrow punctuation of a font)
THSP \x{2009} THIN SPACE (a fifth of an em (or sometimes a sixth))
HSP \x{200A} HAIR SPACE (thinner than a thin space; in traditional typography, the thinnest space available)
LS \x{2028} LINE SEPARATOR
PS \x{2029} PARAGRAPH SEPARATOR
NNBSP \x{202F} NARROW NO-BREAK SPACE (commonly abbreviated NNBSP; a narrow form of a no-break space, typically the width of a thin space or a mid space)
MMSP \x{205F} MEDIUM MATHEMATICAL SPACE (abbreviated MMSP; four-eighteenths of an em)
IDSP \x{3000} IDEOGRAPHIC SPACE

as \s is so inclusive in what it matches I tend to use [\x20\t] to match either space or tab and [^\x20\t] to match anything that is not a space or tab.

Using Marking or the Find and F3 thing experiment with \s+, [^\s]+, [\x20\t], and [^\x20\t] to see what happens with your data.

\d matches the decimal digits [0-9]. It’s not quite as deadly as \s which includes the end of line marks as \d only matches about 100 other characters:

²³
¹
٠١٢٣
٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९
০১২৩৪৫৬৭৮৯
੦੧੨੩੪੫੬੭੮੯
૦૧૨૩૪૫૬૭૮૯
୦
୧୨୩୪୫୬୭୮୯
0௧௨௩௪௫௬௭௮௯
౦౧౨౩౪౫౬౭౮౯
೦೧೨೩೪೫೬೭೮೯
൦൧൨൩൪൫൬൭൮൯
๐๑๒๓๔๕๖๗๘๙
໐໑໒໓໔໕໖໗໘໙
༠༡༢༣༤༥༦༧༨
༩
၀၁၂၃၄၅၆၇၈၉
႐႑႒႓႔႕႖႗႘႙
០១២៣៤៥៦៧៨៩
᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙
᥆᥇᥈᥉᥊᥋᥌᥍
᥎᥏
᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙
᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙
᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹
᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉
᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙
꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩
꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙
꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉
꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙
０１２３４５６７８９

For your data you can safely experiment with [0-9]+, [^0-9]+, \d+, and [^\d]+ though should discover that [^0-9]+ and [^\d]+ also match end of line marks.

To make [^0-9]+ and [^\d]+ less inclusive you could experiment with [^0-9\r\n]+ and [^\d\r\n]+ or even [^0-9\s]+ and [^\d\s]+.

While \R matches end of line marks you can’t use it within [ ] blocks. Experiment with [\R] or [^\R] to discover that it matches or or does not match just the letter R. The reason for this exception is that \R is handled like (\r\n|\r|\n). While \R matches \r\n it does not match \n\r in that order though it will match the \n and then matches the \r. That little business of matching the two character \r\n but not \n\r is why \R can’t be used within [ ] blocks.

Coises

@mkupper said in Notepad++ Searching by columns for non alpha characters:

The A-Z, a-z alphabetic, and 0-9 characters are all contiguous within themselves in ASCII, ANSI, and Unicode. Maybe you are thinking of EBCDIC?

I’m not sure what was on my mind. I haven’t used EBCDIC since 1991, but… I should have checked my memory, in any case. Thank you for catching that.

Ray Naylor

@Coises
Thank you soo much! Works perfectly now.