How to change a Semicolon-Separated Values list in a well-formatted table, with fixed length columns

guy038

Hi All,

Recently, consulting some documents, from the Unicode Consortium, at the address below :

http://www.unicode.org/versions/components-8.0.0.html

I downloaded, in N++, the contents of the UnicodeData.txt file, from the link below :

http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt

and I wanted to improve the displaying of that list, by creating a table, with fixed length columns !

Of course, using EXCEL, I could have created a file, with the semi-colon character as a list separator. However, I preferred to get it modified, directly, in Notepad++ !

This file contains 29215 lines, with 15 fields, empty or not, and is, only, 1670 Kb length. So, It should be easily modified, on any N++ configuration !!

So, in this discussion, I will explain the way I used to get, from the original list L, partially shown, below :

••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
01EF;LATIN SMALL LETTER EZH WITH CARON;Ll;0;L;0292 030C;;;;N;LATIN SMALL LETTER YOGH HACEK;;01EE;;01EE
01F0;LATIN SMALL LETTER J WITH CARON;Ll;0;L;006A 030C;;;;N;LATIN SMALL LETTER J HACEK;;;;
01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••

The partially final table T, with fifteen columns, below : ( I just added a rule, to easily see the number of the different columns )

•-- 01 --•--------------------------------------- 02 ------------------------------------------• 03 •- 04 •- 05 •------------------------------------------------- 06 -------------------------------------------------• 07• 08•----- 09 ------• 10•---------------------------- 11 -------------------------•12•-- 13 -•-- 14 -•-- 15 -•
| 01EF   | LATIN SMALL LETTER EZH WITH CARON                                                   | Ll | 0   | L   | 0292 030C                                                                                            |   |   |               | N | LATIN SMALL LETTER YOGH HACEK                           |  | 01EE  |       | 01EE  |
| 01F0   | LATIN SMALL LETTER J WITH CARON                                                     | Ll | 0   | L   | 006A 030C                                                                                            |   |   |               | N | LATIN SMALL LETTER J HACEK                              |  |       |       |       |
| 01F1   | LATIN CAPITAL LETTER DZ                                                             | Lu | 0   | L   | <compat> 0044 005A                                                                                   |   |   |               | N |                                                         |  |       | 01F3  | 01F2  |
| 01F2   | LATIN CAPITAL LETTER D WITH SMALL LETTER Z                                          | Lt | 0   | L   | <compat> 0044 007A                                                                                   |   |   |               | N |                                                         |  | 01F1  | 01F3  | 01F2  |
| 01F3   | LATIN SMALL LETTER DZ                                                               | Ll | 0   | L   | <compat> 0064 007A                                                                                   |   |   |               | N |                                                         |  | 01F1  |       | 01F2  |
•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••

It could be helpful to someone, anyway !

REMARK :

Of course, for simple and/or small lists, you’d rather avoid this method ! But the general method, described below, can manage any kind of list. Just change the original semi-colon separator and the final pipe delimiter by your own ones ! ( They may need to be escaped, in regexes )

The main steps, of this guideline, are :

A) Add a semi-colon separator, both, at the beginning and at the end of each line of the original list

B) Find the greatest amount of characters ( M), between two consecutive ; delimiters, of the original list

C) Find, for each column, of the original list, its maximum number of text characters

D) Perform the main regex S/R to change this original list, of N delimited fields, into a table, of N fixed length columns, of M characters wide.

E) Delete, with regexes, all the surplus spaces, of each column, beginning with the last one, in order to adjust the column’s width to the text of each column and get a final table, well formatted !

IMPORTANT :

A quicker approach would consist to :

Execute the points A) and B)
Directly, execute the point D) ( main regex S/R ), skipping the point C)
Modify, manually, the table, with rectangular selections, on all the lines, in order to delete pure blank areas, and get the final well-formatted table , skipping the point E)

Concerning the original list, I supposed some hypotheses true :

All the lines, of the list, contain the same number of fields
Each field is separated from the next one, with a semi-colon ( ; )
All lines, of the list, begin at column 1, WITHOUT any leading indentation
The ; delimiter must begin and end, all the lines of the list

and, concerning Notepad++ :

That original list will be copied, in a new tab, from the address below :

http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt

The Wrap feature ( Menu View - Wrap ) must, preferably, NOT be checked

A)

First, and to satisfy my last hypothesis, we’ll add a semi-colon separator, both, at the beginning and at the end of each line of the list, in order that text of the first and the last column is surrounded by two delimiters |, as the other columns, in the table

SEARCH .+

REPLACE ;$0;

Not too hard to start with, isn’t it ?

B)

Then, we must find out the greatest amount of characters ( M), between two consecutive; delimiters. Generally, when the total rows of a list is not important, it’s easy to guess it, but, we’d better use a method, for a bigger list. So :

Open the Find dialog ( CTRL + F )
Type the regex ;\K[^;]{V,}(?=;) in the Find what zone
Replace the variable V by the number 1
Click on the Count button
As long as the number of occurrences found is > 0, increase the variable V by 1 and click, again, on the Count button
As soon as NO occurrence can be found, this MAXIMUM M is the value V+1 ( that is to say V-1 characters, surrounded by two spaces ! )

In our example, this maximum M is 102 ( 100 char. + 2 spaces ), which occurs in the 6th column of the 16017th row !

Therefore, our future table, will be built, first, with 15 columns, equally wide of 102 positions, between the two | separators !

C)

Next, we have to find out, for each column C, of the original list, the maximum number of characters, that should be kept, in order to get all the text, of each column, correctly located and surrounded with two spaces, in the future table !

As in point A), we’ll use a regex to determine these values :

Open the *Find dialog ( CTRL + F )
Type the regex ^;(.*?;){C-1}+\K[^;]{N,}(?=;) in the Find what zone
Replace the expression C-1 by the appropriate number
Change the variable N by the number 1
Click on the Count button
As long as the number of occurrences found is > 0, increase the variable N by 1 and click, again, on the Count button
As soon as NO occurrence can be found, then, the maximum number of text characters to keep ( including the leading space ), for the column C, is the current value N

Once you did this task, for the 15 columns of the list, in reverse order, beginning with the 15th column till the 1st one, you get the results below :

Column C                 Regex                  Number N
             
  15          ^;(.*?;){14}+\K[^;]{N,}(?=;)         6
  14          ^;(.*?;){13}+\K[^;]{N,}(?=;)         6
  13          ^;(.*?;){12}+\K[^;]{N,}(?=;)         6
  12          ••••••••••••••••••••••••••••         1
  11          ^;(.*?;){10}+\K[^;]{N,}(?=;)        56
  10          ^;(.*?;){9}+\K[^;]{N,}(?=;)          2
   9          ^;(.*?;){8}+\K[^;]{N,}(?=;)         14
   8          ^;(.*?;){7}+\K[^;]{N,}(?=;)          2
   7          ^;(.*?;){6}+\K[^;]{N,}(?=;)          2
   6          ••••••••••••••••••••••••••••       ••••
   5          ^;(.*?;){4}+\K[^;]{N,}(?=;)          4
   4          ^;(.*?;){3}+\K[^;]{N,}(?=;)          4
   3          ^;(.*?;){2}+\K[^;]{N,}(?=;)          3
   2          ^;(.*?;){1}+\K[^;]{N,}(?=;)         84
   1          ^;(.*?;){0}+\K[^;]{N,}(?=;)          7

Notes : :

The 12th column is totally empty. So, the leading space, only, has to be kept, for that column => N = 1. Then, this column will be displayed | |, with two spaces, between the two delimiters, in the future table
As found, in point A), some text, of the 6th column, is the longest of the list ( 100 characters ). Then, NO surplus space can be deleted, for that column, in the future table

D)

Now, we’re going to perform the main regex S/R, which will change this original list, with a semi-colon delimiter, into a table, with fixed columns, of 102 characters wide ( See point A) , above ), with aPIPE delimiter ( | )

Go back, to the very beginning of the list
Open the Replace dialog ( CTRL + H )

SEARCH ^\K(;)|.*?\K(;)(?=(.)?)|[^|\r\n]{M}\K * ( The variable M will be replaced its value 102 )

REPLACE (?1| )(?2 |(?3 )) ( There are 102 spaces, between ?2 and the | character ), that is to say :

(?1| )(?2                                           |(?3 ))

Click TWICE, on the Replace All button ( NOT on the Replace button )

=> After the first S/R, it adds 102 spaces at the end of all the columns

=> After the second S/R, it deletes any surplus spaces, after the 102th character of each column

So, we, now, get a table, partially shown ( with ONLY columns 1 and 2 complete ), as below :

•------------------------------------------------- 01 -------------------------------------------------•------------------------------------------------- 02 -------------------------------------------------•------ 03 --- •••
| 01EF                                                                                                 | LATIN SMALL LETTER EZH WITH CARON                                                                    | ••••••••••••••••
| 01F0                                                                                                 | LATIN SMALL LETTER J WITH CARON                                                                      | ••••••••••••••••
| 01F1                                                                                                 | LATIN CAPITAL LETTER DZ                                                                              | ••••••••••••••••
| 01F2                                                                                                 | LATIN CAPITAL LETTER D WITH SMALL LETTER Z                                                           | ••••••••••••••••
| 01F3                                                                                                 | LATIN SMALL LETTER DZ                                                                                | ••••••••••••••••

I just added a rule, to easily identify, the number of the different columns, of this table

Obviously, the horizontal length of the list is too important, especially because of some empty columns changed with 102 spaces ! So, we need to delete all the surplus space characters, at the end of these different columns.

E)

Finally, to delete all the surplus spaces, of each column C ( except for the column 6 ), use the general regex, below :

Go back, to the very beginning of the table
Open the Replace dialog ( CTRL + H )

SEARCH ^.(?:.{M+1}){C-1}.{N}\K +(?= \|)

REPLACE NOTHING

Click, ONCE, on the Replace All button

Remaining of the value M = 102 and of the appropriate value N, for each column ( See point C), above ) and that you must process the columns, in reverse order, beginning with the 15th column, till the 1st one, you’ll find the different regexes to run, one after another, below :

Column C      Number N                Regexes to RUN
             
  15             6            ^.(?:.{103}){14}.{6}\K +(?= \|)
  14             6            ^.(?:.{103}){13}.{6}\K +(?= \|)
  13             6            ^.(?:.{103}){12}.{6}\K +(?= \|)
  12             1            ^.(?:.{103}){11}.{1}\K +(?= \|)
  11            56            ^.(?:.{103}){10}.{56}\K +(?= \|)
  10             2            ^.(?:.{103}){9}.{2}\K +(?= \|)
   9            14            ^.(?:.{103}){8}.{14}\K +(?= \|)
   8             2            ^.(?:.{103}){7}.{2}\K +(?= \|)
   7             2            ^.(?:.{103}){6}.{2}\K +(?= \|)
   6           ••••           ••••••••••••••••••••••••••••••••
   5             4            ^.(?:.{103}){4}.{4}\K +(?= \|)
   4             4            ^.(?:.{103}){3}.{4}\K +(?= \|)
   3             3            ^.(?:.{103}){2}.{3}\K +(?= \|)
   2            84            ^.(?:.{103}){1}.{84}\K +(?= \|)
   1             7            ^.(?:.{103}){0}.{7}\K +(?= \|)

Et voilà !

After these 14 S/R, based on the same regex, the UNICODE table T, like above, should be well formatted :-)

NOTE :

In this final table, to see the greatest amount(s) of text, for a given column C :

Go back to the very beginning of this table
Type the regex ^\|(?:.+?\|){C-1}+\K[^|]*[^ |] (?=\|) in the Find what zone
Replace the expression C-1, by the appropriate number
Click on the Find Next button

Particularly, run the SEARCH regexes, below, with values 1, 5 , 8 and 10, for C-1, to know the maximum of text, in columns 2, 6, 9 and 11 !

Best Regards

guy038

P.S. :

BTW, it’s very simple to get back to the original list, with ; separated fields, from any table, with aligned columns, with the | delimiter :

SEARCH \h*\|\h?

or, simply, if NO tabulation character exists :

SEARCH *\| ? ( with a space, both, before the star and the interrogation mark )

and

REPLACE ;