Regex to replace XML settings file tags to INI-like config param and value
-
Hello everyone,
Could you please help me with the following search-and-replace problem I am having?
I am taking XML-based configuration files and converting sections of them to something that resembles an INI file data structure.
I found an online converter, but I would like to stay within my Notepad++ workspace, as I select text and process it as I go.
Here is the data I currently have (“before” data):
    <time_date>
        <date_format>DD/MM/YY</date_format>
        <hr24_clock>1</hr24_clock>
        <ntp_dhcp_option>0</ntp_dhcp_option>
        <ntp_server>1</ntp_server>
        <ntp_server_addr>time1.google.com</ntp_server_addr>
        <ntp_server_update_interval>1000</ntp_server_update_interval>
        <timezone_dhcp_option>0</timezone_dhcp_option>
        <selected_timezone>America/New_York</selected_timezone>
    </time_date>
Here is how I would like that data to look (“after” data):
time_date.date_format = DD/MM/YY
time_date.hr24_clock = 1
time_date.ntp_dhcp_option = 0
time_date.ntp_server = 1
time_date.ntp_server_addr = time1.google.com
time_date.ntp_server_update_interval = 1000
time_date.timezone_dhcp_option = 0
time_date.selected_timezone = America/New_York
KEY POINTS:
● The XML section becomes front loaded on the param.
● There must be a space before, and after, the equals symbol.
Thank you!
-
Hello, @gamophyte and All,
I suppose that the following regex S/R should just meet your goal !
So, starting with the text, below, where I’ve, intentionally, added / deleted empty lines :
<time_date> <date_format>DD/MM/YY</date_format> <hr24_clock>1</hr24_clock> <ntp_dhcp_option>0</ntp_dhcp_option> <ntp_server>1</ntp_server> <ntp_server_addr>time1.google.com</ntp_server_addr> <ntp_server_update_interval>1000</ntp_server_update_interval> <timezone_dhcp_option>0</timezone_dhcp_option> <selected_timezone>America/New_York</selected_timezone> </time_date> <test_one> <date_format>DD/MM/YY</date_format> <hr24_clock>1</hr24_clock> <ntp_dhcp_option>0</ntp_dhcp_option> <ntp_server>1</ntp_server> <ntp_server_addr>time1.google.com</ntp_server_addr> <ntp_server_update_interval>1000</ntp_server_update_interval> <timezone_dhcp_option>0</timezone_dhcp_option> <selected_timezone>America/New_York</selected_timezone> </test_one> <test_two> <date_format>DD/MM/YY</date_format> <hr24_clock>1</hr24_clock> <ntp_dhcp_option>0</ntp_dhcp_option> <ntp_server>1</ntp_server> <ntp_server_addr>time1.google.com</ntp_server_addr> <ntp_server_update_interval>1000</ntp_server_update_interval> <timezone_dhcp_option>0</timezone_dhcp_option> <selected_timezone>America/New_York</selected_timezone> </test_two>
- Open your “source” file in N++, or, preferably, the text above in a new tab, for testing !
- Open the Replace dialog ( Ctrl + H )
- Uncheck all box options
- Check the Wrap around option
- Select the Regular expression search mode
SEARCH
(?-is)^\h+<(.+)>(.+)</.+>\h*\R+(?=(?s:.+?)^\h+</(.+)>$)|^\h+<(/)?.+>\h*\R+
REPLACE
(?1$3.$1 = $2\r\n:(?4\r\n:)
- Click once only on the Replace All button
=> You should get your expected OUTPUT text :
time_date.date_format = DD/MM/YY time_date.hr24_clock = 1 time_date.ntp_dhcp_option = 0 time_date.ntp_server = 1 time_date.ntp_server_addr = time1.google.com time_date.ntp_server_update_interval = 1000 time_date.timezone_dhcp_option = 0 test_one.selected_timezone = America/New_York test_one.date_format = DD/MM/YY test_one.hr24_clock = 1 test_one.ntp_dhcp_option = 0 test_one.ntp_server = 1 test_one.ntp_server_addr = time1.google.com test_one.ntp_server_update_interval = 1000 test_one.timezone_dhcp_option = 0 test_two.selected_timezone = America/New_York test_two.date_format = DD/MM/YY test_two.hr24_clock = 1 test_two.ntp_dhcp_option = 0 test_two.ntp_server = 1 test_two.ntp_server_addr = time1.google.com test_two.ntp_server_update_interval = 1000 test_two.timezone_dhcp_option = 0
Notes :
- Within the OUTPUT text, the first part before the dot is based on each corresponding </....> closing tag, within the INPUT text
- Empty lines, between two items of a section, are simply ignored
- Additional empty lines or missing empty line separator, between two sections, are normalized to a single empty line
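For anyone who wants to cross-check the result outside N++, here is a minimal Python sketch of the same mapping. It is not a port of the regex above: it is a plain line-by-line pass that labels each item with the section enclosing it. The function name `xml_sections_to_ini` and the two helper patterns are hypothetical, and the sketch assumes one tag per line, with items written as `<name>value</name>`.

```python
import re

# <name>value</name> on one line; \1 forces the closing tag to match the opening one
ITEM = re.compile(r"^\s*<([^<>/]+)>(.*)</\1>\s*$")
# <section> or </section> alone on a line
TAG = re.compile(r"^\s*<(/?)([^<>/]+)>\s*$")


def xml_sections_to_ini(text):
    """Turn <section> blocks of <name>value</name> lines into 'section.name = value'."""
    out = []
    section = None  # name of the section we are currently inside, if any
    for line in text.splitlines():
        item = ITEM.match(line)
        if item:
            name, value = item.groups()
            prefix = section + "." if section else ""
            out.append(f"{prefix}{name} = {value}")
            continue
        tag = TAG.match(line)
        if tag:
            closing, name = tag.groups()
            if closing:
                section = None
                out.append("")  # one empty line after each section
            else:
                section = name
        # empty lines and anything unrecognized are ignored
    return "\n".join(out).rstrip("\n")


sample = """
    <time_date>
        <date_format>DD/MM/YY</date_format>

        <hr24_clock>1</hr24_clock>
    </time_date>
    <test_one>
        <ntp_server>1</ntp_server>
    </test_one>
"""
print(xml_sections_to_ini(sample))
```

Items found outside any section are written without a prefix, which is an assumption about what a flat setting should look like.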
Best Regards,
guy038
-
-
@guy038 AMAZING!
However I realize I can’t use this wholesale on the whole document. I’m finding that some settings aren’t within a section after all.
In those cases, it appears to grab a value from a neighbor and uses that as the section name.
It’s no issue just to select each section manually, and so I will just need another Regex for when the section is flat.
Like just below in our running example:
    <date_format>DD/MM/YY</date_format>
    <hr24_clock>1</hr24_clock>
    <ntp_dhcp_option>0</ntp_dhcp_option>
    <ntp_server>1</ntp_server>
    <ntp_server_addr>time1.google.com</ntp_server_addr>
    <ntp_server_update_interval>1000</ntp_server_update_interval>
    <timezone_dhcp_option>0</timezone_dhcp_option>
    <selected_timezone>America/New_York</selected_timezone>
-
Hi, @gamophyte and All,
Ah… OK. So, here is, below, a regex which just partially changes the contents of each section, as well as any tag-value pair outside a section !
So, given this new INPUT text, below :
    <time_date>
        <date_format>DD/MM/YY</date_format>
        <hr24_clock>1</hr24_clock>
        <ntp_dhcp_option>0</ntp_dhcp_option>
        <ntp_server>1</ntp_server>
        <ntp_server_addr>time1.google.com</ntp_server_addr>
        <ntp_server_update_interval>1000</ntp_server_update_interval>
        <timezone_dhcp_option>0</timezone_dhcp_option>
        <selected_timezone>America/New_York</selected_timezone>
    </time_date>
    <test_one>
        <date_format>DD/MM/YY</date_format>
        <hr24_clock>1</hr24_clock>
        <ntp_dhcp_option>0</ntp_dhcp_option>
        <ntp_server>1</ntp_server>
        <ntp_server_addr>time1.google.com</ntp_server_addr>
        <ntp_server_update_interval>1000</ntp_server_update_interval>
        <timezone_dhcp_option>0</timezone_dhcp_option>
        <selected_timezone>America/New_York</selected_timezone>
    </test_one>
    <date_format>DD/MM/YY</date_format>
    <hr24_clock>1</hr24_clock>
    <ntp_dhcp_option>0</ntp_dhcp_option>
    <ntp_server>1</ntp_server>
    <test_two>
        <date_format>DD/MM/YY</date_format>
        <hr24_clock>1</hr24_clock>
        <ntp_dhcp_option>0</ntp_dhcp_option>
        <ntp_server>1</ntp_server>
        <ntp_server_addr>time1.google.com</ntp_server_addr>
        <ntp_server_update_interval>1000</ntp_server_update_interval>
        <timezone_dhcp_option>0</timezone_dhcp_option>
        <selected_timezone>America/New_York</selected_timezone>
    </test_two>
    <ntp_server_addr>time1.google.com</ntp_server_addr>
    <ntp_server_update_interval>1000</ntp_server_update_interval>
    <timezone_dhcp_option>0</timezone_dhcp_option>
    <selected_timezone>America/New_York</selected_timezone>
Then, the following regex S/R :
SEARCH
(?-s)<(.+)>(.+)</\1>\h*
REPLACE
.$1 = $2
Would leave you with this OUTPUT text :
    <time_date>
        .date_format = DD/MM/YY
        .hr24_clock = 1
        .ntp_dhcp_option = 0
        .ntp_server = 1
        .ntp_server_addr = time1.google.com
        .ntp_server_update_interval = 1000
        .timezone_dhcp_option = 0
        .selected_timezone = America/New_York
    </time_date>
    <test_one>
        .date_format = DD/MM/YY
        .hr24_clock = 1
        .ntp_dhcp_option = 0
        .ntp_server = 1
        .ntp_server_addr = time1.google.com
        .ntp_server_update_interval = 1000
        .timezone_dhcp_option = 0
        .selected_timezone = America/New_York
    </test_one>
    .date_format = DD/MM/YY
    .hr24_clock = 1
    .ntp_dhcp_option = 0
    .ntp_server = 1
    <test_two>
        .date_format = DD/MM/YY
        .hr24_clock = 1
        .ntp_dhcp_option = 0
        .ntp_server = 1
        .ntp_server_addr = time1.google.com
        .ntp_server_update_interval = 1000
        .timezone_dhcp_option = 0
        .selected_timezone = America/New_York
    </test_two>
    .ntp_server_addr = time1.google.com
    .ntp_server_update_interval = 1000
    .timezone_dhcp_option = 0
    .selected_timezone = America/New_York
Do you find this kind of regex more useful for you ?
Of course, you’ll need to use a column-mode selection to add section names, right before the dot characters !
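To try this simpler S/R outside N++, a rough Python equivalent could look like the sketch below. Python's `re` module has no `\h`, so `[ \t]*` stands in for it, and `.` already stops at line breaks by default, which plays the role of `(?-s)`. The function name `dot_prefix_items` is hypothetical.

```python
import re

# Same idea as the SEARCH regex: <name>value</name>, with \1 requiring that
# the closing tag repeats the opening tag name, plus optional trailing blanks.
PAIR = re.compile(r"<(.+)>(.+)</\1>[ \t]*")


def dot_prefix_items(text):
    """Rewrite every <name>value</name> pair as '.name = value'."""
    return PAIR.sub(r".\1 = \2", text)


sample = (
    "<time_date>\n"
    "    <hr24_clock>1</hr24_clock>\n"
    "</time_date>\n"
    "<ntp_server>1</ntp_server>\n"
)
print(dot_prefix_items(sample))
```

Section lines such as `<time_date>` and `</time_date>` have no closing tag on the same line, so they are left untouched, as in the OUTPUT text above.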
BR
guy038
-
This takes me far and beyond and away further than I was before, thanks!!
Because now I can use that front-loaded dot as an anchor to do much more: clean it up if there is no container (section) encapsulating it, and, if there is one, S&R to put the section name on the front, even if it’s done one selection at a time.
You’ve taken it far enough, thanks for your work!!
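The follow-up pass described here, using the leading dot as an anchor, could be sketched in Python as below. This is only an illustration of the idea, not something posted in the thread: the function name `resolve_dot_lines` is hypothetical, and the sketch assumes the text already has the `.name = value` form, with section tags alone on their lines.

```python
import re

OPEN = re.compile(r"^\s*<([^<>/]+)>\s*$")    # <section>
CLOSE = re.compile(r"^\s*</([^<>/]+)>\s*$")  # </section>
DOT_ITEM = re.compile(r"^\s*\.(.+)$")        # .name = value


def resolve_dot_lines(text):
    """Prefix dotted lines with their section name, or drop the dot if there is none."""
    out = []
    section = None
    for line in text.splitlines():
        opening = OPEN.match(line)
        if opening:
            section = opening.group(1)  # remember the section, drop the tag line
            continue
        if CLOSE.match(line):
            section = None              # leaving the section, drop the tag line
            continue
        dotted = DOT_ITEM.match(line)
        if dotted:
            rest = dotted.group(1)
            out.append(f"{section}.{rest}" if section else rest)
        else:
            out.append(line)            # keep any other line unchanged
    return "\n".join(out)


sample = (
    "<time_date>\n"
    "    .hr24_clock = 1\n"
    "</time_date>\n"
    ".ntp_server = 1"
)
print(resolve_dot_lines(sample))
```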