Filter using columns?

ZNC Gatilho

Good morning. I marked in red in the image the columns that interest me in this document. This is the list of all chess players registered with FIDE, which is freely available to the public at this address (I can’t post the link because human beings love to complicate simple things!.) I would like to know if notepad++ can filter only what interests me using the columns? For example: I would like it to show me only players with standard ratings from 2100 to 2900, the column responsible for ratings is (SRtng) marked in red, this would be my first player filter, then I would like to filter only players titled, the column responsible for this is (Tit) marked in red in the document. fide provides the lists in xml and txt, I downloaded the txt, I downloaded the full version which is over 200mb. I don’t know if xml would be better for what I want to do, I don’t understand any of these things. Trying to filter by arm would be unfeasible, the list has almost a million names. Thank you for your attention. PS: I want to inform you that I don’t work with this, the list will be for my personal use, later I will create a database to study the matches of the players that interest me. I don’t understand anything about regular expressions, so any help, if it’s possible to do what I want, will have to be for someone who is completely new to the subject. alt text

Alan Kilborn

@ZNC-Gatilho said in Filter using columns?:

I can’t post the link because human beings love to complicate simple things!

No, you can’t post the link because human beings do DUMB things. The fact that you can’t post a link is a countermeasure to these dumb things.

Alan Kilborn

What you’re asking for is a rather ambitious ask for Notepad++ and/or the people here. What you have is a “searching” question, and thus you should read and follow the instructions HERE to give people interested in helping you a better starting point.

PeterJones

@ZNC-Gatilho said in Filter using columns?:

This is the list of all chess players registered with FIDE, which is freely available to the public at this address (I can’t post the link because human beings love to complicate simple things!.)

And the “human beings” who complicate things are the spammers who have inflicted non-stop AI/bot-based posts on our small community. Preventing accounts with 0 reputation from posting links has dramatically reduced the amount of spam we have to deal with. The best way to overcome that is to post one well-asked question, with textual data that’s easy to copy/paste, following the FAQ that @Alan-Kilborn linked you to before I finished writing this post: one such good post is usually enough to garner the single upvote necessary.

And since the link is irrelevant to the question (unless someone wanted to download a 200mb file, just to help you), it’s not overly important to the discussion.

However, you could have pasted about about a screen’s worth of text from the textfile – ie, the one screenshot worth of data from your image, but as text that we could copy/paste – using the </> button found in the create-a-post toolbar, as described in our formatting posts FAQ

… which would have made it much easier for people to help you, without having to download a 200mb file.

I would like to know if notepad++ can filter only what interests me using the columns?

Any filter done in Notepad++ would be destructive – ie, removing data from the file. Make sure you have a backup before you try anything suggested in response to your query.

For example: I would like it to show me only players with standard ratings from 2100 to 2900,

Yeah, well, text editors aren’t great at numerical comparisons. That’s not what they’re designed for. The file you have is a text report from some database application, which isn’t the best for filtering after being generated.

What would be best for you (though it wouldn’t involve Notepad++) is for FIDE to give you access to an API for the database itself, rather than just letting you look at reports. Second best (again, not using Notepad++) would be to parse the text file with some tool that would turn it back into a database on your end, and then use the tools with that database to do your extraction to create a new report with the filtered data that you want. Third best would be to import it into a spreadsheet – I know that Excel can take a textfile input, and then prompt you to manually insert the column divisions (and probably LibreOffice’s Calc can as well); once it’s in the spreadsheet, you can use the spreadsheet “filter” tool to do your filtering quite easily.

Destructive search-and-replace in Notepad++ would be fourth in line (at best… there might be other better options between spreadsheet and Notepad++ search-and-replace that I cannot think of)… but it’s the only one that we can help you with here.

I don’t know if xml would be better for what I want to do,

In pure Notepad++, I’m doubtful that it would make it any easier – definitely not if you end up having to use regex, because parsing XML with regex is no easier than parsing fixed-width text table with regex. With the XML Tools plugin or other XML processing plugin, it might be better, if you knew the XPath syntax for searching (assuming XML Tools has an XPath searcher… I don’t honestly remember right now).

I would think that a JSON file might be easier to search/replace than an XML file, but I cannot guarantee that. And one of the JSON plugins (JSON Viewer or JSON Tools) has a builtin filter/searching, which probably would be helpful. So if FIDE has a JSON version of the data, I might suggest that

PS: … I don’t understand anything about regular expressions, so any help, if it’s possible to do what I want, will have to be for someone who is completely new to the subject

If you are unwilling to learn, you are unlikely to find help here. If you are willing to learn, someone might be able to come up with a regex that will help. The way you’ve phrased this PS, I cannot tell if you are willing to learn or not.

In Notepad++ proper, the only way to come close to your goal would be to use regular expressions. Though since one of your filters is mathematical in nature, I will direct you to our Math replacement FAQ for an explanation of why that won’t be simple,

I’m betting that @Coises’s Columns++ plugin will be able to help you – unfortunately, it’s still in beta, so you’ll have to manually install it, since you cannot use Notepad++'s Plugins Admin to install the plugin. But that plugin has its own regex engine which might handle your columnar data better than a vanilla Notepad++ regex; and it has an added feature of mathematical notation (though I’m not sure if that only works in the replacement, or if it can also work for comparing a range of numbers in the search… @Coises will likely be around and will hopefully chime in).

Since I’m not a Columns++ expert, if I were interested in doing this, I’d pull out the PythonScript plugin (if I was forced to do it in Notepad++) or my favorite scripting language (if I were doing it solely for my own use, not for this Forum) and I would write the script in Notepad++, but wouldn’t have to have the 200mb file open in Notepad++, since that’s pretty slow for N++.

----

Useful References

----

Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

Mark Olson

I could talk about ways you could use various plugins (including plugins I’ve made) to approach this problem, but IMO Notepad++ is not a good tool for this. Sometimes I’ve tried using NPP plugins to analyze or manipulate really huge tabular datasets like this, and in the end I found it easier to use the most popular and well-known tools.

I personally would use Python and Jupyter Notebooks for this, and if you want any more details it has to be in chat because the details of a non-NPP solution are not pertinent to this forum.

ZNC Gatilho

I thank Peter and the others who responded. I was going to detail a few more things, but I give up! There’s so much blocking that it’s impossible, congratulations to Akismet. The internet has become a paradise where the good pay for the bad. Since I lack computer knowledge, but I don’t lack money, I will resolve it in other ways.

Coises

@PeterJones said in Filter using columns?:

Coises will likely be around and will hopefully chime in

Perhaps I’m missing something… but from the screenshots @ZNC-Gatilho included, this isn’t very hard. The text file appears to be in fixed-width columns. No need for any plugin; just make a rectangular selection around the column of interest, sort, and then delete everything except the range of interest. A numeric sort will work on the SRtng column and a lexicographic sort on the others.

Obviously, make a copy of the file first.

guy038

Hello, @znc-gatilho, @alan-kilborn, @peterjones, @mark-olson, @coises and All,

@znc-gatilho, could you confirm me that your three important zones are, by priority order :

Zone 'SRtng' : Between columns 114 and 118  ( 5 chars )
Zone 'Tit'   : Between columns  77 and  79  ( 3 chars )
Zone 'WTit'  : Between columns  82 and  85  ( 4 chars )

If so, let’s start with this INPUT text, containing the headers line 1 and 3 random records ( lines 626897, 626912 and 626917 )

ID Number      Name                                                 Fed Sex Tit  WTit OTit                   FOA SRtng SGm SK RRtng RGm Rk BRtng BGM BK B-day Flag
4109228        Kryukova, Irina                                      RUS F   WGN  WNM                             2232  0   20              2240  0   20 1968  wi
36016672       Kryvinski, Jenny                                     FRA F                                                                               2001  w
14102897       Kryvolapov, Kostiantyn                               UJR M   IM                                   2356  1   10              2327  0   20 1962

Open these four lines text in a new tab
Move to the very beginning of your file ( Ctrl + Home )
Open the Replace dialog ( Ctrl + H )
Un-tick all box options
Type in (?-s)^.{76}(...)..(....).{28}(.....).+ in the Find what : zone
Type in \3 \1 \2 $0 in the Replace with : zone
Select the Regular expression search mode
Click on the Replace All button

=> At once, you should get this temporary text :

SRtng Tit WTit ID Number      Name                                                 Fed Sex Tit  WTit OTit                   FOA SRtng SGm SK RRtng RGm Rk BRtng BGM BK B-day Flag
2232  WGN WNM  4109228        Kryukova, Irina                                      RUS F   WGN  WNM                             2232  0   20              2240  0   20 1968  wi
               36016672       Kryvinski, Jenny                                     FRA F                                                                               2001  w
2356  IM       14102897       Kryvolapov, Kostiantyn                               UJR M   IM                                   2356  1   10              2327  0   20 1962

Now, select the three lines, after the headers line
Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option

=> You should get your expected OUTPUT text :

SRtng Tit WTit ID Number      Name                                                 Fed Sex Tit  WTit OTit                   FOA SRtng SGm SK RRtng RGm Rk BRtng BGM BK B-day Flag
               36016672       Kryvinski, Jenny                                     FRA F                                                                               2001  w
2232  WGN WNM  4109228        Kryukova, Irina                                      RUS F   WGN  WNM                             2232  0   20              2240  0   20 1968  wi
2356  IM       14102897       Kryvolapov, Kostiantyn                               UJR M   IM                                   2356  1   10              2327  0   20 1962

As you can see your file is, automatically sorted by the three zones SRtng then by Tit then by WTit ascending

BTW, @Peterjones, I suppose that the SRtng zone is the ELO rating of chess players, which is ALWAYS a 4 number string for experienced players and grand masters ! . So, no trouble for the sort

Now, just repeat the steps above, with your players_list_FOA.txt file

Of course, do a backup of your players_list_FOA.txt, before any action

Beware that your file is really an huge file ( 216,785,860 bytes for 1,321,866 lines !! )

Thus, I cannot tell you how longer the search/replacement and the save operations will take till completion, when applied to THIS file !

However, you can give it a try !

Of course, in case that the zones do not belong to the mentioned ranges, just tell me the exact values in order to modify the search regex, accordingly :-)

Best Regards,

guy038