Join lines from clipboard data
-
So I’m not really sure if this is possible, but if someone knows a way it would make my life much easier.
Essentially, I have to copy lots of data from PDFs and write it out in JSON format. Unfortunately, copying text from a PDF introduces many unnecessary line breaks. My current workflow, which I’d like to cut down, goes like this:
Copy lines from PDF
Paste into separate Notepad++ file
Ctrl+J to join the lines
Re-select everything and cut
Paste the data where it needs to go in the JSON
Ultimately this process is very tedious. What I’d like to do is copy the lines I want from the PDF, join the lines in the clipboard data, and paste the result into the JSON in one go. Is this possible at all? Thank you very much in advance.
-
If you’re willing to install and use the PythonScript plugin for Notepad++, then a shorter workflow is possible. That workflow would be:
- Copy lines from PDF
- Move caret to insertion point in the JSON file in its Notepad++ tab
- Run the script (could be tied to a key-combination for easy-running)
The script would look like this:
# -*- coding: utf-8 -*-
from Npp import notepad, editor
try:
    editor3h  # third editor, hidden
except NameError:
    editor3h = notepad.createScintilla()
if editor.canPaste() and editor3h.canPaste():
    editor3h.selectAll()
    editor3h.paste()
    editor3h.rereplace('\r\n', ' ')
    editor3h.selectAll()
    editor3h.copy()
    editor.paste()
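To try the join step outside Notepad++, here is a minimal sketch in plain Python. It is not part of the script above: `join_lines` is a hypothetical helper name, and it uses Python's standard `re` module rather than the editor's regex engine, so it only mirrors what `rereplace('\r\n', ' ')` does to the pasted text. Matching a lone CR or LF as well is an added assumption, in case the clipboard text does not use Windows line endings.

```python
import re

def join_lines(text):
    """Replace every line break (CRLF, lone CR or lone LF) with one space."""
    return re.sub(r'\r\n|\r|\n', ' ', text)

# Sample text standing in for what a PDF copy might put on the clipboard.
pdf_text = "first part of\r\na sentence that\r\nwas split"
joined = join_lines(pdf_text)
print(joined)
```

Putting `\r\n` first in the alternation matters: a CRLF pair is then replaced by one space instead of two.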
-
@Jonathan-Wendt said in Join lines from clipboard data:
Copy lines from PDF
Paste into separate Notepad++ file
Ctrl+J to join the lines
Re-select everything and cut
Paste the data where it needs to go in the JSON
Why not just paste it where you want it, and do the join there? That cuts out 2 of the 5 steps.
But it can be done with a script using a plugin like PythonScript… which I was working on when @Alan-Kilborn posted his. I guess I was too slow.
-
@Alan-Kilborn This was EXACTLY what I was looking for! Thank you so much.
-
Hello, @alan-kilborn and All,
I was intrigued by the statement:
editor3h # third editor, hidden
Do you mean that editor3h is an object which represents a virtual document, where you can perform any manipulation, independently from editor1, representing the main view document, and editor2, representing the secondary view document?
Now, I have no problem with the commands below, because it’s obvious that, after an S/R, wherever this data is located, you need to select the modified results, copy them to the clipboard and paste them somewhere (at the insertion point of the JSON file in this specific example):
editor3h.rereplace('\r\n', ' ')
editor3h.selectAll()
editor3h.copy()
editor.paste()
But, as you said that the first step was to copy the PDF file contents (…to the clipboard), I don’t see the purpose of the first editor3h.selectAll() command. To my mind, the editor3h.paste() command, which would copy the PDF contents of the clipboard to the virtual editor document, should be enough!?
From a Python beginner ;-))
Best Regards,
guy038
-
@guy038 said in Join lines from clipboard data:
Do you mean that editor3h is an object which represents a virtual document, where you can perform any manipulation, independently from editor1 representing the main view document and editor2 representing the secondary view document ?
Short answer to long question:
Yes. :-)
I don’t see the purpose of the first editor3h.selectAll() command
To my mind, the editor3h.paste() command, … should be enough !?
You are probably only thinking about the first time the script is run.
On the first run, editor3h does not exist, so it is created.
At that point there is no text in its document, so the select-all effectively does nothing.
On the second and later runs of the script, editor3h already exists; we don’t need to have N++ go through the overhead of creating a new one each time, we can just reuse the old one.
But in this case, there is already text in the document (from the previous run).
Selecting all of the text and then doing a paste effectively causes editor3h to then contain only the data from the paste.
I could have done it in other ways as well, e.g., editor3h.setText("").
BTW, I could have dispensed with the editor3h technique altogether and just pasted into the user’s current document and manipulated from there. There are a lot of ways to do things (TMTOWTDI). But that way tends to get a bit messier, because when you do an editor.paste() and need to manipulate what you pasted, the size of what you pasted (which tells you where to find the data in the doc) is not provided automatically and takes some effort to calculate. Now if text remained selected upon pasting, it would be easy.
-
Hi, @Alan-kilborn and All,
My God ! Of course, I didn’t think about successive runs ! And it’s quite logical :
- The editor3h.selectAll() command first selects all contents of editor3h, if any
- The editor3h.paste() command replaces this possible selection with the clipboard’s contents, pasting them into editor3h
Thanks, also, for your additional information!
BR
guy038
-
-
:-)
Perhaps this editor3h technique could help you with the regex stuff you do. A short script is probably easier to work with than macros (for those cases where you are building up several regex replacements in a row). The above script seems to contain all the examples you’d need.
One thing to be careful of when using Python strings to contain regexes: sometimes you want “raw” strings, and sometimes plain strings. I used a plain string above, with '\r\n', which encodes a CR and a LF (because the backslash is an “escape”). Often with regexes, though, you want backslashes to actually be themselves, in which case you write your string with an r out front, e.g. r'...' or r"...".
These should be equivalent, but which one is easier to read when you’d rather be thinking about regex content than about proper backslashing?
myregex1 = "\\(.*?\\)"  # find literal ( followed by some chars followed by literal )
myregex2 = r"\(.*?\)"  # same, but easier to read and think about
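The equivalence of the two spellings can be checked in any Python interpreter. The sketch below uses Python's standard `re` module as a stand-in for the regex engine that PythonScript hands patterns to, and the sample text `f(x) + g(y)` is made up for illustration.

```python
import re

myregex1 = "\\(.*?\\)"   # plain string: each backslash doubled
myregex2 = r"\(.*?\)"    # raw string: backslashes typed once

# Both literals produce exactly the same sequence of characters.
same_pattern = (myregex1 == myregex2)

# Either one finds the parenthesised groups in the sample text.
matches = re.findall(myregex2, "f(x) + g(y)")
print(same_pattern)
print(matches)
```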
-
I used a plain string above, with ‘\r\n’, which encodes a CR and a LF (because the backslash is an “escape”)
Probably that was a bad example, to illustrate my point about raw strings versus not, in Python. Why?
Because editor.rereplace('\r\n', ' ') and editor.rereplace(r'\r\n', ' ') will do exactly the same thing (for different reasons) when run.
In the first case, without the r, Python converts the escapes itself, so the literal carriage-return and line-feed characters are searched for; in the second case, the backslash sequences reach the regex engine untouched and are interpreted there as regular expression \r and \n, respectively. In the end it is the same effect, but… bad example for illustrating the point.
Also, I probably should have used r'\R' anyway, and with that one it is really obvious that we are handing over interpretation to the regex engine, because there is no such \R character!
-
Hi, @alan-kilborn,
As you just spoke about normal and raw strings, this reminded me of something weird in the Python world!
I used your script, below, to test the statistics of some files, including .exe ones:
https://community.notepad-plus-plus.org/post/61801
In the initial script, the search regex is simply '\w+'. But I needed the regex '(\x00?[A-Za-z0-9_])+' and it did not work. I had to use a raw string, so the regex r'(\x00?[A-Za-z0-9_])+'!
For instance,
- In a new tab, type the simple string @ABCDE
- Run the tiny script below, executed from the console (maybe, Alan, it’s not a good Python construction, but it works!!)
console.clear() ; editor.research ('(\x40?\w)+', lambda m: console.write (m.group(0)))
It correctly writes the string @ABCDE on the console
- Now, replace the @ char with the NUL char, with the help of the Character Panel
- Run this similar script:
console.clear() ; editor.research ('(\x00?\w)+', lambda m: console.write (m.group(0)))
This time, we get an error!
- Then, run this third try, using a raw string:
console.clear() ; editor.research (r'(\x00?\w)+', lambda m: console.write (m.group(0)))
Wow! It works as expected and displays the NUL char followed by the string ABCDE on the console
I hope I’m not disturbing you too much ;-))
See you later,
Cheers ( almost the right moment, in France ! )
guy038
-
-
Without looking into your specific examples just now – I’ll do that later, after my own “Cheers!” moments – I’ll say:
If you follow the rules, you get what is expected. If you don’t follow the rules, you get “unexpected” behavior – which could sometimes be what you expect, or often not!
The approximate rules are:
- If you are going to use a regular string, i.e., "..." and that string is going to contain a literal backslash, you must double that backslash (to escape it).
- If you are going to use a “raw” string, i.e., r"...", then you do NOT (typically!) double any backslashes. This would be the preferred way to do it for someone that writes a lot of literal paths in their source code, or someone that works heavily with regex.
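A small sketch of the two rules, using a made-up Windows path (`C:\temp\new` is purely illustrative). The third line shows what goes wrong when a plain string is written without doubling: Python reads `\t` and `\n` as a tab and a newline.

```python
doubled = "C:\\temp\\new"    # plain string, every backslash doubled
raw = r"C:\temp\new"         # raw string, backslashes typed once
careless = "C:\temp\new"     # plain string, NOT doubled: \t and \n get converted

rules_agree = (doubled == raw)
print(rules_agree)
print(len(raw), len(careless))             # the careless version is 2 characters shorter
print('\t' in careless, '\n' in careless)  # it now contains a tab and a newline
```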
Of course, there’s a bit more to the story, but that will come in a later posting…
There are also “byte strings” and “unicode strings”, denoted b"..." and u"..." respectively, but these are mainly (but not always) Python3 things, and Python3 for PythonScript is still in beta, so we won’t really discuss them.
Also note that we aren’t going too far off the topic of N++ with this, just enough for scriptwriters, so it is definitely “on-topic”.
-
-
So back to your examples.
It all does make sense.
Here’s how:
The first example is '(\x40?\w)+'.
What happens here is that Python encodes \x40 as a single character and passes it to the research function. Because \x40 is @, it works.
Had you done r'(\x40?\w)+' instead, Python would NOT have touched the \x40 part, and would have passed it as 4 distinct characters. The research function would have passed these same 4 characters into the Boost regex engine, and inside there it would have seen the notation \x40 and converted the 4 characters into one, equivalent to @. So the difference is WHERE the encoding of the four characters is done.
Thus, for the first example, it is successful either way, with a normal Python string or a raw string. It just gets there by different routes.
The second example is '(\x00?\w)+'. What happens here is that the \x00 gets encoded into a single NUL byte when Python parses the string literal, and because C strings are (often) NUL-terminated, meaning the NUL signals the end of the string, only a single ( (what comes before the NUL in your string) is seen by the regex engine. The regex engine reports that this is clearly a badly formed regex, and says:
RuntimeError: Unmatched marking parenthesis ( or \(.
When we move to the third example (related to the second), specifically r'(\x00?\w)+', you are again shifting the decoding to late in the game, when Boost gets hold of it. Thus, here Boost sees (\x00?\w)+ (each individual character in that) and things work. The string won’t get NUL-terminated early. Boost gets a chance at the whole string.
Hopefully this makes sense.
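The truncation can be imitated in plain Python. This is only a simulation: `as_c_string` is a hypothetical helper that cuts a string at its first NUL, the way a NUL-terminated C API would read it, and Python's `re` module stands in for Boost (Python's own `re` copes with NUL characters, so the cut has to be made by hand).

```python
import re

# The backslash before w is doubled only to avoid an invalid-escape
# warning in recent Python 3; the resulting string is the same.
plain = '(\x00?\\w)+'   # Python converts \x00 into a real NUL character
raw = r'(\x00?\w)+'     # the characters \ x 0 0 survive untouched

def as_c_string(s):
    """Hypothetical helper: the part a NUL-terminated C API would see."""
    return s.split('\x00', 1)[0]

seen_plain = as_c_string(plain)   # cut off right after the opening parenthesis
seen_raw = as_c_string(raw)       # no NUL inside, so nothing is cut

try:
    re.compile(seen_plain)
    plain_ok = True
except re.error:
    plain_ok = False              # an unmatched "(" is not a valid regex

raw_hit = re.match(seen_raw, '\x00ABCDE') is not None
print(repr(seen_plain), plain_ok, raw_hit)
```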
The bottom line is that Python will use some of the same “escape sequences” that regex will use, if you give it a chance to. By “give it a chance to”, I mean using a plain string with the syntax "..." (note, no little r preceding).
If you want Python to leave your strings alone, so that they are passed exactly as typed to something else, e.g., the regex engine, write them as raw strings, i.e., r"...".
So, what happens to the \w in the first example, when Python sees it? There is no r out front, so Python scans the string and can do things to it. Well, the answer is that Python does not know what \w is, so it sends it on as 2 characters, \ and w. Thus this goes on to Boost just as intended.
There’s yet another caveat. In a r"..." string, normally any \ are just literal characters, but a raw string cannot end in a single backslash. Thus if you type a=r'abc\' at the console, you’ll get an error:
SyntaxError: EOL while scanning string literal
so you’d need to do:
a = r'abc\\'
instead. Note that both backslashes stay in the resulting string, so it ends in two backslashes, not one. This most often comes up when specifying a directory, example:
mydir = r"c:\mydir1\mysubdir2\\"
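One detail worth checking here is how many backslashes actually end up in the string. The sketch below is plain Python; `compile()` is used only to show the syntax error without stopping the script, and appending a plain `'\\'` is one possible way (an addition, not from the posts above) to get exactly one trailing backslash.

```python
ends_two = r'abc\\'         # parses, but BOTH backslashes stay in the string
ends_one = r'abc' + '\\'    # raw part plus a plain string holding one backslash

try:
    # The source text being compiled is:  a = r'abc\'
    compile("a = r'abc\\'", '<demo>', 'exec')
    single_ok = True
except SyntaxError:
    single_ok = False       # a raw string cannot end in a single backslash

# Same idea applied to the directory example.
mydir = r"c:\mydir1\mysubdir2" + "\\"

print(len(ends_two), len(ends_one), single_ok)
print(mydir)
```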
So just some explanation about this; hopefully it helps in some way.
-
Hi, @alan-kilborn and All,
Ah, very interesting insight, indeed ;-))
Yes, it’s sometimes not easy to know when characters are interpreted, if different nested processes are involved!
So, globally, in most situations, it seems better to always use raw strings. I mean… when using regexes, of course
BR
guy038
-
@guy038 said in Join lines from clipboard data:
it seems better to always use raw strings. I mean… when using regexes, of course
I would say this is true.
Probably most normal Pythoners don’t use a lot of backslashes in their strings, so the simple "..." syntax works fine.
And you certainly can use "..." with regex as well, but just remember to “double up” every backslash if you do. (But that can make your regexes a nightmare to look at.)