Problem reading diacritics after using PDF extraction/conversion software

Vasile Caraus

Hello, and Merry Christmas!

Just a question. I have tested several free PDF-extraction programs. Some are very good. However, after converting a PDF to TXT and opening the result in Notepad++, diacritics such as ă, ş, ţ, î cannot be read. Instead, they are replaced with signs like \ { ] etc.

The problem occurs with most of the programs; only one works correctly (but I would have to purchase it, and it’s not worth it).

So I don’t know why only one program converts well and the others don’t. Is the problem in Notepad++ or in the conversion software?

glossar

There are a couple of things to take into consideration:

1. If the PDF file is not “dead” (i.e. one that contains scanned image text), try, without converting, copying the text from the PDF and pasting it into e.g. a Word file, and see the result.
2. Make sure the font chosen in your Notepad supports those characters. Unless you use some non-system, additionally installed fancy font, you should be able to see them (e.g. with the Courier New font).
3. Make sure the encoding is set to Unicode UTF-8 in Notepad and (if applicable) in the PDF converter that you are using.

That’s all I can tell for now. It is hard and tiresome to type on a phone. :-)

Vasile Caraus

So, my Notepad++ is set to open files as UTF-8 automatically. But all the .txt files, after conversion, open as UTF-8-BOM.
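As a side note on the UTF-8 versus UTF-8-BOM distinction mentioned here, the following is a minimal Python sketch for checking what a converted file really contains. The function name `describe_encoding` is hypothetical and only for illustration. It only inspects the bytes; a file can be perfectly valid UTF-8 and still hold the wrong characters if the converter wrote the wrong ones.

```python
import codecs

def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes as UTF-8-BOM, plain UTF-8, or not UTF-8."""
    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    return "UTF-8"

# Example usage with in-memory samples instead of real files.
sample = "ăşţî"
print(describe_encoding(sample.encode("utf-8-sig")))  # bytes with a BOM
print(describe_encoding(sample.encode("utf-8")))      # bytes without a BOM
print(describe_encoding(sample.encode("cp1250")))     # single-byte code page
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.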

The problem is in the .pdf itself in this one case, because even if I copy/paste the text into a new Notepad++ file, the problem is the same. Maybe this .pdf will help you understand my problem:

https://www.keepandshare.com/doc18/12485/test-file-pdf-949k?da=y

glossar

Hello Vasile,

I cannot download the file as I don’t have an account on that site. If you upload the pdf file to another file-sharing site that requires no registration/login, I can download it and have a look at it.

In the meantime, I have prepared two test files. The text is a section from the Diacritics article on Wikipedia and contains a lot of diacritics. The text file is UTF-8 (without BOM) encoded, and the pdf is the same text printed to pdf from MS Word. You should be able to copy and paste the pdf text, without converting, directly into a Word/Notepad file. Here, all the diacritics are displayed properly both in Notepad (with Cousine as the default font) and in Word.

The links to the said files are:

http://wikisend.com/download/105846/diacritics.txt
http://wikisend.com/download/328386/diacritics.pdf

Best,
Glossar

Vasile Caraus

Thanks for your files. My file is strange in a different way :)

I have uploaded it again here:

https://ufile.io/b8c0e

glossar

Hi,

I cannot download it; I receive this message: “The server at uploadfiles.io is taking too long to respond”.

Vasile Caraus

Then try this one, please:

http://wikisend.com/download/312182/test_file.pdf

glossar

I now see the problem.

I have looked into the file properties in Adobe Reader. The embedded fonts (more accurately, subsets of the fonts used) are ANSI-encoded. (See the screenshot linked below.) “ANSI” here means the code page of the system on which that particular pdf was created: if that is, for example, Windows-1252, then ANSI means Windows-1252 too. So unless this ANSI encoding (whichever one it might be) matches the system encoding, and in our case it seems it does not, such differences in characters will occur.

Another possible reason, and I am not sure whether it is essentially the same as the one above, is that the fonts used are themselves ANSI-encoded instead of Unicode. This means that even if the character numbers under the hood are the same, different characters are assigned to them, much like the same address having different residents in two different systems.
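The “same address, different residents” idea can be illustrated with a small Python sketch. It assumes, purely for illustration, that one side uses Windows-1250 (Central European) and the other Windows-1252 (Western European); the thread does not establish which code pages are actually involved, and the symbols seen in this particular pdf may come from a font-specific mapping rather than from either of these.

```python
# Four byte values, as an "ANSI" program might store them.
raw = b"\xe3\xba\xfe\xee"

# The same numbers are looked up in two different code pages.
as_cp1250 = raw.decode("cp1250")  # Central European
as_cp1252 = raw.decode("cp1252")  # Western European

print(as_cp1250)  # ăşţî
print(as_cp1252)  # ãºþî
```

Only the last byte (î) happens to mean the same character in both code pages; the other three come out as different letters or signs.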

The solution that springs to my mind, and it is not even worth trying, is to turn the pdf into a dead one, i.e. image text, and run OCR with related software/pdf converters.

https://snag.gy/JRakAH.jpg

Vasile Caraus

@glossar said:

it is not even worth trying, is to turn the pdf into a dead one

Hello Glossar. Thanks for the idea of converting the pdf to images. This is possibly a good solution, but if there are many, many pdfs (like 200) which I want to convert to .txt, it will be very hard.

So I found another solution. After converting all the files, I will use the “Find and Replace all in all folder” option on all those symbols (like [ and ]) that replace the diacritics. Each symbol always stands for the same letter and repeats all the time:

] = t
\|á = a
[ = s
~ = i
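For a large batch of files, the same find-and-replace idea could also be scripted. The following is a minimal Python sketch, not a tested tool: the names `REPLACEMENTS`, `fix_text` and `fix_folder` are hypothetical, and the mapping contains only the three single-character pairs from the list above (the multi-character pair is left out because its exact characters are unclear, and can be added to the dictionary once confirmed). Note that a blind replacement also changes any genuine [ ] ~ characters in the text.

```python
from pathlib import Path

# Mapping taken from the list above; the targets could be changed to
# ţ, ş, î if the diacritics themselves are wanted (an assumption).
REPLACEMENTS = {"]": "t", "[": "s", "~": "i"}

def fix_text(text, replacements=REPLACEMENTS):
    """Replace every wrong symbol in the text with its intended letter."""
    # Longer keys go first so multi-character keys are not broken up.
    for bad in sorted(replacements, key=len, reverse=True):
        text = text.replace(bad, replacements[bad])
    return text

def fix_folder(folder):
    """Fix every .txt file under the folder; return the paths that changed."""
    changed = []
    for path in sorted(Path(folder).rglob("*.txt")):
        # "utf-8-sig" reads UTF-8 with or without a BOM.
        original = path.read_text(encoding="utf-8-sig")
        fixed = fix_text(original)
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")
            changed.append(path)
    return changed

print(fix_text("[i ]ara"))  # si tara
```

Since `fix_folder` overwrites files in place, it would be prudent to run it on a copy of the folder first.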