Editing a 600 MB XML file

dail

Keep in mind that handling very large files is difficult. Some editors are made for large files and load only part of the file at a time, but N++ has to load the entire file into memory, which is impractical for something like a 10GB file. On top of that, a format like XML adds a lot more overhead: parsing the file, storing the style information for each character, tracking folding states, etc. For files that large you’re better off using an external program or a scripting language, since those usually don’t have to load the entire file into memory.
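As a rough illustration of the scripting route, the sketch below uses Python's standard `xml.etree.ElementTree.iterparse` to walk an XML document element by element instead of building the whole tree first. The function name `count_elements` and the tag `item` are made up for this example; the real file's structure is not known from this thread.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count elements named `tag` while keeping only a small part of the tree in memory."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop everything collected under the root so far,
            # so memory use does not grow with the file size.
            root.clear()
    return count

# Tiny in-memory sample; `source` can also be a path to a large file on disk.
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/></root>")
print(count_elements(sample, "item"))
```

The same loop could inspect or rewrite each element before it is discarded; counting is just the simplest thing to show.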

Ricardo

@Maor-Bachar said:

seem like it doesn’t have XML style?

I think they unfortunately restrict extra coloring schemes to their paid version: http://www.editpadpro.com/cscs.html

AJ Baxter

What size XML can you load in N++ ?

Ricardo

I don’t think file type matters, but…
If I have no plugins and no other file opened, I can open files with maximum size around 362MB, before it displays “File is too big to be opened” error.
But with my plugins enabled, the limit lowers to about 150MB.

Ricardo

Hello, for people needing an x64 build of Notepad++, I made one here.

Notes:

This is a quick unofficial build – not tested by the devs.
You need a 64-bit OS.
It is not compatible with 32-bit plugins (i.e. all available plugins), so you need to rename your /plugins folder or move it elsewhere.
It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files first (rename or move them).

David Bailey

I really think that anyone who needs to edit a 600MB XML file should be thinking of alternatives! I mean, those files contain structures, and you really need to use something that will respect that structure - which isn’t a text editor.

I don’t know what this file contains, but it might be useful to arrange for it to be stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!

I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

David

Klaus Lehmann

hi david
excuse me.
be happy, that You never had to edit a 10GB-xml-file.
for that, notepad++ is NOT Your friend ;-)
but I want to give ricardo’s 64bit edition a try! it’s new to me ;-)
Yours klaus

David Bailey

Klaus,

“be happy, that You never had to edit a 10GB-xml-file.”

I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

As a preliminary suggestion, I would suggest that you write a C program to read the file, and recognise the data you want to change. You could then modify the program to actually perform a change (but I would back up your file before you start :) ).
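The same idea can be sketched in a few lines of Python rather than C (a substitution on my part, chosen only for brevity). The sketch reads the input one line at a time and writes the result to a separate file, so the original stays untouched as a backup. The function name `replace_in_file` and its parameters are hypothetical, and it assumes the data to change never spans a line break.

```python
import re

def replace_in_file(src_path, dst_path, pattern, replacement, encoding="utf-8"):
    """Copy src to dst line by line, applying a regex substitution.

    Returns the number of lines that were changed.
    """
    regex = re.compile(pattern)
    changed = 0
    # newline="" keeps the original line endings (LF or CRLF) untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # only one line is held in memory at a time
            new_line, n = regex.subn(replacement, line)
            if n:
                changed += 1
            dst.write(new_line)
    return changed
```

A first run with a replacement equal to the matched text (or a variant that only counts) shows how many lines would be touched before anything is really changed.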

David

Ricardo

woot
For editing a 10GB file, I guess the best option would be an editor that doesn’t load the entire file into memory, but reads it directly from disk. Otherwise, you would need a lot of RAM.
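One way a program can work on a file without reading it all in up front is a memory-mapped file, where the operating system pages in only the parts that are touched. The Python sketch below searches a file that way. It is only an illustration of the idea, not a description of how any particular editor works; the name `find_offsets` is made up, and for files larger than a few GB it assumes a 64-bit Python.

```python
import mmap

def find_offsets(path, needle, limit=10):
    """Return up to `limit` byte offsets at which `needle` occurs in the file."""
    offsets = []
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The mapping behaves like a bytes object, but pages are
        # loaded from disk on demand instead of all at once.
        pos = mm.find(needle)
        while pos != -1 and len(offsets) < limit:
            offsets.append(pos)
            pos = mm.find(needle, pos + 1)
    return offsets
```

Note that an empty file cannot be mapped, so a real tool would need to check the file size first.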

guy038

Hello Maor Bachar and All,

Why don’t you give the old, but excellent, scripting program GAWK.exe a try? It can be used on both Unix and Windows machines (with the appropriate executable file, of course).

AFAIK, you can get the latest Windows version of gawk.exe (v4.1.0) at the address below:

https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

To get an overview of the main features of gawk and what’s new in v4.1.0, follow this link:

http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

And here is the link to download the latest reference manual for GAWK v4.1.x, in various formats:

http://www.gnu.org/software/gawk/manual/

The PDF version is quite recent: April 2015!

To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m on holiday at the moment), I’ll be able to build bigger files!

Briefly: I first created, in N++, a file of about 200 MB, named xxx.txt. Then I copied this file five times into the All.txt file, in a Command Prompt window, with the commands:

type xxx.txt>All.txt
type xxx.txt>>All.txt
type xxx.txt>>All.txt
type xxx.txt>>All.txt
type xxx.txt>>All.txt

I then appended a final line to the All.txt file:

echo END of the FILE>>All.txt

Line lengths range from 0 to 150 characters. In the end, the All.txt file contains about 18 498 000 lines, for 1026 MB!

I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded.

#----------------------------------------------------------------------------------------------------------------#
#  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
#            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
#                                                                                                                #
#  SYNTAX :                                                                                                      #
#  ¯¯¯¯¯¯                                                                                                        #
#                                                                                                                #
#    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
#----------------------------------------------------------------------------------------------------------------#

{
  if ( pattern == "" )
    {

      #---------------------------------------------------------------------------#
      #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
      #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
      #                                                                           #
      #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
      #                                                                           #
      #  IF the PATTERN is NOT initialized  :                                     #
      #    SET the VARIABLE 'total' to the value 0                                #
      #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
      #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
      #    SKIP to the NEXT line to READ                                          #
      #---------------------------------------------------------------------------#

      total = 0 ; pattern = $0

      next
    }
}

#------------------------------------------------------------------------------#
#    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
#      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
#    SKIP to the NEXT line to READ, in ALL cases                               #
#------------------------------------------------------------------------------#

{
  if ($0 ~ pattern) { ++total }

#   OR     if ($0 ~ pattern) { ++total ; print }

  next
}


#-------------------------------------------------------------------------------------------------------------------#
#  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
#-------------------------------------------------------------------------------------------------------------------#

END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }

The command echo .*|gawk -f Script.txt - all.txt gives me the number 18 498 231, which is the number of lines matching the regex .*

The command echo END of the FILE|gawk -f Script.txt - all.txt gives me the number 1, which shows that it correctly read the last line, the 18 498 231st, of the 1 GB file All.txt!

On my old Win XP laptop, with only 1 GB of RAM (2 × 512 MB), I got the result from each command in about 1 min 30 s! Not too bad, is it?

Of course, my script just counts the matches but, with GAWK, you can manipulate files in multiple ways, do calculations or searches/replacements or … It’s a very powerful tool, although tiny in size: 223 246 bytes for the v4.1 version.

You’ll just have to skim the gawk documentation a bit!

And, as Ricardo suggested, it seems to read files directly from disk :-) => seemingly no limit in size
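For anyone without gawk at hand, the same line-by-line counting can be sketched in Python. This is only a rough counterpart of Script.txt, not a translation: Python's `re` syntax differs in places from gawk's regexes, the name `count_matching_lines` is made up, and the default encoding `latin-1` is an assumption standing in for "ANSI".

```python
import re

def count_matching_lines(path, pattern, encoding="latin-1"):
    """Count the lines of a file that match a regex, reading one line at a time."""
    regex = re.compile(pattern)
    total = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            # Like gawk's $0, test the line without its trailing newline.
            if regex.search(line.rstrip("\n")):
                total += 1
    return total
```

Called with the pattern `.*` it counts every line, empty ones included, just as the first gawk command does.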

Best Regards,

guy038

Klaus Lehmann

@Ricardo
I think Crisp (64-bit) can do it!
But it costs 100-250 per year!

And for my biggest files (10GB files full of rubbish-xml-gore), I have to use VEDIT Pro (64-bit).
Attention: high price! $240 (no joke!)

Yours klaus