The entire Geocities site (famous for terrible web design 10 years ago!) is about to be released as a single, ginormous 900GB archive.

Much of the hassle in preparing the archive was compressing it!
This is going to be one hell of a torrent: the compression is happening as we speak, and it's making a machine or two very unhappy for weeks on end.
Just as a theoretical discussion, what would the best strategy be to compress such a huge amount of data? I don't even know the original format, but I'd guess it's a very, very large directory tree. And as a total guess, I bet the team's approach is to isolate something like 1GB subtrees and feed each one into ZIP or WinRAR.

Of course this would mean they're giving up a lot of efficiency, since there's no hope of data in one subarchive helping compress data in a different subarchive...
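Just to make that guess concrete, here's a rough sketch of what I imagine the chunk-and-zip approach looks like (the paths and the 1GB threshold are pure guesswork on my part, not anything the team has described). Each batch gets compressed in isolation, so redundancy between batches is simply never seen by the compressor:

    import os
    import zipfile

    CHUNK_BYTES = 1 << 30  # ~1GB per sub-archive (a guess at the threshold)

    def chunked_zip(root, out_prefix):
        """Walk a directory tree and pack files into ~1GB zip archives."""
        batch, batch_size, index = [], 0, 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                # Flush the current batch once it crosses the size threshold.
                if batch and batch_size + size > CHUNK_BYTES:
                    write_zip(f"{out_prefix}-{index:04d}.zip", root, batch)
                    batch, batch_size, index = [], 0, index + 1
                batch.append(path)
                batch_size += size
        if batch:
            write_zip(f"{out_prefix}-{index:04d}.zip", root, batch)

    def write_zip(archive_name, root, paths):
        # Each zip is compressed independently: DEFLATE never sees
        # matching data that happens to live in a different batch.
        with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                zf.write(path, arcname=os.path.relpath(path, root))

    chunked_zip("geocities", "geocities-part")  # hypothetical paths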

It would be interesting (not important, but interesting) to see how different algorithms would deal with compressing such huge data... kind of a ginormous enwik8-like benchmark.
It's no longer insane to think about handling so much data (a 2TB hard drive is $100), so it might be interesting to think about how to compress and access such an archive.

As a wild initial "practical and decent" strategy, streaming the whole archive as a flat tar-like file through 7zip would likely be a good tradeoff between speed and size. I pick 7zip because of its very large LZ77-style dictionary windows, which let similar markup from one website help compress later websites. (But a compressed tar-like archive wouldn't allow random access... it's all tradeoffs!)
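For what it's worth, Python's tarfile and lzma modules are enough to sketch that kind of pipeline: tar the whole tree into one stream and push it through LZMA2 with a big dictionary. The 256MB dictionary and the paths below are just illustrative; the real release is presumably being built with the actual 7zip tool, not a script like this:

    import lzma
    import tarfile

    # LZMA2 with a 256MB dictionary (illustrative), so matches can reach far
    # back into earlier websites in the stream.
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 9, "dict_size": 1 << 28}]

    # One long tar stream piped straight into the compressor: great ratio,
    # but no random access -- reading one site means decompressing up to it.
    with lzma.open("geocities.tar.xz", "wb", format=lzma.FORMAT_XZ,
                   filters=filters) as xz:
        with tarfile.open(fileobj=xz, mode="w|") as tar:
            tar.add("geocities")  # hypothetical path to the directory tree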

The link above goes to the site that will be hosting the torrent soon.