Thread: Best Compression Algorithm for a File Hosting service?

  1. #1
    Join Date
    Nov 2011
    Mostly behind my Computer
    Thanked 0 Times in 0 Posts

    Best Compression Algorithm for a File Hosting service?

    Hello there, everyone.

    While browsing the internet, I got frustrated with all the big file hosting websites where you either have to pay to download at a decent speed, or put up with a free account that has waiting times, slow download rates, an AWFUL amount of advertisements, only a few downloads a day, etc.

    So I was thinking about opening my own file hosting service. Of course I understand that the storage needed to host all those files can easily become very expensive, and that the other file hosting services need all those advertisements etc. to make any money at all. However, I had an idea for how this might be solved.

    What if you compress all files on a webserver?
    If I am right, most compression algorithms tend to compress better when they have more data to work with (more data -> higher chance of duplicate data). So if you compress all files on the webserver together, this could save tons of space. For instance, if a user uploads a file that already exists on the server, the duplicate should take almost no extra space.

    However, right now I'm wondering which algorithm would work best in this kind of situation. I'm new to compression techniques, and although I've read Wikipedia (and some PDFs linked from Wikipedia) and have a general understanding of math and programming, I'm still a rookie.

    I need an algorithm that:
    -Compresses data better when there is a LOT of other data, i.e. algorithms with a sliding window are probably not the best choice.
    -Allows each file to be compressed and decompressed independently.
    -Doesn't take too long to compress or decompress.
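
The sliding-window concern in the first requirement can be checked with a small experiment. This is only a sketch under stated assumptions: the "upload" is 100 kB of random bytes (a stand-in for an incompressible file, not real user data), and the two codecs are simply the ones in Python's standard library, not a recommendation. DEFLATE (zlib) can only look back 32 KiB, so it misses a duplicate that starts 100 kB earlier, while LZMA at its default preset uses a multi-megabyte dictionary and finds it.

```python
import lzma
import os
import zlib


def compressed_sizes(payload: bytes) -> dict:
    """Return the compressed size of `payload` under two stdlib codecs."""
    return {
        "zlib": len(zlib.compress(payload, 9)),  # DEFLATE, 32 KiB window
        "lzma": len(lzma.compress(payload)),     # LZMA, large dictionary by default
    }


# Hypothetical "uploaded file": 100 kB of incompressible bytes.
upload = os.urandom(100_000)

# The same upload stored twice back to back, as if two users sent the same file.
sizes = compressed_sizes(upload + upload)
print(sizes)
```

With these inputs the zlib result should stay close to the full 200 kB, while the lzma result should be close to a single copy. Note that even the large-dictionary codec only helps while both copies sit in one stream, which conflicts with the requirement to decompress files independently.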

    This makes me think of context-based compression techniques.

    I was thinking about DMC, but I don't know whether it really is a good idea. I've also read the Wikipedia article on PAQ, but I still don't understand how it works internally.

    Could somebody please help me?

    Have a great day!


  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Ślůnsk, PL
    Thanked 65 Times in 47 Posts
    To decompress a file, you need its context available in uncompressed form. You don't want to store the context uncompressed, because that takes a lot of space, so you would have to decompress it every time a file is accessed. Therefore, if you want to be able to decompress files individually, DMC, CM (PAQ) etc. won't be able to use other files as context. What you probably want is deduplication (as smart as possible) combined with any compression algorithm that meets your speed/strength requirements.
    One useful thing would be splitting files out of containers, e.g. out of zips, so that the files inside can be deduplicated too.
    If I were you, I'd talk with Ocarina Networks (now Dell); last time I checked, they had the smartest dedupe engine around and good compression to back it up. They have some container parsing, but I'm not 100% sure that it's lossless.
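
A minimal sketch of that combination, assuming whole-file deduplication keyed by a SHA-256 hash and zlib as the per-file compressor. The `DedupStore` class and its method names are made up for illustration, and a real engine would likely deduplicate at chunk level and keep its index on disk.

```python
import hashlib
import zlib


class DedupStore:
    """Toy in-memory store: one compressed blob per distinct file content."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> compressed bytes
        self.names = {}  # user-visible file name -> digest

    def put(self, name: str, data: bytes) -> str:
        """Store `data` under `name`; identical content is kept only once."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            # Each blob is compressed on its own, so it can be read back
            # without touching any other file.
            self.blobs[digest] = zlib.compress(data, 6)
        self.names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        """Return the original bytes stored under `name`."""
        return zlib.decompress(self.blobs[self.names[name]])

    def stored_bytes(self) -> int:
        """Total size of all compressed blobs."""
        return sum(len(blob) for blob in self.blobs.values())


report = b"quarterly numbers\n" * 1000
store = DedupStore()
store.put("alice/report.txt", report)
size_after_first = store.stored_bytes()
store.put("bob/copy-of-report.txt", report)
size_after_second = store.stored_bytes()
```

Here the second upload adds a name but no new blob, so `size_after_second` equals `size_after_first`, and either name can still be fetched on its own.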
    Though is lossless compression really needed?
    Image hosting sites recompress what you upload to them. Would it be so bad to replace one zip with another that has exactly the same contents and just a different binary representation?
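
To illustrate the "same contents, different binary representation" idea, here is a rough sketch that fingerprints a zip by the hashes of its uncompressed members using Python's `zipfile` module. The helper names `member_digests` and `make_zip` are invented for this example, and it assumes that member names plus content are all that matter; timestamps, comments, member order and encryption are ignored.

```python
import hashlib
import io
import zipfile


def member_digests(zip_bytes: bytes) -> dict:
    """Map each member name to the SHA-256 of its uncompressed content."""
    digests = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def make_zip(files: dict, method: int) -> bytes:
    """Build a zip in memory from {name: bytes} with the given compression method."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


files = {"a.txt": b"hello " * 500, "b.txt": b"world " * 500}
stored = make_zip(files, zipfile.ZIP_STORED)      # members stored uncompressed
deflated = make_zip(files, zipfile.ZIP_DEFLATED)  # members DEFLATE-compressed

same_contents = member_digests(stored) == member_digests(deflated)
```

The two archives differ byte for byte, yet `same_contents` is true, so a store that works on the extracted members could keep one copy of each and rebuild an equivalent (not identical) zip on download.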
    Last edited by m^2; 25th November 2011 at 20:28.

  3. #3
    Join Date
    Feb 2010
    Thanked 36 Times in 12 Posts
    A good approach just to get started is ZFS with dedup and compression enabled. The big questions are what kind of data it is, whether it is in archives, and whether it is password-protected. Know your data. Also consider what kind of reliability guarantees you will be giving.

  4. #4
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Melbourne, Florida, USA
    Thanked 798 Times in 489 Posts
    One problem is that people will probably upload already compressed files like zip or 7zip archives. Otherwise you could use an existing compressor like 7zip rather than write your own.
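
One cheap way to spot such uploads is to trial-compress a sample and skip recompression when it barely shrinks. This is a heuristic sketch only: the function name `looks_compressed`, the 64 KiB sample size and the 0.95 threshold are arbitrary assumptions, and random bytes stand in for an already compressed archive.

```python
import os
import zlib


def looks_compressed(data: bytes, sample_size: int = 65536, threshold: float = 0.95) -> bool:
    """Guess whether `data` is already compressed by trial-compressing a prefix."""
    sample = data[:sample_size]
    if not sample:
        return False
    # Fast compression level: this is only a probe, not the final encoding.
    ratio = len(zlib.compress(sample, 1)) / len(sample)
    return ratio >= threshold


text_upload = b"the quick brown fox jumps over the lazy dog\n" * 5000
archive_like_upload = os.urandom(100_000)  # stand-in for a zip/7z payload

text_flag = looks_compressed(text_upload)
archive_flag = looks_compressed(archive_like_upload)
```

For these inputs the plain-text upload is flagged as worth compressing and the archive-like one is not. Files flagged as compressed can still go through deduplication, which does not depend on compressibility.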

    The Dell/Ocarina DX6000G will compress better (it has some of my software in it) and deduplicate, if you don't mind spending around $20K. The compression is transparent (and lossless). It just looks like a file system with more space.

  5. #5
    Join Date
    Nov 2011
    Mostly behind my Computer
    Thanked 0 Times in 0 Posts
    Thanks a lot for the fast replies! I hadn't heard of the term deduplication before, so that really helped me along. The Ocarina Networks solution looks great, but sadly I don't have $20K to spare ^^'. Anyway, thank you very much.
