Project sources: https://github.com/tarsa/TarsaMatchFinder
Description:
TarsaMatchFinder outputs matches which can be used in LZ77-style algorithms. Unlike other approaches ( https://encode.su/threads/2710-LZ-M-...ll=1#post51865 ) it doesn't only output longest match per each position. Instead, for every position and match length in specified interval (between min-match and max-match inclusive) it outputs a match (if there's a match for that position and length) with smallest offset.
In order to reduce the memory usage during match finding the matches are filtered before saving and then a linear interpolation process recovers the filtered matches at a very small cost.
Requirements:
- Java 8,
- SBT (Scala Build Tool) or Lightbend Activator (which is a wrapper over SBT),
Sample usage:
Source code isn't yet very convoluted so should be relatively easy to follow. To understand what's going on I recommend starting from verifier, then going to brute force match finder, then interpolator and then TarsaMatchFinder (which is a trickiest part of the whole toolkit).Code:piotrek@piotr-desktop:~/projekty/TarsaMatchFinder$ java -version java version "1.8.0_121" Java(TM) SE Runtime Environment (build 1.8.0_121-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode) piotrek@piotr-desktop:~/projekty/TarsaMatchFinder$ ll corpora/ razem 106720 drwxrwxr-x 2 piotrek piotrek 4096 lut 18 14:34 ./ drwxrwxr-x 9 piotrek piotrek 4096 lut 18 15:18 ../ -rw------- 1 piotrek piotrek 768771 lip 15 1991 book1 -rw-rw-r-- 1 piotrek piotrek 18217472 lut 18 15:15 book1.flt -rw-rw-r-- 1 piotrek piotrek 65442800 lut 18 15:15 book1.int -rw-rw-r-- 1 piotrek piotrek 1066978 lut 18 14:33 calgary.zip -rw-rw-r-- 1 piotrek piotrek 100000 lut 17 01:11 enwik5 -rw-rw-r-- 1 piotrek piotrek 1229072 lut 18 14:30 enwik5.flt -rw-rw-r-- 1 piotrek piotrek 11336848 lut 18 14:02 enwik5.int piotrek@piotr-desktop:~/projekty/TarsaMatchFinder$ alias activ alias activ='~/devel/activator-1.3.12-minimal/bin/activator' piotrek@piotr-desktop:~/projekty/TarsaMatchFinder$ activ [info] Loading global plugins from /home/piotrek/.sbt/0.13/plugins [info] Updating {file:/home/piotrek/.sbt/0.13/plugins/}global-plugins... [info] Resolving org.fusesource.jansi#jansi;1.4 ... [info] Done updating. [info] Loading project definition from /home/piotrek/projekty/TarsaMatchFinder/project [info] Set current project to TarsaMatchFinder (in build file:/home/piotrek/projekty/TarsaMatchFinder/) > run [info] Updating {file:/home/piotrek/projekty/TarsaMatchFinder/}tarsamatchfinder... [info] Resolving jline#jline;2.14.1 ... [info] Done updating. [info] Compiling 7 Scala sources to /home/piotrek/projekty/TarsaMatchFinder/target/scala-2.12/classes... [info] Running pl.tarsa.matchfinders.Main Please specify a command Available commands (case sensitive): help displays this help find-matches <input> <finder> <min> <max> <compacted> finds matches in input file and stores the essential ones input: input file with original data finder: match finder, one of: bfmf: brute force match finder tmf: Tarsa match finder min: minimum match size, min >= 1, min <= max max: maximum match size, max >= min, max <= 120 compacted: file where filtered matches will be written interpolate <input> <compacted> <interpolated> reconstructs full set of matches from essential ones input: input file with original data compacted: file with filtered matches interpolated: file where full set of matches will be written verify <input> <interpolated> uses brute force match finder to verify presence of all matches input: input file with original data interpolated: file with full set of matches [success] Total time: 4 s, completed 2017-02-18 15:15:05 > run find-matches corpora/book1 tmf 3 120 corpora/book1.flt [info] Running pl.tarsa.matchfinders.Main find-matches corpora/book1 tmf 3 120 corpora/book1.flt [success] Total time: 1 s, completed 2017-02-18 15:15:14 > run interpolate corpora/book1 corpora/book1.flt corpora/book1.int [info] Running pl.tarsa.matchfinders.Main interpolate corpora/book1 corpora/book1.flt corpora/book1.int [success] Total time: 0 s, completed 2017-02-18 15:15:21 > run verify corpora/book1 corpora/book1.int [info] Running pl.tarsa.matchfinders.Main verify corpora/book1 corpora/book1.int Verification OK [success] Total time: 801 s, completed 2017-02-18 15:28:47 >
Beware that the memory usage is quite high and performance isn't yet impressive. I recommend starting experiments with small files (ten of kilobytes) and not exceeding one megabyte. If you don't tune JVM memory settings then you can run into memory exhaustion and this is not easy to spot, as before throwing OutOfMemoryError the JVM sometimes runs the multithreaded GC for a long time. This program is single threaded so if you notice it using many cores then probably you've run into memory exhaustion and GC went crazy. JVM doesn't try to use all of the available memory for the Java objects - instead it has own heap with a fixed size limit set at JVM startup.