I wish to encode, very compactly, the domain of each website a random Internet user visits, but I find the compression achieved by generic tools insufficient.
It ought to be possible to derive better models both from the formal restrictions:
- each A-label must match `^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$`; and
- the entire domain, including '.' label delimiters, must not exceed 253 characters in presentation form (the oft-quoted 255-octet limit includes wire-format length bytes; a validity check implementing both rules is sketched after the lists)
and from heuristics, including:
- lower-order U-labels are often lexically, syntactically and semantically valid phrases (including proper nouns and numerals) in some natural language, albeit stripped of all whitespace and punctuation except hyphens and folded per Nameprep, with a preference for shorter phrases;
- higher-order labels are drawn from a dictionary of SLDs and TLDs and provide context for predicting which natural language is used in the lower-order labels; and
- statistics of website popularity (such as the Alexa or Netcraft rankings), potentially refined according to some environmental context.
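To make the formal restrictions concrete, here is a minimal validity check (a sketch only: the function name is my own, and I am reading the length limit as 253 characters in presentation form):

```python
import re

# LDH rule for a single A-label: 1-63 characters, alphanumeric at both
# ends, hyphens permitted only in interior positions.
A_LABEL = re.compile(r"^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$")

def is_valid_domain(domain: str) -> bool:
    """Hypothetical helper: check a lowercase presentation-form domain
    against the formal restrictions above."""
    # 253 characters in presentation form corresponds to the 255-octet
    # wire-format limit (which also counts length bytes and the root label).
    if len(domain) > 253:
        return False
    return all(A_LABEL.fullmatch(label) for label in domain.split("."))
```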
One thought is to build a Huffman code over, e.g., Alexa's list of the top 1M websites, together with an escape code for non-listed sites, as sketched below.
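A minimal sketch of that idea, assuming invented popularity counts and an explicit escape sentinel (a real build would use the actual ranking data, with the escape weighted by the expected fraction of visits to unlisted domains):

```python
import heapq

ESCAPE = "<ESC>"  # sentinel for domains outside the top-1M list

def huffman_code(weights: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code (symbol -> bit string) from symbol weights."""
    # Each heap entry is (weight, tiebreak, symbols-under-this-subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    code = {s: "" for s in weights}
    i = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1:               # prepend a bit as we merge subtrees
            code[s] = "0" + code[s]
        for s in syms2:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (w1 + w2, i, syms1 + syms2))
        i += 1
    return code

# Invented counts for illustration only.
counts = {"google.com": 5000, "facebook.com": 3000,
          "wikipedia.org": 1000, ESCAPE: 500}
code = huffman_code(counts)  # e.g. frequent sites get the shortest codes
```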
The question remains how best to encode such non-listed sites. I currently imagine using the same training set to build Huffman codes over every TLD and many popular SLDs; then choosing, from a predefined set, an adaptive natural-language model and initial state for the lower-order labels, indicating which combination has been used with another Huffman code (perhaps itself conditioned on the TLD and/or SLD).
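Structurally, the escape path I have in mind might look roughly like this; every table and model here (TLD_CODE, MODEL_CODE, UniformModel, choose_model) is a toy stand-in for the trained components described above:

```python
# Toy stand-ins for the trained tables; a real encoder would derive
# these from the same training set as the top-1M Huffman code.
TLD_CODE = {"com": "0", "org": "10", "net": "11"}
MODEL_CODE = {tld: {"english": "0", "pinyin": "1"} for tld in TLD_CODE}

class UniformModel:
    """Placeholder for an adaptive PPM/DMC coder: a flat 6 bits per
    character over the LDH alphabet plus a '.' end-of-label marker."""
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
    def encode(self, text: str) -> str:
        return "".join(format(self.ALPHABET.index(c), "06b") for c in text)

MODELS = {"english": UniformModel(), "pinyin": UniformModel()}

def choose_model(tld: str, labels: list[str]) -> str:
    return "english"  # placeholder heuristic; would condition on the TLD

def encode_unlisted(domain: str) -> str:
    """Escape path: emit the Huffman code for the TLD, then a model
    identifier coded in the context of that TLD, then the lower-order
    labels under the chosen model."""
    *rest, tld = domain.lower().split(".")
    bits = TLD_CODE[tld]
    model_id = choose_model(tld, rest)
    bits += MODEL_CODE[tld][model_id]
    model = MODELS[model_id]
    for label in rest:
        bits += model.encode(label + ".")  # '.' doubles as end-of-label
    return bits

print(encode_unlisted("example.com"))
```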
However, I am a little unsure what type(s) of adaptive natural-language models would work best (perhaps after applying dictionary transforms): my research to date has led me to PPM (prediction by partial matching) and DMC (dynamic Markov compression), although I'm not even sure that a BWT-based approach wouldn't be better for some languages. I would welcome your thoughts.
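One cheap way to compare candidates would be to measure the ideal code length, -log2(p), that each model assigns to held-out labels. A minimal adaptive order-1 character model, loosely in the PPM spirit but with Laplace smoothing rather than true escape handling, might look like:

```python
import math
from collections import defaultdict

class Order1Model:
    """Minimal adaptive order-1 character model with Laplace smoothing;
    a stand-in for a real PPM/DMC coder, for quick comparisons only."""
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def cost_bits(self, text: str, update: bool = True) -> float:
        """Ideal code length of `text` in bits, adapting as it goes."""
        bits, prev = 0.0, "^"  # "^" marks start-of-label context
        for c in text:
            p = ((self.counts[prev][c] + 1) /
                 (self.totals[prev] + len(self.ALPHABET)))
            bits -= math.log2(p)
            if update:
                self.counts[prev][c] += 1
                self.totals[prev] += 1
            prev = c
        return bits

model = Order1Model()
for label in ["stack", "overflow", "exchange"]:  # toy training labels
    model.cost_bits(label + ".")
print(model.cost_bits("stacking.", update=False))  # cost of a held-out label
```

Swapping in deeper contexts, escape-based blending, or a DMC-style bit-level state machine would just change the probability estimate; the comparison harness stays the same.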
I would, of course, like to avoid reinventing the wheel if at all possible, so please do shout if existing tools might already meet my needs, or if I'm treading a long way from solid ground.