
Thread: atxd All-Terrain colorizing heX Dumper for data & code visualization

  1. #1
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    atxd All-Terrain colorizing heX Dumper for data & code visualization

    I've wished for decades I had a hex dumper that made various features of data pop out visually, so you could easily see strides, text-related stuff, and simple numeric regularities, but I never got around to writing one. The other day I realized how easy it could be if I just generate rudimentary html and let the browser handle the rendering, so I wrote a little Python program to do it, which is just a few dozen lines. It turns out to be even easier to use and cooler than I expected.

    I call it atxd, which might stand for what I said in the subject line, or the Austin, Texas heX Dumper, or something else. It can't stand for the Amazing Technicolor heX Dumper, because Technicolor is a valuable trademark owned by Technicolor SA.

    Anyhow, the basic shtik is to display each byte of a file as hex, like in a normal hex dumper, but to encode information about that value in two ways. The background color of the little box around each byte indicates its numeric value, with white for 0, black for 255, and grays for values in between.

    The color of the hex characters for a byte tells you other things, namely:

    (1) whether it's a common x86 opcode value (green)
    (2) whether it's a space or an English letter (red) or digit (orange), or
    (3) something else, in which case it's usually blue.
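A minimal sketch of the two encodings described above, in Python since that is what atxd is written in. This is not the atxd source: the function names, the inline-style HTML, and the particular opcode set are all assumptions made for illustration, and the post only says that "something else" is usually (not always) blue.

```python
# Hypothetical stand-ins for "common x86 opcode" bytes; the post does not list
# the values atxd actually treats as common.
COMMON_X86_OPCODES = {0x8B, 0x89, 0xE8, 0xC3, 0x85, 0x83}


def box_gray(value: int) -> str:
    """Background gray for a byte: 0 maps to white, 255 to black."""
    level = 255 - value
    return f"#{level:02x}{level:02x}{level:02x}"


def text_color(value: int) -> str:
    """Color of the hex characters, applying the three rules in order."""
    if value in COMMON_X86_OPCODES:
        return "green"
    is_letter = 0x41 <= value <= 0x5A or 0x61 <= value <= 0x7A
    if value == 0x20 or is_letter:
        return "red"
    if 0x30 <= value <= 0x39:
        return "orange"
    return "blue"  # simplification: the post says "usually" blue


def byte_span(value: int) -> str:
    """One colored box of hex text for a single byte."""
    style = f"background:{box_gray(value)};color:{text_color(value)}"
    return f'<span style="{style}">{value:02X}</span>'


def dump_html(data: bytes) -> str:
    """Space-separated spans in a monospace block, so the browser does the wrapping."""
    body = " ".join(byte_span(b) for b in data)
    return f'<div style="font-family:monospace">{body}</div>'
```

Calling `dump_html(open(path, "rb").read())` and writing the result to an .html file would give a page that rewraps when the browser window is resized.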

    You can vary the line width in two ways, and that's very useful. If you pick a line width that's a multiple of several common strides, and you have strided data, the stride is usually apparent---e.g., if your line width is 60, strides of 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, and 60 will usually show up clearly (assuming they go on long enough) because the pattern will wrap around and line up under itself.

    You can also usually see strides that are a little off from those, because you get diagonals. Something that repeats at an interval that's a little less than 60 will wrap around and fall just short of lining up under itself, so it will move left, and something that's a bit too long will pass itself and land under itself but shifted to the right. When you see how many spaces the pattern is from being lined up vertically, you can figure out the stride.
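The arithmetic behind the diagonals can be written down directly. The sketch below uses hypothetical helper names and assumes the true stride is close to the line width; a much shorter stride that nearly divides the width produces a similar drift and needs a second look.

```python
def aligned_strides(line_width: int) -> list[int]:
    """Strides (2 and up) that divide the line width and so stack vertically."""
    return [s for s in range(2, line_width + 1) if line_width % s == 0]


def stride_from_drift(line_width: int, drift_per_row: int) -> int:
    """Estimate the stride from how far a pattern moves on each row down.

    drift_per_row is negative when the pattern moves left and positive when
    it moves right.
    """
    return line_width + drift_per_row


print(aligned_strides(60))
print(stride_from_drift(60, -4))  # drifting 4 columns left per row
```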

    Here's an example, using sao, the star catalogue file from the Silesia Corpus. I picked a stretch of a few thousand KB at random, and plotted it with a line length (picture stride) of 60, my usual default for a first try. It looked like this:

    [Image: sao_s1_x60.png]

    At first I didn't see the stride, but after I looked at it a bit, I thought I could see some streakiness going downward to the left at a shallow angle, maybe a few bytes to the left each row down, which would indicate some stridiness in the mid-50's. I took off my glasses and backed up a few paces from my monitor, and it seemed a bit clearer---light bluish streaks with beige edges. (It may be clearer in the thumbnail above than it was on the screen full-size.)

    I thought maybe that would be clearer at a shorter stride, so I tried 30, which is very easy---all you have to do is narrow your browser window until all the lines wrap in the very center. (The hexdump is just an html text file with a fixed-width font environment, so it wraps perfectly.)

    That definitely made it clearer:
    [Image: sao_s1_xd30.png]
    Now we have light-colored bluish streaks going down and to the left about two bytes per line. Looking closely, you can find values that repeat exactly that way, over and over (e.g., hex 15). So our stride clearly is two less than the picture stride of 30, i.e., 28.

    Rerunning atxd with a line length of 28, I got this:

    [Image: sao_s1_xd28.png]

    From that picture you can see a lot of the same regularities I mentioned in an earlier thread, which I'd found using fv and simple filters. Either way, you can make a good guess that you're looking at mostly little-endian numbers, where the columns representing the high bytes of numbers are less variable than the ones representing the low bytes. This looks like 28-byte records that start with two 8-byte numbers, LSB first, the second one more variable than the first, and end with two 4-byte numbers, also LSB first. In between there are 4 more bytes to figure out.
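The "high bytes vary less than low bytes" observation can also be checked numerically rather than by eye. A rough sketch, assuming the stride is already known; the function name is made up here, and counting distinct values is just one simple stand-in for "variability":

```python
import struct


def distinct_values_per_column(data: bytes, stride: int) -> list[int]:
    """Number of distinct byte values seen in each column of a strided layout."""
    return [len(set(data[col::stride])) for col in range(stride)]


# Synthetic example: 300 consecutive 16-bit little-endian counters.
sample = b"".join(struct.pack("<H", i) for i in range(300))
print(distinct_values_per_column(sample, 2))
```

In the synthetic data the low-byte column takes every possible value while the high-byte column takes only two, which is the same signature as the quiet and busy columns in the picture.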

    With the dumper, the first two of those are very easy---a bit to the right of the midline of the picture you have one column of red numbers on lightish gray, suggesting capital letters (lower-case letter boxes are darker because of their higher codes). Just to the right of it, you have another column of light gray, but with yellow hex digits, suggesting digit characters. Looking closely, you can see that the letter field values are all from 41 to 4B, or A-K. (At least in this sample.) The digits are hard to read because they're yellow on a similar gray---I need to change the color scheme a little, I guess---but they're within the range of 0-9 or they wouldn't all be yellow.

    The other two columns to the right of that are likely a two-byte number, LSB first, so that the first one varies way more than the second one (the high byte).
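The guessed record layout can be expressed as a `struct` format to slice the data and test the hypothesis. This is only a sketch of the guess made above: the field names are invented, and because the post does not establish whether the 8-byte and 4-byte fields are integers or floats, they are left as raw bytes.

```python
import struct
from typing import NamedTuple

# "<" means little-endian with no padding: 8 + 8 + 1 + 1 + 2 + 4 + 4 = 28 bytes.
RECORD = struct.Struct("<8s8sccH4s4s")


class Record(NamedTuple):
    first8: bytes      # slowly changing, possibly the sort key
    second8: bytes     # more variable than first8
    letter: bytes      # A-K in the sample
    digit: bytes       # ASCII 0-9 in the sample
    small_number: int  # two-byte value, LSB first
    first4: bytes
    second4: bytes


def split_records(data: bytes) -> list[Record]:
    """Cut data into whole 28-byte records, ignoring any trailing partial one."""
    usable = len(data) - len(data) % RECORD.size
    return [Record(*fields) for fields in RECORD.iter_unpack(data[:usable])]
```

Once one of the raw fields is pinned down, its `8s` or `4s` could be swapped for `d`, `q`, `f`, or `i` as appropriate.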

    There's more to glean from this picture---e.g., noticing that when the slowly-changing columns of the first 8-byte number change, they always go up, suggesting that it's monotonically increasing, and likely the primary sort key for the data set.

    Next I'll try examining some machine code.
    Attached thumbnail: [Image: sao_s1_xd60_halved.png]

  2. #2
    Member Bloax's Avatar
    Join Date
    Feb 2013
    Location
    Dreamland
    Posts
    52
    Thanks
    11
    Thanked 2 Times in 2 Posts
    It would be great if you could avoid having a white background if you have combinations like dark blue on dark gray - because the contrast between the white background and the darker colors will make them much harder to read than they otherwise would be.
    Just compare this image on a white background to the same image with a dark background.

    For a concrete example, I can barely make out the "d7"s on this image.

  3. #3
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    Looking at machine code close up

    I was looking at some randomly sampled 5 KB chunks of x86 executables and dll's, trying to figure out how to compress 4KB pages of machine code for compressed virtual memory caching.

    Here's what a fairly typical one looks like in fv, straight, no filter:

    Attachment 2909

    It's not pretty. It mostly looks a lot like typical unformatted text, only worse. It's mostly very short string matches (one or two bytes, black and red) at not very short distances. (BTW, it's normal for the fv picture to be fairly stark and gritty looking when you're looking at small amounts of data.) It looks a lot like a high-entropy low-order ergodic Markov source. Bummer.

    That's not too surprising given that it's a variable-length code, so you get lots of short words, and you also have bits that are likely to differ between otherwise identical instructions, like the least significant bits of literals, and that keeps you from having many longer matches of sequences of instructions. It's like trying to compress text made of short words, where somebody's altered a few bits of many of the words.

    In the picture above, there's some other stuff going on. There are horizontal line segments where you get consecutive or almost-consecutive matches at a fixed distance, like a string being mostly copied from one place to another. Many of these are at shortish match distances, from less than 20 to over a thousand bytes. There's a big long one in this picture, at 80-something, just to the right of the midline of the picture.

    One of the reasons I wanted a colorizing hex dumper is so that I could see mysterious features like that in an fv plot, and quickly figure out what they actually are---are they riffs of code that repeat almost verbatim, or jump tables, or tables of literals and not actually code, or what? If they're literals, are they numeric, or text, and in what format?

    So I ran that sample through atxd to look for strides and such, and with my usual column width of 60, it looked like this:

    [Image: dlls5KB_xd60.png]

    That's how x86 machine code normally looks in atxd. Unlike typical Latin-1 text, there's lots of random variation in how light or dark the color boxes are, because the common values include 00 (white), FF (black), and all sorts of values in between. The lettering colors vary a whole lot too, but with a big fraction of green, because that's the color I chose for the most common half dozen x86 opcodes. You can tell at a glance that it's x86 code, and if there are significant stretches of non-code, you can usually see that too---you have few or no green-lettered bytes.

    You can see a few other things easily in this picture, like frequent runs of a few black- or white-boxed bytes, where (if you look closely) you can see that all but the first byte is FF (in black runs) or 00 (in white runs). Those are small binary values in larger fields, such that the higher bytes are all 00 if they're positive, or all FF if they're negative.
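The white and black runs follow directly from two's-complement sign extension. A small illustration, assuming 32-bit little-endian fields (the field width is an assumption; the post only says "larger fields"):

```python
import struct


def le32_hex(value: int) -> str:
    """Hex bytes of a signed 32-bit value in little-endian order."""
    return struct.pack("<i", value).hex(" ").upper()


print(le32_hex(5))   # 05 00 00 00
print(le32_hex(-3))  # FD FF FF FF
```

In both cases only the first byte carries information; the rest is the run of 00 or FF that shows up as white or black boxes.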

    Unfortunately, in this picture it's not easy to see many obvious strides. Maybe the lines are too long, and short strides that repeat a few times are not being made obvious... So I cut the line length in half by narrowing the browser window so that each line wrapped after 30 bytes and... didn't get much. A quick scroll through didn't show obvious interestingly-patterned behavior.

    So I narrowed it some more so that each line would wrap after 20 bytes, and I thought it looked a bit more interesting, but still hard to see what was going on. So I narrowed it again to 15, and scrolled through it. Then I saw stridy patterns in a bunch of places. Here's one:

    [Image: dlls5KB_xd60_quartered_ex1.png]

    If you look at the middle of the picture, to the left of the scroll bar handle, you can see where a few lines line up at the picture stride of 15... so there's at least one stride 15 pattern here. It actually starts with the 3 bytes at the end of the row just to the left of the top of the scrollbar handle, (6A 05 6. Following that pattern (at the beginning of each of the next three lines) is what may be three literals with the similar-but-not-identical values 00 00 02 74, 00 00 02 75, and 00 00 02 76, but with the bytes reversed because it's little-endian, or 628, 629, and 630 in decimal. Those one-or-two-bit differences are what would keep an LZ-style compressor from matching this as three repeats of a 15-byte string.
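The decimal values and the "one-or-two-bit" claim are easy to verify. A quick check, assuming the literals are 4-byte little-endian fields that appear in the dump as 74 02 00 00 and so on (helper names are made up for this example):

```python
def le_value(hex_bytes: str) -> int:
    """Interpret space-separated hex bytes, in dump order, as a little-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


values = [le_value(h) for h in ("74 02 00 00", "75 02 00 00", "76 02 00 00")]
print(values)                                 # [628, 629, 630]
print(differing_bits(values[0], values[1]))   # 1
print(differing_bits(values[1], values[2]))   # 2
```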

    Not yet being fluent in x86 machine code in hex, I can't say for sure that this is repeating code, vs. some kind of repeating record structure of literals. (One complication is that FF is both a common literal byte value and a very common opcode. I don't display it as green, like some less-common opcodes, but maybe I should.)

    A line or two below that, starting a line below the bottom of the scrollbar handle, is a checkerboarded pattern, with alternating white boxes with black characters and lightish gray boxes with red characters. That's what a 16-bit Unicode string of Latin-1 characters looks like in atxd---alternating white-with-black and gray-with-red characters, which make a checkerboard rather than odd/even columns because our picture stride is odd. (The string says "EnableManualMBRControl" if you cut and paste it into something that can translate hexdumpese to Latin-1 letters.)
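The checkerboard can be reproduced by encoding the same string as UTF-16 little-endian, which is presumably what is in the file, and looking at which columns hold letters on each row. The helper below is hypothetical and assumes, for simplicity, that the string starts at column 0 of a row:

```python
text = "EnableManualMBRControl"
encoded = text.encode("utf-16-le")  # each letter byte is followed by a 00 byte


def letter_columns(data: bytes, width: int, row: int) -> list[int]:
    """Columns in the given row that hold a nonzero (letter) byte."""
    start = row * width
    end = min(start + width, len(data))
    return [i - start for i in range(start, end) if data[i] != 0]


print(encoded[:4].hex(" ").upper())    # 45 00 6E 00
print(letter_columns(encoded, 15, 0))  # even columns
print(letter_columns(encoded, 15, 1))  # odd columns
```

With an odd width the letters swap between even and odd columns on alternate rows, giving the checkerboard; with an even width they would stay in fixed columns.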

    Scrolling slowly through the stride-15 hexdump, I found another place with some clear strides:

    [Image: dlls5KB_xd60_quartered_ex2.png]

    If you look near the top of the picture, to the left of the top of the scrollbar handle, you can see where one stride ends and another starts. Going up there are three occurrences of a pattern that shifts left one byte as you go up. (See the black boxes of FF's in a diagonal going one byte right for every row down.) That's a stride of 1 more than 15, so 16.

    After that, there's another stride, this one of 13, where you can watch the black FF's march left 2 bytes per row, then wrap around to the right and keep doing it. That 13-byte pattern repeats about 9 times. (It's easier to see if you look at the white boxes with black 08's in them, since they just march from the far right down and to the left.)

    Now that we can see that there are both odd and even strides in at least the range of 13 to 16, we could re-do the pictures at those strides to make things clearer, but I won't for now.

    What we haven't done is identify what's going on with the big stride of 80-something, which goes on the longest, just to the right of the midline of the picture.

    It's hard to judge the exact match distance of that horizontal line from the log scale of the fv plot, but I managed to guess right---it's 84. (If I'd been a little off, say 82 or 86, I'm sure the pattern would have been very obvious anyhow, and obvious how it was wrong, so it'd only take one more try to get it right.)
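Instead of guessing from the log scale, a handful of candidate distances can be scored by counting how many bytes equal the byte that many positions earlier. This is a brute-force sketch with made-up function names, not something fv or atxd is known to do:

```python
def match_count(data: bytes, distance: int) -> int:
    """How many bytes equal the byte `distance` positions before them."""
    return sum(1 for i in range(distance, len(data)) if data[i] == data[i - distance])


def best_distance(data: bytes, candidates) -> int:
    """Candidate distance with the most matches (first one wins on ties)."""
    return max(candidates, key=lambda d: match_count(data, d))


# Synthetic check: an 84-byte pattern repeated five times.
sample = bytes(range(84)) * 5
print(best_distance(sample, range(80, 90)))  # 84
```

On real data the winning count would be well short of perfect, since a few columns vary between repetitions.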

    [Image: dlls5KB_xd84.png]

    From the picture we can see that there's an 84-byte pattern that repeats almost literally about 5 times, and then for some reason shifts right one byte in mid-line and repeats about two more times, because there's an extra FF there. Looking closely, we can see that most columns have the same exact value repeated every time, but a few columns don't, which would break up very long string matching. So the 84-byte string pattern varies a little in a few places, and the stride varies by just one byte in one place, undermining absolutely rigid stride matching.

    Why? I don't know yet... I need to cut and paste into a disassembler, I guess.

    Comments?

  4. #4
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote, originally posted by Bloax:
    It would be great if you could avoid having a white background if you have combinations like dark blue on dark gray - because the contrast between the white background and the darker colors will make them much harder to read than they otherwise would be.
    Just compare this image on a white background to the same image with a dark background.
    Yes. (And thanks for the feedback.) I need to at least make some fine adjustments of at least the line and character spacings, and maybe the overall background color, as well as the exact default color mappings for foreground and background. This is version 0.001a, which I just got to work and have been playing with for a day or so.

    I kind of like having the boxes spaced apart a little bit, but maybe a closer spacing and/or a medium-gray background would help. Presumably that can be done. (I know only the most rudimentary html tags---nothing fancier than basic headings, boldface and italic, and big and small tags---and had to look up how to do colored letters and boxes.)

    Eventually I want to have a choice of color maps, e.g., once you know something doesn't contain x86 machine code, you don't need those distracting green letters meant to make x86 code visually obvious. Likewise for red letters for ascii letters/numbers. Once you know you're not dealing with x86 or Latin-1 text, you can use the foreground coloring to convey much more useful numerical information (or whatever).

  5. #5
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Ugh, the first plot include seems not to have worked, though it showed up fine in the editor, and I'm a bit afraid to edit the post... usually that breaks all the other includes and I have to redo everything and hope it works.

    Here is the fv plot again for reference; maybe it will work this time.

    [Image: dlls5KB_fv.png]

  6. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    atxd source

    (EDITED)

    Version 0.001 alpha of atxd is attached to this post. (Assuming it worked.)
    Attached Files
    Last edited by Paul W.; 14th May 2014 at 14:28. Reason: fix attachment

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Try zipping it.

  8. #8
    Member
    Join Date
    Jun 2013
    Location
    USA
    Posts
    98
    Thanks
    4
    Thanked 14 Times in 12 Posts
    or use pastebin.

  9. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Fixed, thanks.

  10. #10
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    0.002 alpha source

    Version 0.002 alpha is attached (assuming it works). The only interesting difference is that the overall background is gray rather than white, which makes it easier to read dark(ish) colors in dark(ish) gray boxes. (Hat tip to Bloax for feedback.)
    Attached Files
