I have converted the neural network implementation, which was written in assembly, to C/C++.
The advantage is that the compiler can now inline the implementation, which results in faster code.
Below is part of the disassembled C/C++ version; compare it with the original
assembler version: they are almost identical.
000002df <__ZL11dot_productPKsS0_i>:
2df: 0f 57 c9 xorps %xmm1,%xmm1
2e2: eb 15 jmp 2f9 <__ZL11dot_productPKsS0_i+0x1a>
2e4: 0f 28 04 48 movaps (%eax,%ecx,2),%xmm0
2e8: 66 0f f5 04 4a pmaddwd (%edx,%ecx,2),%xmm0
2ed: 66 0f 72 e0 08 psrad $0x8,%xmm0
2f2: 66 0f fe c1 paddd %xmm1,%xmm0
2f6: 0f 28 c8 movaps %xmm0,%xmm1
2f9: 83 e9 08 sub $0x8,%ecx
2fc: 79 e6 jns 2e4 <__ZL11dot_productPKsS0_i+0x5>
2fe: 0f 28 c1 movaps %xmm1,%xmm0
301: 66 0f 73 d8 08 psrldq $0x8,%xmm0
306: 66 0f fe c8 paddd %xmm0,%xmm1
30a: 0f 28 c1 movaps %xmm1,%xmm0
30d: 66 0f 73 d8 04 psrldq $0x4,%xmm0
312: 66 0f fe c8 paddd %xmm0,%xmm1
316: 66 0f 7e c8 movd %xmm1,%eax
31a: c3 ret
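For reference, the C source behind the first listing is roughly the following (a sketch reconstructed from the disassembly, not the exact code; the name 'dot_product' comes from the mangled symbol). pmaddwd multiplies pairs of 16-bit values and adds adjacent products, and psrad $0x8 is the >>8; the SIMD loop handles 8 shorts per iteration, while the scalar sketch handles one pair at a time:

```cpp
#include <cassert>

// Scalar sketch of the vectorized dot product above (assumption:
// reconstructed from the disassembly). n should be a multiple of 2
// (the SIMD version requires a multiple of 8 and 16-byte alignment).
static int dot_product(const short* const t, const short* const w, int n) {
  int sum = 0;
  while ((n -= 2) >= 0)
    sum += (t[n] * w[n] + t[n + 1] * w[n + 1]) >> 8;  // pmaddwd + psrad $8
  return sum;
}
```

Note that the index counts downward, matching the 'sub $0x8,%ecx / jns' loop in the disassembly.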
00000358 <__ZL5trainPKsPsii.part.1>:
358: 66 0f 6e 44 24 04 movd 0x4(%esp),%xmm0
35e: 66 0f 61 c0 punpcklwd %xmm0,%xmm0
362: 66 0f 70 c8 00 pshufd $0x0,%xmm0,%xmm1
367: 0f 28 15 10 0d 00 00 movaps 0xd10,%xmm2
36e: eb 1e jmp 38e <__ZL5trainPKsPsii.part.1+0x36>
370: 0f 28 04 48 movaps (%eax,%ecx,2),%xmm0
374: 66 0f ed c0 paddsw %xmm0,%xmm0
378: 66 0f e5 c1 pmulhw %xmm1,%xmm0
37c: 66 0f ed c2 paddsw %xmm2,%xmm0
380: 66 0f 71 e0 01 psraw $0x1,%xmm0
385: 66 0f ed 04 4a paddsw (%edx,%ecx,2),%xmm0
38a: 0f 29 04 4a movaps %xmm0,(%edx,%ecx,2)
38e: 83 e9 08 sub $0x8,%ecx
391: 79 dd jns 370 <__ZL5trainPKsPsii.part.1+0x18>
393: c3 ret
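Similarly, the second listing corresponds roughly to the following weight update (again a sketch reconstructed from the disassembly; 'err' is the error value broadcast from the first stack argument, and the constant loaded from 0xd10 is assumed to be a vector of ones used for rounding):

```cpp
#include <cassert>

// Scalar sketch of the vectorized train step above (assumption:
// reconstructed from the disassembly; the saturation performed by the
// paddsw instructions is omitted for brevity).
static void train(const short* const t, short* const w, int n, const int err) {
  while (--n >= 0)
    // paddsw x,x doubles t[n]; pmulhw keeps the high 16 bits of the
    // product (the >>16); the "+1 >> 1" is the rounding step done by
    // the paddsw-with-ones / psraw $1 pair.
    w[n] = static_cast<short>(w[n] + (((t[n] * 2 * err >> 16) + 1) >> 1));
}
```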
I have embedded the new NN implementation into 'paq8hp12any' and 'paq8pxd_v6'.
This leaves a few questions:
1) In the DMC model, is the handling of 'top' correct?
The array is allocated as 'static Array<DMCNode> t(MEM*2)', while the handling is: 'if (top==MEM*2) threshold=512; if (top==MEM*3) threshold=768;'. With only MEM*2 nodes allocated, the second condition seems unreachable.
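To make the question concrete (a sketch with a tiny placeholder MEM, not the actual paq8pxd source): if 't' holds only MEM*2 nodes and 'top' counts nodes allocated in 't', the second comparison appears to be dead code:

```cpp
#include <cassert>

// Sketch of the suspect pattern (assumption: simplified, with a small
// placeholder table size instead of the real MEM).
const int MEM = 4;
int threshold = 256;

void update_threshold(int top) {
  if (top == MEM * 2) threshold = 512;  // reachable: t has MEM*2 slots
  if (top == MEM * 3) threshold = 768;  // never taken while top <= MEM*2
}
```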
2) Is the use of static dictionaries fair?
'paq8hp12any' uses a static dictionary while 'paq8pxd' uses a dynamic one,
but as far as I can tell the benchmark does not account for this difference. Or am I mistaken?
3) While resolving a few compiler warnings in both implementations, I became a
little frustrated with the horrible implementation (excuse my French) of
'textfilter.hpp' and 'wrtpre.cpp'. Both are difficult to read and contain a large
collection of (tiny) mistakes. For example, the 'bounds' are calculated incorrectly
before sorting and encoding.
I decided to write a completely new implementation that preserves the main idea
of the XWRT algorithm. I tried to implement the algorithm in as little code as
possible, in a straightforward way.
Could someone comment on this implementation?
Are there improvements to the XWRT algorithm that I have missed?
Kind regards,
Marwijn