
Anyone used x86 assembly in .NET?


    #11
    Is that the 1.1 or 2.0 runtime? I think they use a fixed register assignment (a C-style calling convention) for .NET, which is far from optimal.

    I'm actually writing something in Win32Forth at the moment, which is interesting - it compiles natively to x86 and lets you intermingle assembly and Forth inline. Super-efficient!
    Serving religion with the contempt it deserves...

    Comment


      #12
      It's VS2005, so it's .NET 2, and worse still - that particular code is actually a disassembly of the same code ported to Visual C++. I was expecting more from the C++ compiler; heck, I intentionally put some stuff into local vars to hint to the compiler that they can't be accessed from elsewhere, so it's safe to put them into registers. Now I'm working out the register allocation manually and am going to put it into _asm blocks.

      The most annoying thing is that as soon as I add an _asm statement the damn thing prevents me from debugging that piece of code; I can't step into it, ffs.

      Comment


        #13
        Forth?

        I know little about .NET, but is it really possible to write x86 assembly - other than by interfacing with some external module? I understand you can write assembly for its version of bytecode (IL), but surely not x86?

        And as for optimisation, you realise that these days storing everything in registers to avoid memory reads/writes doesn't give you the gains you might expect? Cache reads/writes are nearly as fast as registers, and the "registers" you see in x86 assembly aren't really even registers anyway. Local variables live on the stack, which almost always ends up in the cache, so to all intents and purposes local variables are just as good as registers.
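
        The claim above can be put to a crude test. Below is a hypothetical sketch (function names and buffer sizes are made up): one routine forces its accumulator through a stack slot on every iteration via volatile, the other leaves the optimiser free to keep it in a register. Timings vary heavily by CPU, compiler, and flags, so this only illustrates how one might measure the gap being debated, not what the answer is.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Accumulator forced through memory: volatile makes the compiler
// read and write the stack slot on every iteration.
std::uint64_t sum_via_memory(const std::vector<std::uint32_t>& buf) {
    volatile std::uint64_t acc = 0;
    for (std::uint32_t v : buf) acc = acc + v;
    return acc;
}

// Plain local: the optimiser is free to keep this in a register.
std::uint64_t sum_via_register(const std::vector<std::uint32_t>& buf) {
    std::uint64_t acc = 0;
    for (std::uint32_t v : buf) acc += v;
    return acc;
}

// Wall-clock nanoseconds for one call of f(buf); crude but illustrative.
template <typename F>
long long time_ns(F f, const std::vector<std::uint32_t>& buf) {
    auto t0 = std::chrono::steady_clock::now();
    (void)f(buf);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
        .count();
}
```

        Timing both on a large buffer (compiled with optimisations on) shows whether the memory round-trip actually costs anything on a given CPU; on modern cores the gap is often smaller than opcode tables suggest.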

        I doubt you'll get a 500% increase in anything without a radically different algorithm. I spent a lot of time optimising graphics routines in assembler, and the main reason was to use MMX/SSE/3DNow! instructions to do two or more operations simultaneously, which obviously makes quite a difference. Otherwise there's not that much difference from C++ code (with optimisations turned off).
        Will work inside IR35. Or for food.

        Comment


          #14
          LOL, just look at the disassembly of the code generated by Visual C++ (from VS 2005):


          // unsigned __int16 usValue=*(unsigned __int16*)pBuf;
          000000ce 0F B7 07 movzx eax,word ptr [edi]
          000000d1 89 45 DC mov dword ptr [ebp-24h],eax

          // if((usValue & 0x8080)==0)
          000000d4 F7 45 DC 80 80 00 00 test dword ptr [ebp-24h],8080h
          000000db 0F 85 88 00 00 00 jne 00000169

          So I want to read a word and then, if the mask test fails, jump elsewhere. The generated code loads the data from the pointer into EAX, then updates the bloody var on the stack, and then (and it's real) tests not the value already sitting in EAX (!) but its copy in memory. Just how crap has code generation become these days, ffs!

          It's a local variable, btw - absolutely no need to test it from memory after just having loaded it into EAX, let alone to update memory first, ffs!

          The value was read into a local var because it's manipulated further down the line; by making it a local var I was hinting to the compiler that it's safe to keep it in a register.
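
          For context, here is a plausible reconstruction of the source being discussed - the full loop isn't shown in the post, so the buffer layout, the function name, and what happens on the mask-hit path are all assumptions. The mask test (usValue & 0x8080) == 0 checks two bytes at once: it is zero only when neither byte of the 16-bit word has its high bit set.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction of the loop around the disassembled
// fragment. The original uses MSVC's unsigned __int16 and an unaligned
// load (*(unsigned __int16*)pBuf); the bytes are assembled explicitly
// here so the sketch stays portable.
std::size_t count_ascii_only_words(const unsigned char* pBuf,
                                   std::size_t words) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < words; ++i) {
        // unsigned __int16 usValue = *(unsigned __int16*)pBuf;
        std::uint16_t usValue = static_cast<std::uint16_t>(
            pBuf[2 * i] | (pBuf[2 * i + 1] << 8));
        // if ((usValue & 0x8080) == 0) - neither byte has bit 7 set
        if ((usValue & 0x8080) == 0)
            ++hits;  // fast path: both bytes are 7-bit values
        // else: the disassembly's jne jumps elsewhere to handle this word
    }
    return hits;
}
```

          With this shape, usValue is dead outside the loop body, which is why there is no reason for the compiler to spill it to [ebp-24h] before the test.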

          --------------

          Vectra: it's possible to call unmanaged code written in C++ from C# (or VB.NET for that matter). I write the _asm in a C++ module, so it all works okay apart from the strange refusal by VS 2005 to step over that function with ASM in it - IMO a weird bug.
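
          The bridge being described looks roughly like the sketch below - the function name, DLL name, and signature are invented for illustration. The body is plain C++ so it compiles anywhere; on MSVC x86 it is this body that could be replaced or augmented with an _asm block.

```cpp
#include <cstdint>

// Export machinery: __declspec(dllexport) on Windows, plain extern "C"
// elsewhere, so the managed side can find the symbol by name.
#if defined(_WIN32)
#define EXPORT extern "C" __declspec(dllexport)
#else
#define EXPORT extern "C"
#endif

// Hypothetical unmanaged routine: counts bytes with the high bit set.
// On MSVC x86 the loop could be hand-written with _asm instead.
EXPORT std::int32_t ScanBuffer(const unsigned char* pBuf,
                               std::int32_t len) {
    std::int32_t highBitBytes = 0;
    for (std::int32_t i = 0; i < len; ++i)
        if (pBuf[i] & 0x80)
            ++highBitBytes;
    return highBitBytes;
}

// The C# caller would declare roughly:
//   [DllImport("NativeScan.dll",
//              CallingConvention = CallingConvention.Cdecl)]
//   static extern int ScanBuffer(byte[] buf, int len);
```

          One thing worth checking for the debugging problem: stepping from managed into native code in VS 2005 requires "Enable unmanaged code debugging" in the C# project's Debug settings, otherwise the debugger treats the whole native call as opaque.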

          I expect a very hefty speed-up from changing basically 10 lines of code into hand-optimised asm that uses registers effectively. While it may be true that caching of local stack vars is good, memory accesses are still a performance killer; given register renaming it's much easier to feed more than one instruction into the CPU, as x86 has been superscalar ever since the original Pentium.

          The cache can be thrashed because I'm already reading memory in the loop (sequentially, though). Will keep you posted on the performance improvement; it has got to be 3-5 times better.

          Naturally I have already changed the algorithms to maximise gains from things other than resorting to assembly.
          Last edited by AtW; 4 August 2006, 21:01.

          Comment


            #15
            Originally posted by AtW
            // unsigned __int16 usValue=*(unsigned __int16*)pBuf;
            000000ce 0F B7 07 movzx eax,word ptr [edi]
            000000d1 89 45 DC mov dword ptr [ebp-24h],eax

            // if((usValue & 0x8080)==0)
            000000d4 F7 45 DC 80 80 00 00 test dword ptr [ebp-24h],8080h
            000000db 0F 85 88 00 00 00 jne 00000169
            But I think you'll find the read and write to [ebp-24h] will make no difference; in fact it'll probably never really happen, as the processor is clever enough to work out that [edi], EAX, and [ebp-24h] all hold the same value and treat them as one chunk of data internally. Crazy but true. At worst you're wasting memory in the code page with instructions you don't need.

            Despite what people may think of Microsoft, I don't think they would have put all the effort they have into .NET and had it produce slow code that could easily be optimised.

            Having said that, the main thing I learnt from doing assembler optimisations is to make no assumptions and test everything. Sometimes your brilliant scheme to speed things up makes things slightly worse, and sometimes simple things that you don't think will make a difference gain you 20%.
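
            The "test everything" advice is easy to act on. Below is a minimal sketch of a best-of-N timing harness (the name best_of_ns is made up): repeating the candidate routine and keeping the minimum time filters out one-off scheduler noise, which otherwise easily swamps a 10-20% effect.

```cpp
#include <chrono>

// Run f() `runs` times and return the best (minimum) wall-clock time in
// nanoseconds. The minimum is a better estimate of the routine's true
// cost than the mean, since interruptions only ever add time.
template <typename F>
long long best_of_ns(int runs, F f) {
    long long best = -1;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
                .count();
        if (best < 0 || ns < best) best = ns;
    }
    return best;
}
```

            Comparing best_of_ns for the original and the hand-optimised loop on identical input is the only way to know whether a "brilliant scheme" actually paid off.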

            I think I remember reading something about not being able to debug into unmanaged code.
            Will work inside IR35. Or for food.

            Comment


              #16
              Unmanaged code in C++ gets debugged fine, but as soon as I use the _asm directive in a function it won't get debugged; I think it might be a special case.

              Memory writes (even to the local stack) ain't free - the long instructions take time to decode, and even though the data is cached, bus bandwidth still has to be used to read/write it. In my case I have rewritten the loop to only read from the memory buffer (can't avoid that) and do the rest in registers; given pipelining this should be much faster.

              I will report tomorrow the speed improvement I get in the rewritten loop. I will be deeply disappointed if it's less than 500%.

              Comment


                #17
                Alexei, I bet you don't reach more than 450%......
                Vieze Oude Man

                Comment


                  #18
                  How much are you proposing to bet?

                  Comment


                    #19
                    Originally posted by AtW
                    Memory writes (even to the local stack) ain't free - the long instructions take time to decode, and even though the data is cached, bus bandwidth still has to be used to read/write it. In my case I have rewritten the loop to only read from the memory buffer (can't avoid that) and do the rest in registers; given pipelining this should be much faster.
                    Well, if it's a tight loop executed many times, the instructions will only be decoded once; and although memory reads/writes aren't free, using memory as a temporary store is no more expensive than using registers as a temporary store if that memory stays in the cache (it goes nowhere near the bus).

                    Say you're reading a large amount of data and processing it: unless the calculation is very complicated, the bottleneck will be reading that data, and although you can make small gains with aligned reads etc., all the clever coding in the world won't make up for the relative slowness of the memory read.

                    But I don't know exactly what you're doing. It'll be interesting to see what you get. I'd bet you wouldn't get more than 10% assuming the same algorithm, and if your only approach is to remove the pointless read/writes to local memory I think you'll get 0% improvement. But I'm more than happy to be proved wrong.
                    Will work inside IR35. Or for food.

                    Comment


                      #20
                      Originally posted by VectraMan
                      I'd bet you wouldn't get more than 10% assuming the same algorithm, and if your only approach is to remove the pointless read/writes to local memory I think you'll get 0% improvement. But I'm more than happy to be proved wrong.
                      The algo is the same, but I make full use of registers with no more unnecessary memory reads/writes, plus the instructions are rearranged to assist pipelining. I was just looking at a table of clock costs for different opcodes, and a mov from/to memory costs 3 times more than a mov to a register. I don't buy that current-generation x86 processors convert that constantly accessed memory variable to a register behind the scenes - if that were the case, the code should already be running much faster than it is now. I will post numbers tomorrow, but right now on paper it should run at least 3 times faster.

                      Comment
