Monday, July 22, 2013

Gas Problems

I have been having some gas problems lately.

No, not this kind of gas, though that would actually be a lot funnier.  Maybe I need to back up a bit.

I have been playing around with an idea for a new project that would involve a little assembly language.  I did x86 assembly "back in the day" when I contributed some code to Retrocade, an arcade game emulator that was the fastest thing going at the time.
Too Bad It Died
Back then I used NASM, an x86 assembler that is still very popular today.  But I've also toyed around with the idea of getting myself a Raspberry Pi and maybe doing a bit of assembly on it.  NASM is x86 only, so if I wanted to do anything on the Pi, I'd have to learn another assembler while learning another assembly language. That would kind of suck.

This brought me to think about FASM. FASM has a variant called FASMARM that can assemble ARM mnemonics. However, FASMARM doesn't actually run on ARM itself. It can only generate code on a PC that then would have to be transferred to the Pi. That would kind of suck too.

Anybody who does any assembly is aware of the GNU Assembler, gas (often written as, well, "as"). gas definitely sucks. It is the only commonly used assembler that uses AT&T syntax. In a nutshell, AT&T syntax is:
a) backwards
b) full of superfluous % signs
This seminal article on IBM Developerworks provides all the gory details in comparing NASM to gas.  Or forget about beating around the bush and read this quote from the 646-page tome Assembly Language Step By Step:
 The GNU assembler gas uses a peculiar syntax that is utterly unlike that of all the other familiar assemblers used in the x86 world, including NASM. It has a whole set of instruction mnemonics unique to itself. I find them ugly, nonintuitive, and hard to read. This is the AT&T syntax, so named because it was created by AT&T as a portable assembly notation to make Unix easier to port from one underlying CPU to another. It’s ugly in part because it was designed to be generic, and it can be recast for any reasonable CPU architecture that might appear.
Suffice it to say, programmers generally avoid gas the way Amy Winehouse avoided rehab.
Should Have Gone Into Rehab

However (and there is always a however), the GNU gods saw fit to introduce a compatibility mode to gas that allows the use of Intel-style syntax, the type used by programs such as NASM and FASM. Sounds good. I decided to give it a shot.  I would do so by taking the NASM examples in the Developerworks article and seeing if I could get them to work in gas using Intel-style syntax.  The binutils documentation led me to believe that this should be a pretty straightforward process.  I was wrong.  There was some headbanging on the way and I nearly gave up at one point, but now I have everything working well.  This is what I learned along the way, presented as gists on GitHub that you can dive into and fork as you please.

Here is Listing 1 from the Developerworks article in Intel style syntax and assembled using gas.  This luckily worked right off the bat. The key thing to note in this example is
.intel_syntax noprefix
That is the magic incantation that makes it all happen. This key tidbit from Stackoverflow is what got me going down this road in the first place.  This directive tells gas to use Intel syntax and not to require the % prefix before register names.  Put that in there and, apart from the directives specific to gas, the code itself looks quite a bit like NASM.
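For a feel of what that looks like, here is a minimal sketch of my own along the lines of a bare "just exit" program. It assumes 32-bit Linux system calls through int 0x80; the file name and the build commands in the comments are assumptions for a 64-bit Linux host, not something taken from the article.

```asm
# exit.s: minimal 32-bit Linux program in Intel-style syntax (illustrative sketch)
# Assumed build on a 64-bit host:
#   as --32 -o exit.o exit.s
#   ld -m elf_i386 -o exit exit.o
.intel_syntax noprefix

.section .text
.globl _start
_start:
    mov eax, 1          # system call number for exit (AT&T form: movl $1, %eax)
    mov ebx, 2          # exit status handed back to the shell
    int 0x80            # trap into the kernel
```

The destination comes first and the registers carry no % prefix, which is the whole point of the directive.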

Then I got to Listing 2 and cruised through it too, mostly because it didn't introduce much in the way of new concepts.  This looked like it was going to be easy.
Next up was Listing 3. Listing 3 was a son of a bitch. This code introduced macros and some program strings and constants tucked away in the .data segment (I changed this to the read only data segment - .rodata - because this data was, well, read only).

The problem I had with Listing 3 was that I could not get the strings in .rodata to print to save my life.  Running the code in the debugger showed that the string lengths were being read properly.  The problem was getting the addresses of the strings that were then sent off to the write macro.  My read macro wasn't working right either.  I banged my head against a brick wall for an entire evening before I found this on the net the next morning.  The key text is in the code but I want to put it down again below to drive the point home.
The somewhat mysterious OFFSET FLAT: incantation tells the assembler to figure out the (4-byte) address where the variable x will end up when the program is loaded. Even the assembler does not have all the information to figure this out, since a program may be in several pieces and the assembler does not know where each piece will go. It is up to the loader to figure this out, so in fact all the assembler does with the OFFSET FLAT: reference is to make a note in the object file and it is the loader that will finally fill in the right value in the generated instruction. This is one of the respects in which object code (which ends up in a .o file after assembly) is not pure machine code.
Well, tie me to the side of a pig and roll me in the mud.  This and similar code
mov ecx, \str
as it was in the Developerworks article would refuse to work, and no amount of BYTE PTR, brackets, dollar signs, or percent signs brought me any joy.  But this code worked just fine
mov ecx, OFFSET FLAT:\str
You'll see that what I actually ended up using here was
lea ecx, \str
just because this version is clearer (I'm trying to load the effective address of that string), I don't need "OFFSET FLAT:", and lea is generally a more powerful instruction for loading an address that I could end up using later.  I would need "OFFSET FLAT:" for a mov, push or any other instruction that wants the address of a memory location in .data or .rodata rather than its contents.  Note that I don't need it to access the constants for the string lengths in those sections.  That is because those lengths are plain numbers that the assembler can work out on its own at assembly time.  The code below works perfectly.
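To make the difference between the three forms concrete, here is a small sketch of my own. The labels msg and MSG_SIZE are made up for illustration and are not from the Developerworks listing, and the build commands are assumptions for a 64-bit Linux host.

```asm
# addr.s: three ways to mention a label in gas Intel-style syntax (sketch)
# Assumed build: as --32 -o addr.o addr.s && ld -m elf_i386 -o addr addr.o
.intel_syntax noprefix

.section .rodata
msg: .ascii "Hello\n"
.set MSG_SIZE, . - msg          # length, known at assembly time

.section .text
.globl _start
_start:
    mov eax, 4                  # system call number for write
    mov ebx, 1                  # file descriptor 1 is stdout

    # A bare label is treated as a memory operand, so the commented-out line
    # would load the first four bytes of the string, not its address:
    # mov ecx, msg

    mov ecx, OFFSET FLAT:msg    # address of msg, filled in at link time
    lea ecx, msg                # same address again; either line alone would do

    mov edx, MSG_SIZE           # plain constant, no OFFSET FLAT: needed
    int 0x80

    mov eax, 1                  # system call number for exit
    xor ebx, ebx                # exit status 0
    int 0x80
```

The bare-label form is the one that bit me: it assembles without complaint and then hands write a few bytes of string data instead of a pointer.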
There are a couple of things to note in Listing 3 above.  First, the gas syntax for macros is still used.  Variable substitution is made with backslashed references to the macro parameters.  Just because we are using Intel-style syntax doesn't change that part of it.  No problem.  The macro syntax in gas isn't arcane like the AT&T syntax itself.
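As a rough illustration of how the two syntaxes mix, here is a sketch of a write macro in that style. It is my own reduced version, not the full listing; the names write, greet_str and GSTR_SIZE follow the article's naming, and the build commands are assumptions.

```asm
# macro.s: gas macro syntax combined with Intel-style operands (sketch)
# Assumed build: as --32 -o macro.o macro.s && ld -m elf_i386 -o macro macro.o
.intel_syntax noprefix

# write: print str_size bytes starting at label str to stdout
.macro write str, str_size
    mov eax, 4              # system call number for write
    mov ebx, 1              # file descriptor 1 is stdout
    lea ecx, \str           # address of the string
    mov edx, \str_size      # number of bytes to print
    int 0x80
.endm

.section .rodata
greet_str: .ascii "Hello\n"
.set GSTR_SIZE, . - greet_str

.section .text
.globl _start
_start:
    write greet_str, GSTR_SIZE
    mov eax, 1              # system call number for exit
    xor ebx, ebx            # exit status 0
    int 0x80
```

The .macro, .endm and backslash parts are pure gas; only the instructions inside the body change flavour.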

The second thing to notice is how I have coded the string length.  The Developerworks example says I should use something like this
greet_str: .ascii "Hello " 
gstr_end: .set GSTR_SIZE, gstr_end - greet_str 
when in fact I have used this (thanks again, Stackoverflow).
greet_str: .ascii "Hello "
.set GSTR_SIZE, . - greet_str 
Either the Developerworks article was written before the "." was introduced to keep track of the current location or the author wasn't aware of it.  This part of the code becomes cleaner and more NASM-like since the extra label "gstr_end" is not required.
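One thing to keep in mind with this trick, sketched below with a second made-up string (name_str and NSTR_SIZE are my own names): since "." is simply the current location, each .set has to come directly after the data it measures, before anything else is emitted.

```asm
# sizes.s: measuring strings with the location counter (data-only sketch)
.intel_syntax noprefix

.section .rodata
greet_str: .ascii "Hello "
.set GSTR_SIZE, . - greet_str   # 6: taken right after the string ends

name_str: .ascii "world\n"
.set NSTR_SIZE, . - name_str    # 6: counts only the bytes of name_str

# Had GSTR_SIZE been set down here instead, it would have come out as 12,
# because "." would already have moved past name_str as well.
```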

On to Listing 4 and we're on a roll.  You'll see the liberal scattering of "OFFSET FLAT:" because of all the non-lea instruction references to the strings in the .data section.  There are a couple of things a little different in this one though.  First (shown in the code and mentioned in the Developerworks article), the parameters passed to the linker are changed to link the external C standard library.  Second is the use of "BYTE PTR" where NASM got away with just using "byte".  I tried a search and replace of "BYTE PTR" to "byte" and the program stopped working properly, though gas did not complain.  I could have experimented a bit more with this to see what was going on here but didn't bother.  My initial read of the gas docs suggested to me that "byte" should have done the trick.
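For reference, this is the kind of place where "BYTE PTR" is needed: an instruction with a memory operand and an immediate, where no register is around to tell the assembler the operand size. The sketch below is my own (the labels sample, scan and finished are made up) and simply counts the characters in a NUL-terminated string and returns the count as the exit status.

```asm
# strlen.s: BYTE PTR on a memory operand in gas Intel-style syntax (sketch)
# Assumed build: as --32 -o strlen.o strlen.s && ld -m elf_i386 -o strlen strlen.o
.intel_syntax noprefix

.section .rodata
sample: .asciz "gas"                # three characters plus a terminating NUL

.section .text
.globl _start
_start:
    lea esi, sample                 # address of the string
    xor ebx, ebx                    # ebx counts characters
scan:
    cmp BYTE PTR [esi + ebx], 0     # compare one byte, not a dword, with 0
    je finished
    inc ebx
    jmp scan
finished:
    mov eax, 1                      # system call number for exit
    int 0x80                        # exit status is the length in ebx
```

Running it and then checking the shell's exit status with echo $? should show the string length.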
Finally, we get to Listing 5.  This was smooth sailing.  Again there is liberal use of "OFFSET FLAT:" to reference memory locations in the .data section.  There is also the introduction of the .rept directive to simplify the allocation of ten memory locations, but this translated over straight from the gas code in the Developerworks article without a hitch.
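A minimal sketch of the .rept idea, again my own rather than the article's listing (the labels array and fill are made up): it reserves ten dwords, fills them with 0 through 9, and exits with the last element as the status.

```asm
# rept.s: allocating ten dwords with .rept (sketch)
# Assumed build: as --32 -o rept.o rept.s && ld -m elf_i386 -o rept rept.o
.intel_syntax noprefix

.section .data
array:
    .rept 10                            # repeat the enclosed lines ten times
    .long 0                             # one zeroed 4-byte slot per repetition
    .endr

.section .text
.globl _start
_start:
    mov edi, OFFSET FLAT:array          # address of the first element
    xor ecx, ecx                        # index, starting at 0
fill:
    mov DWORD PTR [edi + ecx*4], ecx    # array[i] = i
    inc ecx
    cmp ecx, 10
    jl fill

    mov ebx, DWORD PTR [edi + 9*4]      # last element becomes the exit status
    mov eax, 1                          # system call number for exit
    int 0x80
```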
And that about wraps it up. Knowing a few rather arcane tricks makes gas much easier to work with.  gas is also everywhere there is a gcc compiler, so you can use it pretty much anywhere on everything.  All you need to know is a few tricks to make it more usable, and now you know those tricks.  Give it a shot.

2 comments:

  1. How does the nasm address es:[bx + 0x1a] transfer to gas?
    Ive done some looking around and the data inside the location [bx + 0x1a] would transfer as %bx(0x1a,1) right?

  2. Great post. About the "OFFSET FLAT:", GAS accepts also just "offset" instead, for example:

    push offset usort_str
