FastStringBuilder, a drop-in replacement for SysUtils.StringBuilder

I’m working on a pre-compiler for Delphi and as a result of that have to do a lot of string concatenations.
It’s reasonably fast to use plain old string concatenation, but using TStringBuilder turned out to be about twice as fast.

Emba is not known for writing efficient RTL code so I was wondering if we could speed up the generation (even) further.

The answer is FastStringBuilder, a drop-in replacement.
Building a large string by adding lots of small strings is always faster using FastStringBuilder.TStringBuilder. Note that the Win64 target has the lion share of optimizations. I plan to add additional Win32 optimizations in due course.

The degree of speed-up differs depending on the use case. I’m still working on a faster alternative for the calls to System.Move. Move can deal with overlapping moves, which I don’t need to worry about here.

You can download FastStringBuilder from

Here’s how to use the unit:

unit MyTest;

uses SysUtils, FastStringBuilder; //Make sure FastStringBuilder is used after SysUtils!

  TExample = class
    SB: TStringBuilder;
    function Test: integer;


function TExample.Test;
  i: integer;
  if not(Assigned(SB)) then SB:= TStringBuilder.Create;
  for i:= 0 to 100 * 1000 do SB.Append('apple');

Timings for a number of additions:

StringBuilder FastStringBuilder String concat
string 1531 1203 1781
char 188 47 1734
integer 1563 562 1656

Note that the test timing are somewhat worse case; if I shorten the test string from 'appleappleapple to 'apple' FastStringBuilder speeds up to 562, String-Concat and StringBuilder stay more or less the same.

The code used to obtain the above timings:

  SSB: SysUtils.TStringBuilder;
  FSB: TStringBuilder;
  S: string;
  LTick: cardinal;

  TestCount = 100 * 1000 * 100;

procedure Timings;
  Test: string;
  C: Char;
  i,j: integer;

  j:= 1457454;
  Test:= 'appleappleapple';
  SSB:= SysUtils.TStringBuilder.Create;
  FSB:= TStringBuilder.Create;

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin

  WriteLn('Standard SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
  Writeln('Fast SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
    S:= S + Test;
  Assert(S  '');
  Writeln('string: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
  Writeln('Standard SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
  Writeln('Fast SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
    S:= S + C;
  Assert(S  '');
  Writeln('String: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
  Writeln('Standard SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
  Writeln('Fast SB: ',TThread.GetTickCount- LTick, 'ms');

  LTick:= TThread.GetTickCount;
  for i:= 0 to TestCount do begin
    S:= S + i.ToString;
  Assert(S  '');
  Writeln('String: ',TThread.GetTickCount- LTick, 'ms');

Operator overloading in assembly – beware register juggling.

Since D2006 Delphi supports operator overloading.
You can use inline and assembly in your custom operators, which can really help performance when you’re in a pinch.

In Program Control (Delphi) we can see:

Handling Function Results
For static-array, record, and set results, if the value occupies one byte it is returned in AL; if the value occupies two bytes it is returned in AX; and if the value occupies four bytes it is returned in EAX. Otherwise, the result is returned in an additional var parameter that is passed to the function after the declared parameters.

However the rules for register allocation with record operators is a bit strange, so for documentation (as well as my own future reference) here are the rules:
When the class operator returns a result that is too large to fit inside a register, an additional var parameter Result is inserted before the other parameters of the method.

Let’s define a record that holds a set.

  TMySet = record
  strict private
     FData: array[0..31] of byte;
     class operator BitWiseAnd(const A, B: TMySet): TMySet;   //translates to a procedure with one var and two const parameters.
     class operator Equal(const A, B: TMySet): boolean;       //translates to a static class function

//Translates to class procedure TMySet.BitwiseAnd(var Result: TMySet; const [ref] A,B: TMySet); static;
class operator TMySet.BitWiseAnd(const A, B: TMySet): TMySet;   //translates to a procedure with one var and two const parameters.
  //rcx = @Result
  //rdx = @A
  //r8 = @B
  movdqu xmm0,[rdx]   //Delphi does not do alignment
  movdqu xmm1,[rdx+16]
  movdqu xmm2,[r8]
  movdqu xmm3,[r8+16]
  pand xmm0,xmm2
  pand xmm1,xmm3
  movdqu [rcx],xmm0
  movdqu [rcx+16],xmm1

//translates to class function TMySet.Equal(const [ref] A,B: TMySet): boolean;
class operator TMySet.Equal(const A, B: TMySet): boolean;       //translates to a static class function
  //rcx = @A  //How confusing
  //RDX = @B
  mov xmm0,[rcx]
  mov xmm1,[rcx+16]
  mov xmm2,[rdx]
  mov xmm3,[rdx+16]
  xor edx,edx            //
  pcmpeqq xmm0,xmm2      //xmm0=all ones if equal, zero's somewhere if not equal
  pcmpeqq xmm1,xmm3      //we have 2 comparisons per xmm-word
  pshufd xmm0,xmm1,$AA   //mix comparisons A and B
  pmovmskb eax,xmm0      //eax=$0000FFFF if all equal
  cmp eax,$FFFF          //is everything equal?
  mov eax,edx            //Assume not equal
  setz al                //if yes then put a true in result.


When we return a result that’s too large too fit inside the rax register, then the signature of the operator changes to a procedure with the result as the first parameter.
If the result fits inside a register, it is returned in rax as normal and the parameters work as usual with class functions.

Altering Delphi applications without access to the source code

A while ago I was asked to add some functionality to a Delphi app.

However the app is written in an old Delphi version; the company that wrote the app has long since disappeared and none of the contact people can be reached. Oh and there is no trace of source code anywhere.

Changing the application without writing code

Luckily the application uses SQL-server which is a nice multi-user database.
And there is the application itself. You can view and change the forms using a tool like Resource Hacker.  This allows me to change things like SQL queries embedded in the forms.
The client requested me to add new functionality. This was not difficult, I can just write a separate application that accesses the database and gives them the bells and whistles they need.

Integrating the new forms into the old application

Having separate applications for new and existing functionality is not a great option so I tried to integrate the two applications into one.

The following is needed to achieve this goal:

  1. Application that starts the old app and injects a dll (Seattle);
  2. dll that alters the old application and loads the new functionality (D7);
  3. dll with the new functionality (Seattle);
  4. test application with the forms in dll (3) so I can test the correct operation of the dll.
  5. A copy of the exact Delphi version that was used to write the original application (D7).

Step 1: Inject a dll that will alter the old application

program InjectOldApp;
  uStartAndHookOldApp in 'uStartAndHookOldApp.pas';
procedure HookOldApp;
  BytesWritten: SIZE_T;
  Process, Thread, ThreadId, hKernel: cardinal;
  pLoadLibrary, Parameters: pointer;
  DLL: AnsiString;
  HWndOldApp: HWnd;
  StartupInfo: TStartupInfo;
  ProcessInformation: TProcessInformation;
  Path: string;
  Path:= ParamStr(0);
  Path:= ExtractFilePath(Path);
  Dll:= Path + 'HookOldApp.dll';
  HWndOldApp:= FindWindowEx(0,0, nil, 'OldApp Window name');
  if HWndOldApp = 0 then begin
    FillChar(StartupInfo, SizeOf(TStartupInfo), #0);
    StartupInfo.cb:= SizeOf(StartupInfo);
    FillChar(ProcessInformation, SizeOf(TProcessInformation), #0);
    Path:= Path + 'OldApp.exe';
    OldAppProcessID:= ProcessInformation.dwProcessId;
  end else begin
    Assert(HWndOldApp <> 0);
    GetWindowThreadProcessId(HWndOldApp, @OldAppProcessID);
  Assert(OldAppProcessID <> 0);
  Process:= OpenProcess(PROCESS_ALL_ACCESS, False, OldAppProcessID);
  Assert(Process <> 0);
  Parameters:= VirtualAllocEx(Process, nil, Length(DLL), MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  WriteProcessMemory(Process, Parameters, PAnsiChar(DLL), Length(DLL)+1, BytesWritten);
  hKernel:= GetModuleHandle('KERNEL32.DLL');
  pLoadLibrary:= GetProcAddress(hKernel, 'LoadLibraryA');
  Thread:= CreateRemoteThread(Process, nil, 0, pLoadLibrary, Parameters, 0, ThreadId);
  WaitForSingleObject(Thread, INFINITE);
  VirtualFreeEx(Process, Parameters, 0, MEM_RELEASE);


BTW: I omitted the error handling and try blocks everywhere for brevity.
Here’s how it works:
First we check to see if the old application is already running, if not we start it.
We get a ProcessID. Using this ID we gain access to the process.
Then we write a string (the name of the dll to load) in a bit of memory that we allocate in the address space of that application, we execute some code, wait for it to complete, and do clean up afterwards.
The code we execute is of course ‘LoadLibrary’ and the parameter we give is a pointer to the string we’ve just written.
This is text book code injection and this is bog standard.

Step 2: alter the old application using a dll written in exactly the same Delphi version as was used originally.

We need the old Delphi version (D7), because we need to make sure that the runtime library we’re using is exactly the same as the one the application is using.
We need to tread very carefully here and the smaller the impact we’re making on the old application the better it is.
We need to do the following things here:

  1. Start a procedure that does the following:
  2. Get a pointer to OldApp.MainForm, so we can inspect and activate new menus, buttons etc;
  3. Change the memory manager in our dll, so that it uses the memory manager of the application itself;
  4. Force the old application to load the dll with our shiny new functionality.

2.1 Start a procedure to do the work

We can’t just do work in dllmain. That may lock the application. In order to prevent deadlocks we will start a new thread and do our work there.

procedure EntryPoint(Reason: dword);
  ThreadID: cardinal;
  if Reason = DLL_PROCESS_ATTACH then begin
    Win32Check(CreateThread(nil,0, @Main, nil, 0, ThreadId));
  Assert(ThreadID <> 0);

2.2 Get a pointer to OldApp.MainForm

We need a bit of voodoo to make this happen. The trick is to inspect the property list of the Window using EnumProps; a pointer to MainForm.Self is stored in a WinAPI property called ControlOf. Here’s how to get that pointer.

function Main(IgnoreMe: pointer): DWord; stdcall;
   //Wait for the MainWindow to become visible and get its HWnd.
  if HWndOldApp <> 0 then begin
    EnumProps(HWndOldApp, @PropEnumProc);
function PropEnumProc(HWND: HWND; Prop: LPCTSTR; hData: integer): longbool; stdcall;
  OrigProp: string;
  //ShowMessage('start enum');
  Result:= true;
  if (Integer(Prop) and $FFFF0000) <> 0 then begin //skip atoms
    OrigProp:= string(Prop);
    if Pos('ControlOf', OrigProp) = 1 then begin
      //ShowMessage('ControlOf found: '+ OrigProp);
      MainFormSelf:= pointer(hData);
      Result:= false; //stop enumerating
    //showmessage('exit enum');

The PropEnumProc puts a pointer to MainForm in the global variable MainFormSelf and exits. Now we can use Delphi to manipulate the form. However we need to be careful because we don’t know the framework that was used to create the controls. We can’t assume that just because a button is of type TButton that it’s a stock control. The original programmer may just have been obtuse in their naming of controls.
I got this code from the internet somewhere, but I can’t find the source any more. If someone knows the original source I’ll put up a link.

2.3 Change the memory manager in our dll so that it uses the application’s MemMan.

We can’t just hook into the old application and run its code whilst using our own memory manager. Using ShareMem does no good, we don’t want to share we want to cuddle and so we need to get very intimate.

function Main(IgnoreMe: pointer): DWord; stdcall;

procedure HookMemMan;
  Data: PByte;
  OffsetGetMem: integer;
  Data:= @MemMan.GetMem;
  SetLength(MemSnippet, SizeOf(TMemSnippet));
  CopyMemory(@MemSnippet[0], Data, SizeOf(TMemSnippet));
  //Go look for the address of the routine in memory
  OffsetGetMem:= GetProcessBlock(MemSnippet, SizeOf(MemSnippet));
  MemMan.GetMem:= pointer(OffsetGetMem);

  Data:= @MemMan.FreeMem;
  SetLength(MemSnippet, SizeOf(TMemSnippet));
  CopyMemory(@MemSnippet[0], Data, SizeOf(TMemSnippet));
  //Go look for the address of the routine in memory
  OffsetGetMem:= GetProcessBlock(MemSnippet, SizeOf(MemSnippet));
  MemMan.FreeMem:= pointer(OffsetGetMem);

  Data:= @MemMan.ReallocMem;
  SetLength(MemSnippet, SizeOf(TMemSnippet));
  CopyMemory(@MemSnippet[0], Data, SizeOf(TMemSnippet));
  //Go look for the address of the routine in memory
  OffsetGetMem:= GetProcessBlock(MemSnippet, SizeOf(MemSnippet));
  MemMan.ReallocMem:= pointer(OffsetGetMem);

HookMemMan makes a few potentially dangerous assumptions:

  • The original application uses the default memory manager;
  • The memory routines are located somewhere near the start of the executable;
  • The function signatures of the memory routines are unique (or at least unique enough).

We do a simple search and compare in the memory space of the old application to find the addresses of GetMem, FreeMem and ReAllocMem. This involves getting read access to the memory where the code of the application is stored using OpenProcess, VirtualQueryEx and ReadProcessMemory.

Finally we bolt our memory manager to the old application’s one using SetMemoryManager. Now we can safely call functions inside the application without memory corruptions.
If the application were to use FastMM then we’d have to get a pointer to GetMemoryManager in the old application and use that instead.

However there is another snag….
We are running in a different thread, so we need to somehow execute our code inside the main thread of the application.
In order to solve this we simply set the OnIdle event of the mainform so that it will in due course run the code we want in the correct thread using the correct memory manager.

constructor TEventHandler.Create(App: TApplication);
  inherited Create;
  MyApp:= App;
  OldApplicationActivate:= MyApp.OnActivate;
  OldApplicationIdle:= MyApp.OnIdle;
  MyApp.OnActivate:= Self.DoActivate;
  MyApp.OnIdle:= Self.DoIdle;

If the old application has fancy Setters and getters for the event handlers, this might fail due to the fact that the property is not stored where the RTL in the dll thinks it is.
So this code assumes the old application has direct access properties as is usual and the setting of the pointer is atomic so you should have no issues.
If the event handlers are rewired using getters and setters, you need to read the published properties of the form and retrieve the setter from there, but you’ll also run into thread contention issues. 

2.4 Force the application to load our shiny new dll.

Prior to running this code we’ve changed the old exe file and added some menus and buttons as needed using Resource Hacker. Adding controls programmatically is a bad idea, because it breaks if the original application uses non-standard controls or has a patch in their VCL code.
In OldApp.OnIdle we change the click-handlers of these new controls so that they run the new forms:

procedure ScanInvoicesClick(Form: TForm; Sender: TMenuItem);
  HLib: HInst;
  ScanInvoive: TScanInvoice;
  HLib:= LoadLibrary('NewFuctionalityWrittenInDSeattle.dll');
  ScanInvoice:= GetProcAddress(HLib, 'ShowFormScan');
  if Assigned(ScanInvoice) then ScanInvoice(Form.Handle)
  else ShowMessage('cannot find NewFuctionalityWrittenInDSeattle.dll');

Showing the new forms

The new forms in NewFuctionalityWrittenInDSeattle.dll are completely stand-alone and share no state with the old application. The dll with the new functionality looks almost exactly like a stand-alone app. It is a library with a number of self-contained forms attached.
Every form has an exported function in the library code that looks like so:

procedure ShowFormScan(ParentForm: HWnd); export;
  FormScan: TFormScan;
  FormScan:= TFormScan.Create(nil);
  ShowForm(FormScan, ParentForm);

That’s it, now I can add and replace as many forms in the old application as I want.
What I cannot do however is alter code inside existing forms in my old application, that is just too much of a minefield.

Fun with record helpers

Delphi introduced record helper for built-in types a while ago (XE3).

These record helpers are much more powerful than you’d expect.
I found this out when I needed to hide some complex bit manipulation behind record helpers.

My real code involves lots of complex bit manipulation and partial population counts, but I’d like to use a simple example to demonstrate.

Suppose we have a int64 that holds 1 bit for every square on a checker board. A 1 means there’s a piece and a 0 means there’s no piece.
We can also keep track of just the white or just the black pieces and the combination of both.

unit Example;



  TPiece = type int64;
  TPieces = type TPiece;
  TCollision = type TPieces;

  TCollisionHelper = record helper for TCollision
    function IsHit(Piece: TPiece): boolean; overload;
    class function IsHit(White, Black: TPieces): TCollision; overload; static;
    function Count: integer; inline;

  TPiecesHelper = record helper for TPieces
    function Count: integer; inline; //popcount
    class function GetCount(Field: Int64): integer; static;

  TWhitePiece = type TPiece;

  TWhitePieceHelper = record helper for TWhitePiece
    procedure MoveTo(From, &To: TPoint);
    function CanJump: boolean; inline;

  TBlackPiece = type TPiece;

  TBlackPieceHelper = record helper for TBlackPiece
    procedure MoveByOne(From: TPoint; Direction: TDirection);
    function CanJump: boolean; inline;

  procedure Test;



{ TCollisionHelper }
function TCollisionHelper.IsHit(Piece: TPiece): boolean;
  Result:= (self and Piece) <> 0;

function TCollisionHelper.Count: integer;
  Result:= TPieces.GetCount(Self);

class function TCollisionHelper.IsHit(White, Black: TPieces): TCollision;
  Result:= White and Black;

class function TPiecesHelper.GetCount(Field: Int64): integer;
  {$ifdef CPUX86}
  popcnt eax,eax
  popcnt edx,edx
  add eax,edx
  {$ifdef CPUX64}
  popcnt rax,rcx

{ TPiecesHelper }
function TPiecesHelper.Count: integer;
  Result:= GetCount(Self);

{ TWhitePieceHelper }
procedure TWhitePieceHelper.MoveTo(From, &To: TPoint);
  writeln('white moves');

function TWhitePieceHelper.CanJump: boolean;
  Result:= true;

{ TBlackPieceHelper }
procedure TBlackPieceHelper.MoveByOne(From: TPoint; Direction: TDirection);
  writeln('black slides');

function TBlackPieceHelper.CanJump: boolean;
  Result:= false;

procedure Test;
  White, Black: TPieces;
  PieceA: TPiece;
  Clashes: TCollision;
  White:= Random(Int64.MaxValue);
  Black:= not(White);
  Clashes:= TCollision.IsHit(Black, White);



Why not use (enhanced) records?
The problem with records is that all the normal operators (or/xor+/-) don’t work out of the box. With plain types + helpers you get a little of both worlds.

I haven’t found a way to get proper inheritance; it is so close. I wish Embarcadero would get enhanced helpers done already.

Announcing Fastcode

I would like to announce the fastcode project.

The old fastcode project brought a lot of fast string handling and other code to the Win32 Delphi compiler.
Codegear was smart enough to incorporate the fast code into the RTL and every Win32 Delphi project is better because of it.

However things have changed since then.
Unicode came and wiped out many of the speedy hacks possible with Ansistrings.
The Win64 compiler was introduced with XE2 and OSX, android and iOS have been added as targets.
Looking at the sourcecode in the Win64 tree it’s obvious that this code has not been touched by the fastcode brush.
Especially the string routines are very inefficient.

Often strings are copied into temporary clones and naive byte by byte comparisons written in purepascal are used.

In order to remedy this I’ve decided to restart the fastcode project and see what speed gains are possible in X64 and purepascal.

Introducing FastDefaults
The first candidate for optimization is system.generics.defaults.
It uses a great number of very simple routines to compare two generic variables.

A typical example of an inefficient comparison is:

function Compare_UString(Inst: PSimpleInstance; const Left, Right: UnicodeString): Integer;
Result := CompareStr(Left, Right);

Looks innocent enough, except that `CompareStr` has only been optimized under Win32. On every other platform the following pure pascal code gets executed:

function CompareStr(const S1, S2: string): Integer;
I, Last, L1, L2, C1, C2, Ch1, Ch2: Integer;
L1 := Length(S1);
L2 := Length(S2);
result := L1 - L2;
if (L1 > 0) and (L2 > 0) then
if result < 0 then Last := L1 shl 1
else Last := L2 shl 1;

I := 0;
while I < Last do
C1 := PInteger(PByte(S1) + I)^;
C2 := PInteger(PByte(S2) + I)^;
if C1 <> C2 then
{ Compare first character }
Ch1 := C1 and $0000FFFF;
Ch2 := C2 and $0000FFFF;
if Ch1 <> Ch2 then
Exit(Ch1 – Ch2);

{ Compare second }
Ch1 := (C1 and $FFFF0000) shr 16;
Ch2 := (C2 and $FFFF0000) shr 16;
if Ch1 <> Ch2 then
Exit(Ch1 – Ch2);
inc(I, 4);

The handcrafted X64 assembly in System.Generics.FastDefaults is 8x faster!

Test : SpeedTest.TSpeedTest.SimpleRun.SimpleRun-2-of-5
Executing Test : SimpleRun-2-of-5

Empty loop took 0 millisec, 152 ticks
Slow compare loop took 1 millisec, 4203 ticks
Fast compare loop took 0 millisec, 557 ticks

This is just the first draft. Obviously there is room for improvement here.

Faster hashes
The hashing routines have also been improved.
Bob Jenkins lookup3 hash is starting to show its age and does not translate all that well into pascal.

Lets see what happens if we replace it with Murmurhash3 or FNV1a_Meiyan or Crc32c

Crc32 took: 6 millisec = 3701 ticks (13963 ticks in x64 {no idea why})
Murmur3asm took: 1 millisec = 3791 ticks output = -861992540
Murmur3 Pascal took: 2 millisec = 5937 ticks output = -861992540
Meiyan took: 1 millisec = 2767 ticks output = 726226886
MeiyanAsm took: 1 millisec = 2673 ticks output = 726226886
Bobjenkins took: 6 millisec = 13999 ticks output = -1996702505

Both crc32 and MurmurHash3 have excellent collision stats.
Meiyan is still untested, but is the fastest hash I could find online.
The possibility of a 5x speedup in hashing is nothing to sneeze at and the 3.5x speedup that the proven Murmur3 offers is pretty good as well.
Note that the crc32 profiled above is optimized x64/x86 assembly version of crc32c that does **not** use SSE4.2.

Because hashing is used extensively in TDictionary a better hash can speed up your application noticeably.

The fastcode repository can be found at:
Any comments, complaints and contributions are welcome.