## SuperFastHash from Paul Hsieh Translated to Delphi and Borland Assembler (BASM)

I've been a member of the FastCode community for a little while. The FastCode community tries to optimize core functions from Delphi using mostly BASM. After they have a very intensive benchmark en validation process the winner is suggested to Borland/CodeGear through a QC report, and is likely included in the next version of Delphi. I even have a few functions which are winners in these benchmarks and have a QC report with my function as a suggestion.

### How it started

Why this introduction? Well, the community lives in the Borland Newsgroup in the group borland.public.delphi.language.basm and it's also a general discussion place about 'hardcore' optimalizations for a Delphi function. More than a year ago Juhani Suhonen asked for a fast hash to use for his hashtable. I suggested the old but nicely performing elf-hash, but also noted a much better hash function I recently found. It was called SuperFastHash (SFH) and was created by Paul Hsieh to overcome his 'problems' with the hash functions from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm. A few people worked on a basm implementation and posted it.

In this post I will roughly describe how I as a fresh assembler optimizer created my version of the SFH function.

### The Delphi implementation

I started with a regular Delphi implementation of the algorithm, using the c code from Paul's article as reference. After I got my Delphi implementation compiled and working I looked at the BASM code using the CPU debug view and tried to find points where I could help the Delphi compiler generating better assembly. Below is my initial pascal version. (beware I will post my **final** version later on in this post).

```
function SuperFastHash(AData:pointer; ADataLength: integer):longword;
// Pascal translation of the SuperFastHash function by Paul Hsieh
// more info: http://www.azillionmonkeys.com/qed/hash.html
// Translation by: Davy Landman
// No warranties, but have fun :)
var
TempPart: longword;
RemainingBytes: integer;
begin
if not Assigned(AData) or (ADataLength <= 0) then
begin
Result := 0;
Exit;
end;
Result := ADataLength;
RemainingBytes := ADataLength and 3;
ADataLength := ADataLength shr 2; // div 4, so var name is not correct anymore..
// main loop
while ADataLength > 0 do
begin
inc(Result, PWord(AData)^);
TempPart := (PWord(Pointer(Cardinal(AData)+2))^ shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
AData := Pointer(Cardinal(AData) + 4);
inc(Result, Result shr 11);
dec(ADataLength);
end;
// end case
if RemainingBytes = 3 then
begin
inc(Result, PWord(AData)^);
Result := Result xor (Result shl 16);
Result := Result xor (PByte(Pointer(Cardinal(AData)+2))^ shl 18);
inc(Result, Result shr 11);
end
else if RemainingBytes = 2 then
begin
inc(Result, PWord(AData)^);
Result := Result xor (Result shl 11);
inc(Result, Result shr 17);
end
else if RemainingBytes = 1 then
begin
inc(Result, PByte(AData)^);
Result := Result xor (Result shl 10);
inc(Result, Result shr 1);
end;
// avalance
Result := Result xor (Result shl 3);
inc(Result, Result shr 5);
Result := Result xor (Result shl 4);
inc(Result, Result shr 17);
Result := Result xor (Result shl 25);
inc(Result, Result shr 6);
end;
```

Warning this is not final code!

### The BASM implementation

I posted this and later posted an assembly version of SFH, I used the CPU view and the code suggestions from Bob Gonder to create this BASM code. Using BASM you can control the exact usage of registers and you're able to fine tune the assembly much better (and easier) than writing Delphi code.

My enthusiasm got the better of me and I spent a whole lot of evenings optimizing the Delphi and assembly versions of the SFH function (while I should have been working on my graduation internship paper). The discussion on the newsgroup turned a few interesting corners about the different kind of definitions of a hashing function. Therefor I must note that this function should only be used for a hashtable! It's tuned for giving a good distribution but not for cryptographically or verification usage.

### The final unit

I think I've posted my final functions in the discussion thread, but recently I created a separate function file including some new parts. For instance, I added an functions for larger data which is the same algorithm, but uses less jumps because I unrolled the function. Below is my final unit of the SuperFastHash function.

```
unit unSuperFastHash;
(*
A Delphi and assembly translation of the SuperFastHash function by
Paul Hsieh (http://www.azillionmonkeys.com/qed/hash.html).
I got the idea for translating it due to borland.public.delphi.language.basm.
See the full discussion at:
http://groups.google.com/group/borland.public.delphi.language.basm/
browse_thread/thread/96482ba4d1d5a016/7745466ab714c3b3
***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS IS" basis,
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
* for the specific language governing rights and limitations under the
* License.
*
* The Original Code is SuperFastHash Delphi and BASM translation.
*
* The Initial Developer of the Original Code is
* Davy Landman.
* Portions created by the Initial Developer are Copyright (C) 2007
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
*
*
* Alternatively, the contents of this file may be used under the terms of
* either the GNU General Public License Version 2 or later (the "GPL"), or
* the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
* in which case the provisions of the GPL or the LGPL are applicable instead
* of those above. If you wish to allow use of your version of this file only
* under the terms of either the GPL or the LGPL, and not to allow others to
* use your version of this file under the terms of the MPL, indicate your
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
* ***** END LICENSE BLOCK ***** *)
interface
{.$define ASMVersion}
function SuperFastHash(AData: pointer; ADataLength: Integer): Cardinal;
function SuperFastHashLargeData(AData: pointer; ADataLength: Integer): Cardinal;
implementation
// Pascal translation of the SuperFastHash function by Paul Hsieh
// more info: http://www.azillionmonkeys.com/qed/hash.html
function SuperFastHash(AData: pointer; ADataLength: Integer): Cardinal;
{$ifndef ASMVersion}
var
TempPart: Cardinal;
RemainingBytes: Integer;
RemainingDWords: Integer;
begin
if not Assigned(AData) or (ADataLength <= 0) then
begin
Result := 0;
Exit;
end;
Result := ADataLength;
RemainingBytes := ADataLength and 3; // mod 4
RemainingDWords := ADataLength shr 2; // div 4
// main loop
while RemainingDWords > 0 do
begin
Result := Result + PWord(AData)^;
// splitting the pointer math keeps the amount of registers pushed at 2
AData := Pointer(Cardinal(AData) + SizeOf(Word));
TempPart := (PWord(AData)^ shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
AData := Pointer(Cardinal(AData) + SizeOf(Word));
Result := Result + (Result shr 11);
dec(RemainingDWords);
end;
// Handle end cases
if RemainingBytes = 3 then
begin
Result := Result + PWord(AData)^;
Result := Result xor (Result shl 16);
AData := Pointer(Cardinal(AData) + SizeOf(Word)); // skip to the last byte
Result := Result xor ((PByte(AData)^ shl 18));
Result := Result + (Result shr 11);
end
else if RemainingBytes = 2 then
begin
Result := Result + PWord(AData)^;
Result := Result xor (Result shl 11);
Result := Result + (Result shr 17);
end
else if RemainingBytes = 1 then
begin
Result := Result + PByte(AData)^;
Result := Result xor (Result shl 10);
Result := Result + (Result shr 1);
end;
// Force "avalanching" of final 127 bits
Result := Result xor (Result shl 3);
Result := Result + (Result shr 5);
Result := Result xor (Result shl 4);
Result := Result + (Result shr 17);
Result := Result xor (Result shl 25);
Result := Result + (Result shr 6);
{$else}
asm
push esi
push edi
test eax, eax // data
jz @Ret // eax is result
xchg edx, eax // swith data and length
test eax, eax // length, and hash
jle @Ret
@Start:
mov edi, eax
mov esi, eax
and edi, 3 // last few bytes
shr esi, 2 // number of DWORD loops
jz @Last3
@Loop:
movzx ecx, word ptr [edx]
add eax, ecx
movzx ecx, word ptr [edx + 2]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
add edx, 4
dec esi
jnz @Loop
@Last3:
test edi, edi
jz @Done
dec edi
jz @OneLeft
dec edi
jz @TwoLeft
movzx ecx, word ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 16
xor eax, ecx
movsx ecx, byte ptr [edx + 2]
shl ecx, 18
xor eax, ecx
mov ecx, eax
shr ecx, 11
add eax, ecx
jmp @Done
@TwoLeft:
movzx ecx, word ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 11
xor eax, ecx
mov ecx, eax
shr eax, 17
add eax, ecx
jmp @Done
@OneLeft:
movsx ecx, byte ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 10
xor eax, ecx
mov ecx, eax
shr eax, 1
add eax, ecx
@Done:
// avalanche
mov ecx, eax
shl eax, 3
xor eax, ecx
mov ecx, eax
shr eax, 5
add eax, ecx
mov ecx, eax
shl eax, 4
xor eax, ecx
mov ecx, eax
shr eax, 17
add eax, ecx
mov ecx, eax
shl eax, 25
xor eax, ecx
mov ecx, eax
shr eax, 6
add eax, ecx
@Ret:
pop edi
pop esi
ret
{$endif}
end;
function SuperFastHashLargeData(AData: pointer; ADataLength: Integer): Cardinal;
{$ifndef ASMVersion}
type
TWordArray = array[0..(MaxInt div SizeOf(Word)) - 1] of Word;
PWordArray = ^TWordArray;
var
TempPart: Cardinal;
RemainingBytes: Integer;
RemainingDWords: Integer;
begin
if not Assigned(AData) or (ADataLength <= 0) then
begin
Result := 0;
Exit;
end;
Result := ADataLength;
RemainingBytes := ADataLength and 3;
RemainingDWords := ADataLength shr 2; // div 4
// large loop
while RemainingDWords >= 4 do
begin
Result := Result + PWord(AData)^;
TempPart := (PWordArray(AData)^[1] shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
Result := Result + (Result shr 11);
Result := Result + PWordArray(AData)^[2];
TempPart := (PWordArray(AData)^[3] shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
Result := Result + (Result shr 11);
Result := Result + PWordArray(AData)^[4];
TempPart := (PWordArray(AData)^[5] shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
Result := Result + (Result shr 11);
Result := Result + PWordArray(AData)^[6];
TempPart := (PWordArray(AData)^[7] shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
Result := Result + (Result shr 11);
// update the pointer and the counter
AData := Pointer(Cardinal(AData) + (8 * SizeOf(Word)));
RemainingDWords := RemainingDWords - 4;
end;
// small loop
while RemainingDWords > 0 do
begin
Result := Result + PWord(AData)^;
AData := Pointer(Cardinal(AData) + SizeOf(Word));
TempPart := (PWord(AData)^ shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
AData := Pointer(Cardinal(AData) + SizeOf(Word));
Result := Result + (Result shr 11);
dec(RemainingDWords);
end;
// Handle end cases
if RemainingBytes = 3 then
begin
Result := Result + PWord(AData)^;
Result := Result xor (Result shl 16);
AData := Pointer(Cardinal(AData) + SizeOf(Word)); // skip to the last byte
Result := Result xor ((PByte(AData)^ shl 18));
Result := Result + (Result shr 11);
end
else if RemainingBytes = 2 then
begin
Result := Result + PWord(AData)^;
Result := Result xor (Result shl 11);
Result := Result + (Result shr 17);
end
else if RemainingBytes = 1 then
begin
Result := Result + PByte(AData)^;
Result := Result xor (Result shl 10);
Result := Result + (Result shr 1);
end;
// Force "avalanching" of final 127 bits
Result := Result xor (Result shl 3);
Result := Result + (Result shr 5);
Result := Result xor (Result shl 4);
Result := Result + (Result shr 17);
Result := Result xor (Result shl 25);
Result := Result + (Result shr 6);
{$else}
asm
push esi
push edi
test eax, eax // test for nil pointer
jz @Ret // eax is also result, so save ret here
xchg edx, eax // swith data and length
test eax, eax // length, and hash
jle @Ret
@Start:
mov edi, eax
mov esi, eax
and edi, 3 // last few bytes
shr esi, 2 // number of DWORD loops
jz @Last3
@LargeLoop:
cmp esi,$04
jl @Loop
// first DWORD
movzx ecx, word ptr [edx]
add eax, ecx
movzx ecx, word ptr [edx + 2]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
// second DWORD
movzx ecx, word ptr [edx + 4]
add eax, ecx
movzx ecx, word ptr [edx + 6]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
// third DWORD
movzx ecx, word ptr [edx + 8]
add eax, ecx
movzx ecx, word ptr [edx + 10]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
// fourth DWORD
movzx ecx, word ptr [edx + 12]
add eax, ecx
movzx ecx, word ptr [edx + 14]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
add edx, 16
sub esi, 4
jz @Last3
jmp @LargeLoop
@Loop:
movzx ecx, word ptr [edx]
add eax, ecx
movzx ecx, word ptr [edx + 2]
shl ecx, 11
xor ecx, eax
shl eax, 16
xor eax, ecx
mov ecx, eax
shr eax, 11
add eax, ecx
add edx, 4
dec esi
jnz @Loop
@Last3:
test edi, edi
jz @Done
dec edi
jz @OneLeft
dec edi
jz @TwoLeft
movzx ecx, word ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 16
xor eax, ecx
movsx ecx, byte ptr [edx + 2]
shl ecx, 18
xor eax, ecx
mov ecx, eax
shr ecx, 11
add eax, ecx
jmp @Done
@TwoLeft:
movzx ecx, word ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 11
xor eax, ecx
mov ecx, eax
shr eax, 17
add eax, ecx
jmp @Done
@OneLeft:
movsx ecx, byte ptr [edx]
add eax, ecx
mov ecx, eax
shl eax, 10
xor eax, ecx
mov ecx, eax
shr eax, 1
add eax, ecx
@Done:
// avalanche
mov ecx, eax
shl eax, 3
xor eax, ecx
mov ecx, eax
shr eax, 5
add eax, ecx
mov ecx, eax
shl eax, 4
xor eax, ecx
mov ecx, eax
shr eax, 17
add eax, ecx
mov ecx, eax
shl eax, 25
xor eax, ecx
mov ecx, eax
shr eax, 6
add eax, ecx
@Ret:
pop edi
pop esi
ret
{$endif}
end;
end.
```

To use this code you'll only have to download the unit and choose if you want to use the Pascal versions or the Assembly versions. You can switch between this using the {$define ASMVersion} preprocessor.

### Difference between my version and Paul Hsieh's

The only difference between my Pascal code and Paul's c code is the pointer mathematics in the core loop. Below is the core loop c code.

```
for (;len > 0; len--) {
hash += get16bits (data);
tmp = (get16bits (data+2) << 11) ^ hash;
hash = (hash << 16) ^ tmp;
data += 2*sizeof (uint16_t);
hash += hash >> 11;
}
```

And this is my Pascal version of this code.

```
while RemainingDWords > 0 do
begin
Result := Result + PWord(AData)^;
// splitting the pointer math keeps the amount of registers pushed at 2
AData := Pointer(Cardinal(AData) + SizeOf(Word));
TempPart := (PWord(AData)^ shl 11) xor Result;
Result := (Result shl 16) xor TempPart;
AData := Pointer(Cardinal(AData) + SizeOf(Word));
Result := Result + (Result shr 11);
dec(RemainingDWords);
end;
```

Paul's code increments the pointer at the end data += 2*sizeof (uint16_t);, and uses get16bits (data+2) to get to the second Word while mine increases the data pointer in between, because the Delphi Compiler will not compile PWord(Pointer(Cardinal(AData)+2))^ into movzx ecx, word ptr [edx + 2] but will store the intermediate pointer in a extra register which causes performance loss.

The funny thing is, although I tried a lot of different ways to get a better performance I ended up with an implementation almost identical to Paul's. I hadn't found the assembly version he created until after I wrote my own, and I think it's nice to see that apart from the initialization and the case statement they are the same.

Next time I will post my benchmarking code.