Landman Code Exploring my outer regions of coding.

A blog about C#, Delphi, assembler and general developer stuff.

Landman Code Exploring my outer regions of coding.

A blog about C#, Delphi, assembler and general developer stuff.

SuperFastHash from Paul Hsieh Translated to Delphi and Borland Assembler (BASM)

This cheetah picture from flickr is from Don Van Dyke and has the nc-nd-2.0 license. I've been a member of the FastCode community for a little while. The FastCode community tries to optimize core functions from Delphi using mostly BASM. After they have a very intensive benchmark en validation process the winner is suggested to Borland/CodeGear through a QC report, and is likely included in the next version of Delphi. I even have a few functions which are winners in these benchmarks and have a QC report with my function as a suggestion.

How it started

Why this introduction? Well, the community lives in the Borland Newsgroup in the group borland.public.delphi.language.basm and it's also a general discussion place about 'hardcore' optimalizations for a Delphi function. More than a year ago Juhani Suhonen asked for a fast hash to use for his hashtable. I suggested the old but nicely performing elf-hash, but also noted a much better hash function I recently found. It was called SuperFastHash (SFH) and was created by Paul Hsieh to overcome his 'problems' with the hash functions from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm. A few people worked on a basm implementation and posted it.

In this post I will roughly describe how I as a fresh assembler optimizer created my version of the SFH function.

The Delphi implementation

I started with a regular Delphi implementation of the algorithm, using the c code from Paul's article as reference. After I got my Delphi implementation compiled and working I looked at the BASM code using the CPU debug view and tried to find points where I could help the Delphi compiler generating better assembly. Below is my initial pascal version. (beware I will post my final version later on in this post).

function SuperFastHash(AData:pointer; ADataLength: integer):longword;
// Pascal translation of the SuperFastHash function by Paul Hsieh
// more info: http://www.azillionmonkeys.com/qed/hash.html
// Translation by: Davy Landman
// No warranties, but have fun :)
var
  TempPart: longword;
  RemainingBytes: integer;
begin
  if not Assigned(AData) or (ADataLength <= 0) then
  begin
    Result := 0;
    Exit;
  end;
  Result := ADataLength;
  RemainingBytes := ADataLength and 3;
  ADataLength := ADataLength shr 2; // div 4, so var name is not correct anymore..
  // main loop
  while ADataLength > 0 do
  begin
    inc(Result, PWord(AData)^);
    TempPart := (PWord(Pointer(Cardinal(AData)+2))^ shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    AData := Pointer(Cardinal(AData) + 4);
    inc(Result, Result shr 11);
    dec(ADataLength);
  end;
  // end case
  if RemainingBytes = 3 then
  begin
    inc(Result, PWord(AData)^);
    Result := Result xor (Result shl 16);
    Result := Result xor (PByte(Pointer(Cardinal(AData)+2))^ shl 18);
    inc(Result, Result shr 11);
  end
  else if RemainingBytes = 2 then
  begin
    inc(Result, PWord(AData)^);
    Result := Result xor (Result shl 11);
    inc(Result, Result shr 17);
  end
  else if RemainingBytes = 1 then
  begin
    inc(Result, PByte(AData)^);
    Result := Result xor (Result shl 10);
    inc(Result, Result shr 1);
  end;
  // avalance
  Result := Result xor (Result shl 3);
  inc(Result, Result shr 5);
  Result := Result xor (Result shl 4);
  inc(Result, Result shr 17);
  Result := Result xor (Result shl 25);
  inc(Result, Result shr 6);
end;

Warning this is not final code!

The BASM implementation

I posted this and later posted an assembly version of SFH, I used the CPU view and the code suggestions from Bob Gonder to create this BASM code. Using BASM you can control the exact usage of registers and you're able to fine tune the assembly much better (and easier) than writing Delphi code.

My enthusiasm got the better of me and I spent a whole lot of evenings optimizing the Delphi and assembly versions of the SFH function (while I should have been working on my graduation internship paper). The discussion on the newsgroup turned a few interesting corners about the different kind of definitions of a hashing function. Therefor I must note that this function should only be used for a hashtable! It's tuned for giving a good distribution but not for cryptographically or verification usage.

The final unit

I think I've posted my final functions in the discussion thread, but recently I created a separate function file including some new parts. For instance, I added an functions for larger data which is the same algorithm, but uses less jumps because I unrolled the function. Below is my final unit of the SuperFastHash function.

unit unSuperFastHash;
(*
  A Delphi and assembly translation of the SuperFastHash function by
  Paul Hsieh (http://www.azillionmonkeys.com/qed/hash.html).

  I got the idea for translating it due to borland.public.delphi.language.basm.
  See the full discussion at:
  http://groups.google.com/group/borland.public.delphi.language.basm/
  browse_thread/thread/96482ba4d1d5a016/7745466ab714c3b3

 ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is SuperFastHash Delphi and BASM translation.
 *
 * The Initial Developer of the Original Code is
 * Davy Landman.
 * Portions created by the Initial Developer are Copyright (C) 2007
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** *)

interface

{.$define ASMVersion}

function SuperFastHash(AData: pointer; ADataLength: Integer): Cardinal;
function SuperFastHashLargeData(AData: pointer; ADataLength: Integer): Cardinal;

implementation

// Pascal translation of the SuperFastHash function by Paul Hsieh
// more info: http://www.azillionmonkeys.com/qed/hash.html
function SuperFastHash(AData: pointer; ADataLength: Integer): Cardinal;
{$ifndef ASMVersion}
var
  TempPart: Cardinal;
  RemainingBytes: Integer;
  RemainingDWords: Integer;
begin
  if not Assigned(AData) or (ADataLength <= 0) then
  begin
    Result := 0;
    Exit;
  end;
  Result := ADataLength;
  RemainingBytes := ADataLength and 3; // mod 4
  RemainingDWords := ADataLength shr 2; // div 4

  // main loop
  while RemainingDWords > 0 do
  begin
    Result := Result + PWord(AData)^;
    // splitting the pointer math keeps the amount of registers pushed at 2
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));
    TempPart := (PWord(AData)^ shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));
    Result := Result + (Result shr 11);
    dec(RemainingDWords);
  end;
  // Handle end cases
  if RemainingBytes = 3 then
  begin
    Result := Result +    PWord(AData)^;
    Result := Result xor (Result shl 16);
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));   // skip to the last byte
    Result := Result xor ((PByte(AData)^ shl 18));
    Result := Result +   (Result shr 11);
  end
  else if RemainingBytes = 2 then
  begin
    Result := Result +    PWord(AData)^;
    Result := Result xor (Result shl 11);
    Result := Result +   (Result shr 17);
  end
  else if RemainingBytes = 1 then
  begin
    Result := Result +    PByte(AData)^;
    Result := Result xor (Result shl 10);
    Result := Result +   (Result shr 1);
  end;
  // Force "avalanching" of final 127 bits
  Result := Result xor (Result shl 3);
  Result := Result +   (Result shr 5);
  Result := Result xor (Result shl 4);
  Result := Result +   (Result shr 17);
  Result := Result xor (Result shl 25);
  Result := Result +   (Result shr 6);
{$else}
asm
    push  esi
    push  edi
    test  eax, eax // data
    jz    @Ret // eax is result
    xchg  edx, eax // swith data and length
    test  eax, eax // length, and hash
    jle    @Ret
@Start:
    mov   edi, eax
    mov   esi, eax
    and   edi, 3    // last few bytes
    shr   esi, 2    // number of DWORD loops
    jz    @Last3
@Loop:
    movzx ecx, word ptr [edx]
    add   eax, ecx
    movzx ecx, word ptr [edx + 2]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx
    add   edx, 4
    dec   esi
    jnz   @Loop
@Last3:
    test  edi, edi
    jz    @Done
    dec   edi
    jz    @OneLeft
    dec   edi
    jz    @TwoLeft

    movzx ecx, word ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 16
    xor   eax, ecx
    movsx ecx, byte ptr [edx + 2]
    shl   ecx, 18
    xor   eax, ecx
    mov   ecx, eax
    shr   ecx, 11
    add   eax, ecx
    jmp   @Done
@TwoLeft:
    movzx ecx, word ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 11
    xor   eax, ecx
    mov   ecx, eax
    shr   eax, 17
    add   eax, ecx
    jmp   @Done
@OneLeft:
    movsx ecx, byte ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 10
    xor   eax, ecx
    mov   ecx, eax
    shr   eax, 1
    add   eax, ecx
@Done:
    // avalanche
    mov   ecx, eax
    shl   eax, 3
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 5
    add   eax, ecx

    mov   ecx, eax
    shl   eax, 4
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 17
    add   eax, ecx

    mov   ecx, eax
    shl   eax, 25
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 6
    add   eax, ecx
@Ret:
    pop   edi
    pop   esi
    ret
{$endif}
end;

function SuperFastHashLargeData(AData: pointer; ADataLength: Integer): Cardinal;
{$ifndef ASMVersion}
type
  TWordArray = array[0..(MaxInt div SizeOf(Word)) - 1] of Word;
  PWordArray = ^TWordArray;
var
  TempPart: Cardinal;
  RemainingBytes: Integer;
  RemainingDWords: Integer;
begin
  if not Assigned(AData) or (ADataLength <= 0) then
  begin
    Result := 0;
    Exit;
  end;
  Result := ADataLength;
  RemainingBytes := ADataLength and 3;
  RemainingDWords := ADataLength shr 2; // div 4
  // large loop
  while RemainingDWords >= 4 do
  begin
    Result := Result + PWord(AData)^;
    TempPart := (PWordArray(AData)^[1] shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    Result := Result + (Result shr 11);

    Result := Result + PWordArray(AData)^[2];
    TempPart := (PWordArray(AData)^[3] shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    Result := Result + (Result shr 11);

    Result := Result + PWordArray(AData)^[4];
    TempPart := (PWordArray(AData)^[5] shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    Result := Result + (Result shr 11);

    Result := Result + PWordArray(AData)^[6];
    TempPart := (PWordArray(AData)^[7] shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    Result := Result + (Result shr 11);

    // update the pointer and the counter
    AData  := Pointer(Cardinal(AData) + (8 * SizeOf(Word)));
    RemainingDWords := RemainingDWords - 4;
  end;
  // small loop
  while RemainingDWords > 0 do
  begin
    Result := Result + PWord(AData)^;
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));
    TempPart := (PWord(AData)^ shl 11) xor Result;
    Result := (Result shl 16) xor TempPart;
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));
    Result := Result + (Result shr 11);
    dec(RemainingDWords);
  end;
  // Handle end cases
  if RemainingBytes = 3 then
  begin
    Result := Result +    PWord(AData)^;
    Result := Result xor (Result shl 16);
    AData  := Pointer(Cardinal(AData) + SizeOf(Word));   // skip to the last byte
    Result := Result xor ((PByte(AData)^ shl 18));
    Result := Result +   (Result shr 11);
  end
  else if RemainingBytes = 2 then
  begin
    Result := Result +    PWord(AData)^;
    Result := Result xor (Result shl 11);
    Result := Result +   (Result shr 17);
  end
  else if RemainingBytes = 1 then
  begin
    Result := Result +    PByte(AData)^;
    Result := Result xor (Result shl 10);
    Result := Result +   (Result shr 1);
  end;
    // Force "avalanching" of final 127 bits
  Result := Result xor (Result shl 3);
  Result := Result +   (Result shr 5);
  Result := Result xor (Result shl 4);
  Result := Result +   (Result shr 17);
  Result := Result xor (Result shl 25);
  Result := Result +   (Result shr 6);
{$else}
 asm
    push  esi
    push  edi
    test  eax, eax // test for nil pointer
    jz    @Ret     // eax is also result, so save ret here
    xchg  edx, eax // swith data and length
    test  eax, eax // length, and hash
    jle    @Ret
@Start:
    mov   edi, eax
    mov   esi, eax
    and   edi, 3    // last few bytes
    shr   esi, 2    // number of DWORD loops
    jz    @Last3
@LargeLoop:
    cmp esi,$04
    jl @Loop
    // first DWORD
    movzx ecx, word ptr [edx]
    add   eax, ecx
    movzx ecx, word ptr [edx + 2]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx
    // second DWORD
    movzx ecx, word ptr [edx + 4]
    add   eax, ecx
    movzx ecx, word ptr [edx + 6]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx

    // third DWORD
    movzx ecx, word ptr [edx + 8]
    add   eax, ecx
    movzx ecx, word ptr [edx + 10]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx

    // fourth DWORD
    movzx ecx, word ptr [edx + 12]
    add   eax, ecx
    movzx ecx, word ptr [edx + 14]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx

    add   edx, 16
    sub   esi, 4
    jz    @Last3
    jmp   @LargeLoop
@Loop:
    movzx ecx, word ptr [edx]
    add   eax, ecx
    movzx ecx, word ptr [edx + 2]

    shl   ecx, 11
    xor   ecx, eax
    shl   eax, 16

    xor   eax, ecx
    mov   ecx, eax

    shr   eax, 11
    add   eax, ecx
    add   edx, 4
    dec   esi
    jnz   @Loop
@Last3:
    test  edi, edi
    jz    @Done
    dec   edi
    jz    @OneLeft
    dec   edi
    jz    @TwoLeft

    movzx ecx, word ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 16
    xor   eax, ecx
    movsx ecx, byte ptr [edx + 2]
    shl   ecx, 18
    xor   eax, ecx
    mov   ecx, eax
    shr   ecx, 11
    add   eax, ecx
    jmp   @Done
@TwoLeft:
    movzx ecx, word ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 11
    xor   eax, ecx
    mov   ecx, eax
    shr   eax, 17
    add   eax, ecx
    jmp   @Done
@OneLeft:
    movsx ecx, byte ptr [edx]
    add   eax, ecx
    mov   ecx, eax
    shl   eax, 10
    xor   eax, ecx
    mov   ecx, eax
    shr   eax, 1
    add   eax, ecx
@Done:
    // avalanche
    mov   ecx, eax
    shl   eax, 3
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 5
    add   eax, ecx

    mov   ecx, eax
    shl   eax, 4
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 17
    add   eax, ecx

    mov   ecx, eax
    shl   eax, 25
    xor   eax, ecx

    mov   ecx, eax
    shr   eax, 6
    add   eax, ecx
@Ret:
    pop   edi
    pop   esi
    ret
{$endif}
end;

end.

To use this code you'll only have to download the unit and choose if you want to use the Pascal versions or the Assembly versions. You can switch between this using the {$define ASMVersion} preprocessor.

Difference between my version and Paul Hsieh's

The only difference between my Pascal code and Paul's c code is the pointer mathematics in the core loop. Below is the core loop c code.

for (;len > 0; len--) {
    hash  += get16bits (data);
    tmp    = (get16bits (data+2) << 11) ^ hash;
    hash   = (hash << 16) ^ tmp;
    data  += 2*sizeof (uint16_t);
    hash  += hash >> 11;
}

And this is my Pascal version of this code.

while RemainingDWords > 0 do
begin
  Result := Result + PWord(AData)^;
  // splitting the pointer math keeps the amount of registers pushed at 2
  AData  := Pointer(Cardinal(AData) + SizeOf(Word));
  TempPart := (PWord(AData)^ shl 11) xor Result;
  Result := (Result shl 16) xor TempPart;
  AData  := Pointer(Cardinal(AData) + SizeOf(Word));
  Result := Result + (Result shr 11);
  dec(RemainingDWords);
end;

Paul's code increments the pointer at the end data += 2*sizeof (uint16_t);, and uses get16bits (data+2) to get to the second Word while mine increases the data pointer in between, because the Delphi Compiler will not compile PWord(Pointer(Cardinal(AData)+2))^ into movzx ecx, word ptr [edx + 2] but will store the intermediate pointer in a extra register which causes performance loss.

The funny thing is, although I tried a lot of different ways to get a better performance I ended up with an implementation almost identical to Paul's. I hadn't found the assembly version he created until after I wrote my own, and I think it's nice to see that apart from the initialization and the case statement they are the same.

Next time I will post my benchmarking code.

Tags: , , ,