Cracking XenForo corpuses: An unsupported sha256(sha256($pass).$salt) hash type

Travis Biehn

Mar 26, 2018 / 56 min read

A list that recently hit Hashes.org, with 1 million records and a low crack rate (the community had recovered less than 0.5% of the hashes), looked like an interesting target. On taking a closer look, we quickly found out why:

Out of the box, only JtR Jumbo carries support for the XenForo hash as a “dynamic hash.”


[List.Generic:dynamic_1503]
Expression=sha256(sha256($p).$s) (XenForo SHA-256)
Flag=MGF_INPUT_32_BYTE
Flag=MGF_SALTED
Flag=MGF_FLAT_BUFFERS
Flag=MGF_KEYS_BASE16_IN1_SHA256
MaxInputLenX86=110
SaltLen=-120 // dont know, so made it big, jfoug
MaxInputLen=110
Func=DynamicFunc__set_input_len_64
Func=DynamicFunc__append_salt
Func=DynamicFunc__SHA256_crypt_input1_to_output1_FINAL
Test=$dynamic_1503$453f2e21fa6c150670d3ecf0e4a0ff3bab8b1903c2e96ad655d960b95f104248$697de9eda4a02563a7ec66d42d4a96995cb2948e29ab76fbcc89e8db71dd10f1:password
Test=$dynamic_1503$a8a0e9545c1475e8546f8546d87fe2516cf525c12ad79a6a7a8fee2fb0d8afd3$697de9eda4a02563a7ec66d42d4a96995cb2948e29ab76fbcc89e8db71dd10f1:verlongcrappypassword01234567890

JtR’s dynamic hashes don’t run on OpenCL; a benchmark on a single core came in at about 4 MH/s, and during a run we hit around 80 MH/s across 20 CPU cores.
[Figure: John the Ripper performance in CPU mode, around 4 MH/s.]

What we had was a pretense for writing a custom OpenCL kernel for Hashcat—perfect. Adding to the effort was the lack of documentation around Hashcat (and sparse code comments), as well as no easily identifiable public information on rolling your own OpenCL Hashcat kernels. OpenCL is an exotic environment; there are some gaps in how you go about debugging or iterating on these kernels. The general methodology was running Hashcat in single-hash mode against the kernel and using printf() tombstones to understand what was going on under the hood. Hashcat’s other kernels and its OpenSSL-style functions are definitely key to getting something working in a limited amount of time.

Orientation


./hashcat/src/interface.h
This file contains enums and values required to add a new hash type.

./hashcat/src/interface.c
This file contains blocks that describe each hash type. It contains strings for Hashcat’s help function, as well as various constants and hash-parsing functions.

./hashcat/kernels/
This directory contains compiled OpenCL kernels. Hashcat compiles each OpenCL kernel at runtime and caches those kernels in this directory.

./hashcat/OpenCL/
This directory contains OpenCL kernel sources (.cl files).

Play-by-play Inspiration

To start with, identify an OpenCL a0 kernel (or set of kernels) that closely matches your workload and input type. For example, if your hash is salted, find a salted hash of the same type, and likewise if it’s iterated.

Copy it to its own kernel file, designated as m[Hashcat Mode Flag]_a0.cl. Search/replace this source to fix up the mode function names.

Adding the mode

Next up, add the new mode to Hashcat.

KERN_TYPE_SHA256_PW_SHA256_SLT was added as an enum value in interface.h:

KERN_TYPE_SHA256_PWSLT = 1410,
KERN_TYPE_SHA256_PW_SHA256_SLT = 1415,
KERN_TYPE_SHA256_SLTPW = 1420,

HT_* appears to be CLI / help-related and was added to interface.c:

case 1411: return HT_01411;
case 1415: return HT_01415;
case 1420: return HT_01420;
static const char *HT_01410 = "sha256($pass.$salt)";
static const char *HT_01415 = "sha256(sha256($pass).$salt)";
static const char *HT_01420 = "sha256($salt.$pass)";

A test hash plus its corresponding plaintext are used for self-testing:

static const char *ST_PASS_HASHCAT_PEANUT = "peanut";
...
static const char *ST_HASH_01415 = "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449";
...
case 1411: hashconfig->hash_type = HASH_TYPE_SHA256;
...

And in the main hash configuration block, the values were lifted from 1410, sha256($plain.$salt).

An important bit here is to set kern_type properly. sha256s_parse_hash parses the hash list. If your hash’s format doesn’t match exactly, it will be rejected. Hashcat has a robust set of these parsing functions, but you’ll need to hunt through them. opti_type doesn’t seem to do anything.

break;
case 1415: hashconfig->hash_type = HASH_TYPE_SHA256;
hashconfig->salt_type = SALT_TYPE_GENERIC;
hashconfig->attack_exec = ATTACK_EXEC_INSIDE_KERNEL;
hashconfig->opts_type = OPTS_TYPE_PT_GENERATE_BE
| OPTS_TYPE_ST_ADD80
| OPTS_TYPE_ST_ADDBITS15;
hashconfig->kern_type = KERN_TYPE_SHA256_PW_SHA256_SLT;
hashconfig->dgst_size = DGST_SIZE_4_8;
hashconfig->parse_func = sha256s_parse_hash;
hashconfig->opti_type = OPTI_TYPE_ZERO_BYTE
| OPTI_TYPE_PRECOMPUTE_INIT
| OPTI_TYPE_PRECOMPUTE_MERKLE
| OPTI_TYPE_EARLY_SKIP
| OPTI_TYPE_NOT_ITERATED
| OPTI_TYPE_APPENDED_SALT
| OPTI_TYPE_RAW_HASH;
hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;
hashconfig->st_hash = ST_HASH_01415;
hashconfig->st_pass = ST_PASS_HASHCAT_PEANUT;
break;

Quick and dirty test

After a make clean && make, Hashcat should list your new type. Benchmark mode won’t be useful for testing, as the a0 kernels aren’t used here. Running a simple single-hash test against the new kernel will cause Hashcat to build it from the OpenCL directory to the kernel directory. This is when you’ll receive compile-time errors. There’s probably a way to directly invoke the OpenCL build toolchain to speed this up.

In this case, I had some known hash/salt/plaintext values (recoverable from JtR and publicly listed). These will be absolutely essential. It also helps to have the raw results of intermediate steps in the scheme for debugging.

I’ll use this example for testing:

00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449:peanut

Steps from the command line:

#echo -n peanut | shasum -a 256
5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06 -
#echo -n 5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449 | shasum -a 256
00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa -

Hashcat test:

./hashcat -m 1415 -a 0 "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449" peanut.txt

hashcat (v4.0.1-123-g2095e27d+) starting...
OpenCL Platform #1: Apple

=========================

* Device #1: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz, skipped.
* Device #2: Intel(R) HD Graphics 530, 384/1536 MB allocatable, 24MCU
* Device #3: AMD Radeon Pro 460 Compute Engine, 1024/4096 MB allocatable, 16MCU

...
Watchdog: Temperature terminate trigger disabled.
* Device #2: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver
* Device #3: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver

Let’s get these self-tests to pass with this single hash.

Kicking out the jams

The majority of the work on an a0 kernel will be focused on these two functions:

__kernel void m01415_mxx
This function is invoked to handle cracking multiple target hashes at once. We can ignore this during testing.

__kernel void m01415_sxx
This function is invoked in single-hash mode (which is how we’re testing). We can focus our development effort here.

Let’s take a walk through the donor kernel’s __kernel void m01410_sxx function:

/**
* digest
*/
const u32 search[4] =
{
digests_buf[digests_offset].digest_buf[DGST_R0],
digests_buf[digests_offset].digest_buf[DGST_R1],
digests_buf[digests_offset].digest_buf[DGST_R2],
digests_buf[digests_offset].digest_buf[DGST_R3]
};

In this kernel, Hashcat doesn’t bother comparing all the bytes in the hash buffer, a basic optimization: even a partial comparison is unlikely to collide on a human-meaningful candidate selected by Hashcat. The bytes to compare are configured in the hashconfig struct:

hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;

It’s recommended that you use values from another kernel that has the same final hash round as your target.

/**
* base
*/
COPY_PW (pws[gid]);
const u32 salt_len = salt_bufs[salt_pos].salt_len;
u32 s[64] = { 0 };
for (int i = 0, idx = 0; i < salt_len; i += 4, idx += 1)
{
s[idx] = swap32_S (salt_bufs[salt_pos].salt_buf[idx]);
}

The kernel prepares the salt buffer once, performing up front the byte swap that the Hashcat sha256 library’s sha256_update_swap would otherwise do for every candidate.

Now is a good time to browse through the methods in inc_hash_sha256.cl. They aren’t documented in any way, and while they resemble other low-level C crypto implementations, some details have been changed to permit more efficient cracking. It helps to read how these methods, and their counterparts in the related hash families (e.g., sha512, sha1), are used in other kernels to understand the contexts they’re designed for.

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx;
sha256_init (&ctx);
sha256_update_swap (&ctx, tmp.i, tmp.pw_len);
sha256_update (&ctx, s, salt_len);
sha256_final (&ctx);
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

The function’s main loop permutes the password root candidate tmp.i according to the rules in the rules_buf array. It then establishes a sha256 context, updates the context with the password candidate buffer, updates it with the salt (using plain sha256_update, since the salt was already byte-swapped above), and finalizes the hash. Finally, it compares the bytes outlined in hashconfig for a match.

Because our source function is sha256($plain.$salt) and our target is sha256(sha256($plain).$salt), we know we need to add a round of sha256 to the plaintext before updating the final sha256 context, ctx.
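
Before touching the kernel, it helps to have a host-side reference of the target scheme to sanity-check values against. The following is a minimal sketch of my own (not Hashcat or XenForo code), assuming OpenSSL is available (build with gcc -lcrypto); it simply reproduces the shasum steps from the peanut example in one program:

/* Host-side reference of sha256(sha256($pass).$salt) for generating test
 * vectors. Illustrative sketch only; build with: gcc ref1415.c -lcrypto */
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

static void to_hex (const unsigned char *in, size_t len, char *out)
{
  for (size_t i = 0; i < len; i++) sprintf (out + i * 2, "%02x", in[i]);

  out[len * 2] = 0;
}

int main (void)
{
  const char *pass = "peanut";
  const char *salt = "b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449";

  unsigned char dgst[SHA256_DIGEST_LENGTH];
  char inner[SHA256_DIGEST_LENGTH * 2 + 1];

  SHA256 ((const unsigned char *) pass, strlen (pass), dgst); // sha256($pass)
  to_hex (dgst, sizeof (dgst), inner);                        // 64 lowercase hex chars

  char buf[256];
  snprintf (buf, sizeof (buf), "%s%s", inner, salt);          // sha256($pass).$salt

  char outer[SHA256_DIGEST_LENGTH * 2 + 1];
  SHA256 ((const unsigned char *) buf, strlen (buf), dgst);   // outer sha256
  to_hex (dgst, sizeof (dgst), outer);

  printf ("inner: %s\n", inner); // 5509840d...abe25d06
  printf ("outer: %s\n", outer); // 00050655...bba72e52aa
  return 0;
}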

What are some other modes that might carry out similar operations?

  • 40 | md5($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 3710 | md5($salt.md5($pass)) | raw hash, salted, and/or iterated
  • 140 | sha1($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 4520 | sha1($salt.sha1($pass)) | raw hash, salted, and/or iterated
  • 1440 | sha256($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 1740 | sha512($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 30 | md5(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 130 | sha1(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1430 | sha256(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1730 | sha512(utf16le($pass).$salt) | raw hash, salted, and/or iterated

Some nice candidates include m02610_a0.cl, m04700_a0.cl, m04520_a0.cl, m04500_a0.cl, and m04400_a0.cl. Reading them builds up more examples of how the Hashcat devs use the libraries and the OpenCL environment efficiently, and gives you patterns to study and reuse.

It’s worth mentioning that Hashcat appears to follow a naming convention for these: xxxyy, where xxx denotes the family and yy is 00 for unsalted, 10 for an appended salt, or 20 for a prepended salt. There are plenty of counterexamples, but this is the general scheme.

Let’s take a look at an example of a salted hash, with an extra round of hash on the plaintext. Thankfully, we have one in the same hash family: 4520, sha1($salt.sha1($pass)).

Here’s the core routine:

sha1_ctx_t ctx1;
sha1_init (&ctx1);
sha1_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha1_final (&ctx1);
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
sha1_ctx_t ctx = ctx0;
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha1_update_64 (&ctx, w0, w1, w2, w3, 40);
sha1_final (&ctx);

So we can see that this routine uses two sha1 contexts: ctx1 to contain sha1($pass) and ctx to contain sha1($salt.sha1($pass)). Additionally, this kernel introduces a new macro, uint_to_hex_lower8_le, and adds some logic to the _sxx function:

const u64 lsz = get_local_size (0);
/**
* bin2asc table
*/
__local u32 l_bin2asc[256];
for (u32 i = lid; i < 256; i += lsz)
{
const u32 i0 = (i >> 0) & 15;
const u32 i1 = (i >> 4) & 15;
l_bin2asc[i] = ((i0 < 10) ? '0' + i0 : 'a' - 10 + i0) << 0
| ((i1 < 10) ? '0' + i1 : 'a' - 10 + i1) << 8;
}

uint_to_hex_lower8_le looks up a byte (8 bits) in l_bin2asc, which returns its lowercase hex representation as two ASCII characters (16 bits). We rely on this to convert the raw output of the first sha1 round to the lowercase hex ASCII representation that is usually composed in these salted routines. In this case, the routine assumes little-endian byte ordering, which is appropriate for the SHA family but can differ; for example, the MD5 Hashcat libs are big-endian.
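
To make the packing concrete, here is a small standalone C sketch (my own illustration, not Hashcat code) that builds the same table on the host. Feeding it a = 0x5509840d, the first 32-bit word of sha256("peanut"), reproduces the 0x35353039 value that shows up as w0[0] in the debugging section below:

/* Standalone illustration of the l_bin2asc packing (host C, not Hashcat code).
 * Each table entry holds the two lowercase hex characters of one byte value,
 * with the low nibble's character in the low byte. Packing the top two bytes
 * of a = 0x5509840d yields 0x35353039, i.e. the ASCII characters '5','5','0','9'
 * in big-endian word order. */
#include <stdio.h>
#include <stdint.h>

int main (void)
{
  uint32_t bin2asc[256];

  for (uint32_t i = 0; i < 256; i++)
  {
    const uint32_t i0 = (i >> 0) & 15;
    const uint32_t i1 = (i >> 4) & 15;

    bin2asc[i] = ((i0 < 10) ? '0' + i0 : 'a' - 10 + i0) << 0
               | ((i1 < 10) ? '0' + i1 : 'a' - 10 + i1) << 8;
  }

  const uint32_t a = 0x5509840d; // first 32-bit word of sha256("peanut")

  const uint32_t w0_0 = bin2asc[(a >> 16) & 255] << 0
                      | bin2asc[(a >> 24) & 255] << 16;

  printf ("w0[0] = 0x%08x\n", w0_0); // prints w0[0] = 0x35353039
  return 0;
}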

The next step is understanding how to use this construct for sha256. The hex representation of a sha1 digest is 40 characters long, while sha256’s is 64. Let’s set this up in our new kernel. First, add the uint_to_hex_lower8_le macro to the beginning of the file. Then add the routine to build the l_bin2asc array. Next, let’s change the core crack routine:

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx; //containing the final context
sha256_init (&ctx);
sha256_ctx_t ctx1; //containing the sha256($pass) context
sha256_init (&ctx1);
sha256_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha256_final (&ctx1); //finalize the sha256 rounds.
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha256_update_64 (&ctx, w0, w1, w2, w3, 40); // prepend the internal sha256 hash.
sha256_update (&ctx, s, salt_len); // add the salt
sha256_final (&ctx); // finalize
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

Next up, we need to address the extra bytes available in sha256’s output:

const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
const u32 a1 = ctx1.h[5];
const u32 b1 = ctx1.h[6];
const u32 c1 = ctx1.h[7];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = uint_to_hex_lower8_le ((a1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 24) & 255) << 16;
w2[3] = uint_to_hex_lower8_le ((a1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 8) & 255) << 16;
w3[0] = uint_to_hex_lower8_le ((b1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 24) & 255) << 16;
w3[1] = uint_to_hex_lower8_le ((b1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 8) & 255) << 16;
w3[2] = uint_to_hex_lower8_le ((c1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 24) & 255) << 16;
w3[3] = uint_to_hex_lower8_le ((c1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 8) & 255) << 16;
sha256_update_64 (&ctx, w0, w1, w2, w3, 64); // prepend the internal sha256 hash.

In my case, it took a few attempts to get the bit shifting / byte ordering correct. Here’s what I did:

I knew, ultimately, that w0, w1, w2, and w3 needed to hold the ASCII representation of the sha256 hash, using our example peanut:

#echo -n peanut | shasum -a 256 | xxd
00000000: 3535 3039 3834 3064 3038 3733 6164 6230 5509840d0873adb0
00000010: 3430 3535 3838 3832 3131 3937 6138 3633 405588821197a863
00000020: 3435 3031 3239 3334 3836 6336 3031 6361 4501293486c601ca
00000030: 3531 6531 3430 3633 6162 6532 3564 3036 51e14063abe25d06
00000040: 2020 2d0a

The array should look like this:

w0[0] = 0x35353039;
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;

These hard-coded values passed the Hashcat self-test—progress!
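
Transcribing those sixteen constants by hand from the xxd dump is error-prone. A small helper along these lines (a convenience sketch of my own, not part of Hashcat) prints them for any 64-character inner-hash hex string, packed the way the array above expects:

/* Prints the sixteen u32 words to hard-code into w0..w3 for a given
 * 64-character lowercase-hex inner hash (ASCII characters packed in
 * big-endian word order). Convenience sketch, not Hashcat code. */
#include <stdio.h>
#include <string.h>

int main (int argc, char **argv)
{
  const char *hex = (argc > 1) ? argv[1]
    : "5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06";

  if (strlen (hex) != 64) { fprintf (stderr, "need 64 hex characters\n"); return 1; }

  for (int i = 0; i < 16; i++)
  {
    const unsigned char *c = (const unsigned char *) hex + i * 4;

    const unsigned int word = (c[0] << 24) | (c[1] << 16) | (c[2] << 8) | c[3];

    printf ("w%d[%d] = 0x%08x;\n", i / 4, i % 4, word); // e.g. w0[0] = 0x35353039;
  }

  return 0;
}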

By hard-coding this array in the loop, we can play with byte ordering and bit shifting until w0[0] holds 0x35353039:

w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16; // should now equal 0x35353039
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;
printf("\na = 0x%08x, w0[0]=0x%08x\n", a,w[0]); // tombstone.

You can iterate on the kernel by alternating rm -rf ./hashcat/kernels (so the cached build is discarded and your changes are recompiled) with your Hashcat single-hash test command. This was essential to getting the byte ordering and conversion correct.

With the right values in w0, w1, w2, and w3, we can move the _sxx changes over to _mxx. Just reapply your changes there, making sure the final comparison macro is COMPARE_M_SCALAR instead of COMPARE_S_SCALAR.

Play-by-play Inspiration

To start with, identify an OpenCL a0 kernel (or set of kernels) that closely matches your workload and input type. For example, if your hash is salted, find a salted hash of the same type, and likewise if it’s iterated.

Copy it to its own kernel file, designated as m[Hashcat Mode Flag]_a0.cl. Search/replace this source to fix up the mode function names.

Adding the mode

Next up, add the new mode to Hashcat.

SHA256_PW_SHA256_SLT was added as an enum to interface.h:

KERN_TYPE_SHA256_PWSLT = 1410,
KERN_TYPE_SHA256_PW_SHA256_SLT = 1415,
KERN_TYPE_SHA256_SLTPW = 1420,

HT_* appears to be CLI / help-related and was added to interface.c:

case 1411: return HT_01411;
case 1415: return HT_01415;
case 1420: return HT_01420;
static const char *HT_01410 = "sha256($pass.$salt)";
static const char *HT_01415 = "sha256(sha256($pass).$salt)";
static const char *HT_01420 = "sha256($salt.$pass)";

A test hash plus its corresponding plaintext are used for self-testing:

static const char *ST_PASS_HASHCAT_PEANUT = "peanut";
...
static const char *ST_HASH_01415 = "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed
0944a8bd19890a72764a9e169668d4c602fc6f1199eea449";
...
case 1411: hashconfig->hash_type = HASH_TYPE_SHA256;
...

And in the main hash configuration block, the values were lifted from 1410, sha256($plain.$salt).

An important bit here is to set kern_type properly. sha256s_parse_hash parses the hash list. If your hash’s format doesn’t match exactly, it will be rejected. Hashcat has a robust set of these parsing functions, but you’ll need to hunt through them. opti_type doesn’t seem to do anything.

break;
case 1415: hashconfig->hash_type = HASH_TYPE_SHA256;
hashconfig->salt_type = SALT_TYPE_GENERIC;
hashconfig->attack_exec = ATTACK_EXEC_INSIDE_KERNEL;
hashconfig->opts_type = OPTS_TYPE_PT_GENERATE_BE
| OPTS_TYPE_ST_ADD80
| OPTS_TYPE_ST_ADDBITS15;
hashconfig->kern_type = KERN_TYPE_SHA256_PW_SHA256_SLT;
hashconfig->dgst_size = DGST_SIZE_4_8;
hashconfig->parse_func = sha256s_parse_hash;
hashconfig->opti_type = OPTI_TYPE_ZERO_BYTE
| OPTI_TYPE_PRECOMPUTE_INIT
| OPTI_TYPE_PRECOMPUTE_MERKLE
| OPTI_TYPE_EARLY_SKIP
| OPTI_TYPE_NOT_ITERATED
| OPTI_TYPE_APPENDED_SALT
| OPTI_TYPE_RAW_HASH;
hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;
hashconfig->st_hash = ST_HASH_01415;
hashconfig->st_pass = ST_PASS_HASHCAT_PEANUT;
break;

Quick and dirty test

After a make clean && make, Hashcat should list your new type. Benchmark mode won’t be useful for testing, as the a0 kernels aren’t used here. Running a simple single-hash test against the new kernel will cause Hashcat to build it from the OpenCL directory to the kernel directory. This is when you’ll receive compile-time errors. There’s probably a way to directly invoke the OpenCL build toolchain to speed this up.

In this case, I had some known hash/salt/plaintext values (recoverable from JtR and publicly listed). These will be absolutely essential. It also helps to have the raw results of intermediate steps in the scheme for debugging.

I’ll use this example for testing:

00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf789
9fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449:peanut

Steps from the command line:

#echo -n peanut | shasum -a 256
5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06 -
#echo -n 5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06b4d93efdf7
899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449 | shasum -a 256
00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa -

Hashcat test:

./hashcat -m 1415 -a 0 "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0
944a8bd19890a72764a9e169668d4c602fc6f1199eea449" peanut.txt

hashcat (v4.0.1-123-g2095e27d+) starting...
OpenCL Platform #1: Apple

=========================

* Device #1: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz, skipped.
* Device #2: Intel(R) HD Graphics 530, 384/1536 MB allocatable, 24MCU
* Device #3: AMD Radeon Pro 460 Compute Engine, 1024/4096 MB allocatable, 16MCU

...
Watchdog: Temperature terminate trigger disabled.
* Device #2: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver
* Device #3: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver

Let’s get these self-tests to pass with this single hash.

Kicking out the jams

The majority of the work on an a0 kernel will be focused on these two functions:

__kernel void m01415_mxx
This function is invoked to handle cracking multiple target hashes at once. We can ignore this during testing.

__kernel void m01415_sxx
This function is invoked in single-hash mode (which is how we’re testing). We can focus our development effort here.

Let’s take a walk through the donor kernel’s __kernel void m01410_sxx function:

/**
* digest
*/
const u32 search[4] =
{
digests_buf[digests_offset].digest_buf[DGST_R0],
digests_buf[digests_offset].digest_buf[DGST_R1],
digests_buf[digests_offset].digest_buf[DGST_R2],
digests_buf[digests_offset].digest_buf[DGST_R3]
};

In this kernel, Hashcat doesn’t bother comparing all the bytes in the hash buffer—a basic optimization technique. It’s unlikely to collide with a human-meaningful candidate as selected by Hashcat, even when doing partial matches. The bytes are configured in the hashconfig struct:

hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;

It’s recommended that you use values from another kernel that has the same final hash round as your target.

/**
* base
*/
COPY_PW (pws[gid]);
const u32 salt_len = salt_bufs[salt_pos].salt_len;
u32 s[64] = { 0 };
for (int i = 0, idx = 0; i < salt_len; i += 4, idx += 1)
{
s[idx] = swap32_S (salt_bufs[salt_pos].salt_buf[idx]);
}

The kernel prepares the salt buffer once and carries out operations usually performed by the Hashcat sha256 library’s sha256_update_swap.

Now is a good time to browse through the methods in inc_hash_sha256.cl. These aren’t documented in any way, and while they resemble other low-level C crypto implementations, some things have been changed to permit more efficient cracking. It helps to select methods out of this library, and those in its hash family (e.g., sha512, sha1), to understand how these functions can be used in various contexts.

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx;
sha256_init (&ctx);
sha256_update_swap (&ctx, tmp.i, tmp.pw_len);
sha256_update (&ctx, s, salt_len);
sha256_final (&ctx);
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

The function’s main loop permutes the password root candidate tmp.i according to rules identified in the rules_buf array. It then establishes a sha256 context, updates the context with the password candidate buffer, updates it (minus the swap already performed) with the salt, and then finalizes the hash. It then compares the bytes outlined in hashconfig for a match.

Because our source function is sha256($plain.$salt) and our target is sha256(sha256($plain).$salt), we know we need to add a round of sha256 to the plaintext before updating the final sha256 context, ctx.

What are some other modes that might carry out similar operations?

  • 40 | md5($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 3710 | md5($salt.md5($pass)) | raw hash, salted, and/or iterated
  • 140 | sha1($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 4520 | sha1($salt.sha1($pass)) | raw hash, salted, and/or iterated
  • 1440 | sha256($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 1740 | sha512($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 30 | md5(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 130 | sha1(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1430 | sha256(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1730 | sha512(utf16le($pass).$salt) | raw hash, salted, and/or iterated

Some nice candidates include m02610_a0.cl, m04700_a0.cl, m04520_a0.cl, m04500_a0.cl, and m04400_a0.cl. This builds up more examples of how the Hashcat devs used the libraries and OpenCL environment in an efficient manner, and gives you some things to consume and understand.

It’s worth mentioning that Hashcat appears to follow a naming convention for these: xxxyy, where xxx denotes the family and yy is either 00 for unsalted, 10 for a post-pended salt, or 20 for a pre-pended salt. There are plenty of counterexamples, but this is the general scheme.

Let’s take a look at an example of a salted hash, with an extra round of hash on the plaintext. Thankfully, we have one in the same hash family: 4520, sha1($salt.sha1($pass)).

Here’s the core routine:

sha1_ctx_t ctx1;
sha1_init (&ctx1);
sha1_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha1_final (&ctx1);
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
sha1_ctx_t ctx = ctx0;
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha1_update_64 (&ctx, w0, w1, w2, w3, 40);
sha1_final (&ctx);

So we can see that this routine uses two sha1 contexts: ctx1 to contain sha1($pass) and ctx to contain sha1($salt.sha1($pass)). Additionally, this kernel introduces a new macro, uint_to_hex_lower8_le, and adds some logic to the _sxx function:

const u64 lsz = get_local_size (0);
/**
* bin2asc table
*/
__local u32 l_bin2asc[256];
for (u32 i = lid; i < 256; i += lsz)
{
const u32 i0 = (i >> 0) & 15;
const u32 i1 = (i >> 4) & 15;
l_bin2asc[i] = ((i0 < 10) ? '0' + i0 : 'a' - 10 + i0) << 0
| ((i1 < 10) ? '0' + i1 : 'a' - 10 + i1) << 8;
}

uint_to_hex_lower8_le looks up a byte (4 bits) in l_bin2asc, which returns a lowercase ASCII representation in 2 bytes (8 bits). We rely on this to convert the raw hex output of the first sha1 round to the lowercase ASCII representation that is usually composed in these salted routines. In this case, the routine assumes little-endian byte ordering, which is appropriate for the SHA family but can differ; for example, the MD5 Hashcat libs are big-endian.

The next step is understanding how to use this construct for sha256. We can find that sha1 has an output of 40 bytes, while sha256 outputs 64 bytes. Let’s set this up in our new kernel. First, add the uint_to_hex_lower8_le macro to the beginning of the file. Then add the routine to build the l_bin2asc array. Next, let’s change the core crack routine:

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx; //containing the final context
sha256_init (&ctx);
sha256_ctx_t ctx1; //containing the sha256($pass) context
sha256_init (&ctx1);
sha256_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha256_final (&ctx1); //finalize the sha256 rounds.
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha256_update_64 (&ctx, w0, w1, w2, w3, 40); // prepend the internal sha256 hash.
sha256_update (&ctx, s, salt_len); // add the salt
sha256_final (&ctx); // finalize
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

Next up, we need to address the extra bytes available in sha256’s output:

const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
const u32 a1 = ctx1.h[5];
const u32 b1 = ctx1.h[6];
const u32 c1 = ctx1.h[7];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = uint_to_hex_lower8_le ((a1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 24) & 255) << 16;
w2[3] = uint_to_hex_lower8_le ((a1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 8) & 255) << 16;
w3[0] = uint_to_hex_lower8_le ((b1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 24) & 255) << 16;
w3[1] = uint_to_hex_lower8_le ((b1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 8) & 255) << 16;
w3[2] = uint_to_hex_lower8_le ((c1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 24) & 255) << 16;
w3[3] = uint_to_hex_lower8_le ((c1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 8) & 255) << 16;
sha256_update_64 (&ctx, w0, w1, w2, w3, 64); // prepend the internal sha256 hash.

In my case, it took a few attempts to get the bit shifting / byte ordering correct. Here’s what I did:

I knew, ultimately, that w0, w1, w2, and w3 needed to hold the ASCII representation of the sha256 hash, using our example peanut:

#echo -n peanut | shasum -a 256 | xxd
00000000: 3535 3039 3834 3064 3038 3733 6164 6230 5509840d0873adb0
00000010: 3430 3535 3838 3832 3131 3937 6138 3633 405588821197a863
00000020: 3435 3031 3239 3334 3836 6336 3031 6361 4501293486c601ca
00000030: 3531 6531 3430 3633 6162 6532 3564 3036 51e14063abe25d06
00000040: 2020 2d0a

The array should look like this:

w0[0] = 0x35353039;
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;

These hard-coded values passed the Hashcat self-test—progress!

By hard-coding this array in the loop, we can play with byte ordering and bit shifting until w0[0] holds 0x35353039:

w0[1] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16; //0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;
printf("\na = 0x%08x, w0[0]=0x%08x\n", a,w[0]); // tombstone.

You can debug the kernel by alternating rm -rf ./hashcat/kernels with your Hashcat single-hash test command. This was essential to getting the byte ordering and conversion correct.

With the right values in w0, w1, w2, and w3, we can move the _sxx changes over to _mxx. Just reapply your changes there while making sure the final comparison function is COMPARE_M instead of COMPARE_S.

Adding the mode

Next up, add the new mode to Hashcat.

SHA256_PW_SHA256_SLT was added as an enum to interface.h:

KERN_TYPE_SHA256_PWSLT = 1410,
KERN_TYPE_SHA256_PW_SHA256_SLT = 1415,
KERN_TYPE_SHA256_SLTPW = 1420,

HT_* appears to be CLI / help-related and was added to interface.c:

case 1411: return HT_01411;
case 1415: return HT_01415;
case 1420: return HT_01420;
static const char *HT_01410 = "sha256($pass.$salt)";
static const char *HT_01415 = "sha256(sha256($pass).$salt)";
static const char *HT_01420 = "sha256($salt.$pass)";

A test hash plus its corresponding plaintext are used for self-testing:

static const char *ST_PASS_HASHCAT_PEANUT = "peanut";
...
static const char *ST_HASH_01415 = "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed
0944a8bd19890a72764a9e169668d4c602fc6f1199eea449";
...
case 1411: hashconfig->hash_type = HASH_TYPE_SHA256;
...

And in the main hash configuration block, the values were lifted from 1410, sha256($plain.$salt).

An important bit here is to set kern_type properly. sha256s_parse_hash parses the hash list. If your hash’s format doesn’t match exactly, it will be rejected. Hashcat has a robust set of these parsing functions, but you’ll need to hunt through them. opti_type doesn’t seem to do anything.

break;
case 1415: hashconfig->hash_type = HASH_TYPE_SHA256;
hashconfig->salt_type = SALT_TYPE_GENERIC;
hashconfig->attack_exec = ATTACK_EXEC_INSIDE_KERNEL;
hashconfig->opts_type = OPTS_TYPE_PT_GENERATE_BE
| OPTS_TYPE_ST_ADD80
| OPTS_TYPE_ST_ADDBITS15;
hashconfig->kern_type = KERN_TYPE_SHA256_PW_SHA256_SLT;
hashconfig->dgst_size = DGST_SIZE_4_8;
hashconfig->parse_func = sha256s_parse_hash;
hashconfig->opti_type = OPTI_TYPE_ZERO_BYTE
| OPTI_TYPE_PRECOMPUTE_INIT
| OPTI_TYPE_PRECOMPUTE_MERKLE
| OPTI_TYPE_EARLY_SKIP
| OPTI_TYPE_NOT_ITERATED
| OPTI_TYPE_APPENDED_SALT
| OPTI_TYPE_RAW_HASH;
hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;
hashconfig->st_hash = ST_HASH_01415;
hashconfig->st_pass = ST_PASS_HASHCAT_PEANUT;
break;

Quick and dirty test

After a make clean && make, Hashcat should list your new type. Benchmark mode won’t be useful for testing, as the a0 kernels aren’t used here. Running a simple single-hash test against the new kernel will cause Hashcat to build it from the OpenCL directory to the kernel directory. This is when you’ll receive compile-time errors. There’s probably a way to directly invoke the OpenCL build toolchain to speed this up.

In this case, I had some known hash/salt/plaintext values (recoverable from JtR and publicly listed). These will be absolutely essential. It also helps to have the raw results of intermediate steps in the scheme for debugging.

I’ll use this example for testing:

00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf789
9fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449:peanut

Steps from the command line:

#echo -n peanut | shasum -a 256
5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06 -
#echo -n 5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06b4d93efdf7
899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449 | shasum -a 256
00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa -

Hashcat test:

./hashcat -m 1415 -a 0 "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0
944a8bd19890a72764a9e169668d4c602fc6f1199eea449" peanut.txt

hashcat (v4.0.1-123-g2095e27d+) starting...
OpenCL Platform #1: Apple

=========================

* Device #1: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz, skipped.
* Device #2: Intel(R) HD Graphics 530, 384/1536 MB allocatable, 24MCU
* Device #3: AMD Radeon Pro 460 Compute Engine, 1024/4096 MB allocatable, 16MCU

...
Watchdog: Temperature terminate trigger disabled.
* Device #2: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver
* Device #3: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver

Let’s get these self-tests to pass with this single hash.

Kicking out the jams

The majority of the work on an a0 kernel will be focused on these two functions:

__kernel void m01415_mxx
This function is invoked to handle cracking multiple target hashes at once. We can ignore this during testing.

__kernel void m01415_sxx
This function is invoked in single-hash mode (which is how we’re testing). We can focus our development effort here.

Let’s take a walk through the donor kernel’s __kernel void m01410_sxx function:

/**
* digest
*/
const u32 search[4] =
{
digests_buf[digests_offset].digest_buf[DGST_R0],
digests_buf[digests_offset].digest_buf[DGST_R1],
digests_buf[digests_offset].digest_buf[DGST_R2],
digests_buf[digests_offset].digest_buf[DGST_R3]
};

In this kernel, Hashcat doesn’t bother comparing all the bytes in the hash buffer—a basic optimization technique. It’s unlikely to collide with a human-meaningful candidate as selected by Hashcat, even when doing partial matches. The bytes are configured in the hashconfig struct:

hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;

It’s recommended that you use values from another kernel that has the same final hash round as your target.

/**
* base
*/
COPY_PW (pws[gid]);
const u32 salt_len = salt_bufs[salt_pos].salt_len;
u32 s[64] = { 0 };
for (int i = 0, idx = 0; i < salt_len; i += 4, idx += 1)
{
s[idx] = swap32_S (salt_bufs[salt_pos].salt_buf[idx]);
}

The kernel prepares the salt buffer once and carries out operations usually performed by the Hashcat sha256 library’s sha256_update_swap.

Now is a good time to browse through the methods in inc_hash_sha256.cl. These aren’t documented in any way, and while they resemble other low-level C crypto implementations, some things have been changed to permit more efficient cracking. It helps to select methods out of this library, and those in its hash family (e.g., sha512, sha1), to understand how these functions can be used in various contexts.

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx;
sha256_init (&ctx);
sha256_update_swap (&ctx, tmp.i, tmp.pw_len);
sha256_update (&ctx, s, salt_len);
sha256_final (&ctx);
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

The function’s main loop permutes the password root candidate tmp.i according to rules identified in the rules_buf array. It then establishes a sha256 context, updates the context with the password candidate buffer, updates it (minus the swap already performed) with the salt, and then finalizes the hash. It then compares the bytes outlined in hashconfig for a match.

Because our source function is sha256($plain.$salt) and our target is sha256(sha256($plain).$salt), we know we need to add a round of sha256 to the plaintext before updating the final sha256 context, ctx.

What are some other modes that might carry out similar operations?

  • 40 | md5($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 3710 | md5($salt.md5($pass)) | raw hash, salted, and/or iterated
  • 140 | sha1($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 4520 | sha1($salt.sha1($pass)) | raw hash, salted, and/or iterated
  • 1440 | sha256($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 1740 | sha512($salt.utf16le($pass)) | raw hash, salted, and/or iterated
  • 30 | md5(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 130 | sha1(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1430 | sha256(utf16le($pass).$salt) | raw hash, salted, and/or iterated
  • 1730 | sha512(utf16le($pass).$salt) | raw hash, salted, and/or iterated

Some nice candidates include m02610_a0.cl, m04700_a0.cl, m04520_a0.cl, m04500_a0.cl, and m04400_a0.cl. This builds up more examples of how the Hashcat devs used the libraries and OpenCL environment in an efficient manner, and gives you some things to consume and understand.

It’s worth mentioning that Hashcat appears to follow a naming convention for these: xxxyy, where xxx denotes the family and yy is either 00 for unsalted, 10 for a post-pended salt, or 20 for a pre-pended salt. There are plenty of counterexamples, but this is the general scheme.

Let’s take a look at an example of a salted hash, with an extra round of hash on the plaintext. Thankfully, we have one in the same hash family: 4520, sha1($salt.sha1($pass)).

Here’s the core routine:

sha1_ctx_t ctx1;
sha1_init (&ctx1);
sha1_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha1_final (&ctx1);
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
sha1_ctx_t ctx = ctx0;
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha1_update_64 (&ctx, w0, w1, w2, w3, 40);
sha1_final (&ctx);

So we can see that this routine uses two sha1 contexts: ctx1 to contain sha1($pass) and ctx to contain sha1($salt.sha1($pass)). Additionally, this kernel introduces a new macro, uint_to_hex_lower8_le, and adds some logic to the _sxx function:

const u64 lsz = get_local_size (0);
/**
* bin2asc table
*/
__local u32 l_bin2asc[256];
for (u32 i = lid; i < 256; i += lsz)
{
const u32 i0 = (i >> 0) & 15;
const u32 i1 = (i >> 4) & 15;
l_bin2asc[i] = ((i0 < 10) ? '0' + i0 : 'a' - 10 + i0) << 0
| ((i1 < 10) ? '0' + i1 : 'a' - 10 + i1) << 8;
}

uint_to_hex_lower8_le looks up a byte (4 bits) in l_bin2asc, which returns a lowercase ASCII representation in 2 bytes (8 bits). We rely on this to convert the raw hex output of the first sha1 round to the lowercase ASCII representation that is usually composed in these salted routines. In this case, the routine assumes little-endian byte ordering, which is appropriate for the SHA family but can differ; for example, the MD5 Hashcat libs are big-endian.

The next step is understanding how to use this construct for sha256. We can find that sha1 has an output of 40 bytes, while sha256 outputs 64 bytes. Let’s set this up in our new kernel. First, add the uint_to_hex_lower8_le macro to the beginning of the file. Then add the routine to build the l_bin2asc array. Next, let’s change the core crack routine:

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
pw_t tmp = PASTE_PW;
tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
sha256_ctx_t ctx; //containing the final context
sha256_init (&ctx);
sha256_ctx_t ctx1; //containing the sha256($pass) context
sha256_init (&ctx1);
sha256_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha256_final (&ctx1); //finalize the sha256 rounds.
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha256_update_64 (&ctx, w0, w1, w2, w3, 40); // prepend the internal sha256 hash.
sha256_update (&ctx, s, salt_len); // add the salt
sha256_final (&ctx); // finalize
const u32 r0 = ctx.h[DGST_R0];
const u32 r1 = ctx.h[DGST_R1];
const u32 r2 = ctx.h[DGST_R2];
const u32 r3 = ctx.h[DGST_R3];
COMPARE_S_SCALAR (r0, r1, r2, r3);
}

Next up, we need to address the extra bytes available in sha256’s output:

const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
const u32 a1 = ctx1.h[5];
const u32 b1 = ctx1.h[6];
const u32 c1 = ctx1.h[7];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = uint_to_hex_lower8_le ((a1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 24) & 255) << 16;
w2[3] = uint_to_hex_lower8_le ((a1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((a1 >> 8) & 255) << 16;
w3[0] = uint_to_hex_lower8_le ((b1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 24) & 255) << 16;
w3[1] = uint_to_hex_lower8_le ((b1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((b1 >> 8) & 255) << 16;
w3[2] = uint_to_hex_lower8_le ((c1 >> 16) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 24) & 255) << 16;
w3[3] = uint_to_hex_lower8_le ((c1 >> 0) & 255) << 0
| uint_to_hex_lower8_le ((c1 >> 8) & 255) << 16;
sha256_update_64 (&ctx, w0, w1, w2, w3, 64); // prepend the internal sha256 hash.

In my case, it took a few attempts to get the bit shifting / byte ordering correct. Here’s what I did:

I knew, ultimately, that w0, w1, w2, and w3 needed to hold the ASCII representation of the sha256 hash, using our example peanut:

#echo -n peanut | shasum -a 256 | xxd
00000000: 3535 3039 3834 3064 3038 3733 6164 6230 5509840d0873adb0
00000010: 3430 3535 3838 3832 3131 3937 6138 3633 405588821197a863
00000020: 3435 3031 3239 3334 3836 6336 3031 6361 4501293486c601ca
00000030: 3531 6531 3430 3633 6162 6532 3564 3036 51e14063abe25d06
00000040: 2020 2d0a

The array should look like this:

w0[0] = 0x35353039;
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;

These hard-coded values passed the Hashcat self-test—progress!

By hard-coding this array in the loop, we can play with byte ordering and bit shifting until w0[0] holds 0x35353039:

w0[1] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
| uint_to_hex_lower8_le ((a >> 24) & 255) << 16; //0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;
printf("\na = 0x%08x, w0[0]=0x%08x\n", a,w[0]); // tombstone.

You can debug the kernel by alternating rm -rf ./hashcat/kernels with your Hashcat single-hash test command. This was essential to getting the byte ordering and conversion correct.

With the right values in w0, w1, w2, and w3, we can move the _sxx changes over to _mxx. Just reapply your changes there while making sure the final comparison function is COMPARE_M instead of COMPARE_S.

Results

(Screenshot: the Hashcat interface during the cracking run.)

On the Synopsys 8 GPU cracking rig, we can hit 605 MH/s with this unoptimized kernel—7 times faster than CPU mode.

We’ll need to wait for another rainy day (and a corpus with a shorter salt) to crack out an optimized kernel.

You can find our final implementation at this GitHub Gist.
