Software Integrity Blog

 

How I sped up one of our test suites by 17.5x by explicitly defining Elixir protocol implementations

Fallback protocols could be causing slower tests when running with code coverage. Defining protocol implementations can improve the speed.

Hi, I’m Philip Ross, a developer working on Tinfoil API Scanner at Synopsys Software Integrity Group. A few months ago, I sped up one of our test suites from about 6 minutes to around 20 seconds.

I was working on development improvements for our umbrella project and I noticed that one of the child application’s (Parser) tests took much longer (approximately 6 minutes) when I ran mix test in the umbrella root than when I ran mix test in the root of Parser (approximately 16 seconds). Why would the tests be so much slower as part of the umbrella? I tried running just the Parser tests from the umbrella root with mix cmd –app parser mix test and it took  approximately 16 seconds.

Identifying bottlenecks

I tried profiling the test that was slow and got this back:

FUNCTION                                                                     CALLS        %      TIME  [uS / CALLS]
--------                                                                     -----  -------      ----  [----------]
'Elixir.Enum':filter_list/2                                                2330877     5.24    733215  [      0.31]
maps:next/1                                                                2425581     5.74    802036  [      0.33]
maps:fold_1/3                                                              2425581     7.35   1028416  [      0.42]
'Elixir.Parser.Parsers.OpenAPI3.Stage.Validate':'-do_validate/2-fun-0-'/2  2161922     9.35   1307672  [      0.60]
erlang:raise/3                                                               77960    13.63   1906506  [     24.45]
------------------------------------------------------------------------  --------  -------  --------  [----------]
Total:                                                                    42120680  100.00%  13984253  [      0.33]

Figure 1. Truncated output of the profiler

I saw that erlang:raise/3 was taking up the most time. There were no exceptions that I could see in this test. The rest of the profile seemed reasonable, so I generated a flame graph.

Figure 2. Flame graph of the test

The sleep taking up half of the flame graph seemed suspicious, but it’s not actually part of the test stack trace, so it could be ignored. Other than that, the huge number of list operations was the main thing that stood out as a bottleneck. Since these are to be expected, I determined that the problem must be hiding elsewhere. Next, I tried cloning a new copy of the repo and removing all other child applications, and then removing configurations that were in the umbrella root. Eventually, I found that when I removed consolidate_protocols: Mix.env() != :test,, the Parser tests would be fast.


Figure 3. Enabling consolidate_protocols in test

Enabling protocol consolidation

I read up on protocol consolidation and it was mentioned that disabling protocol consolidation may slow down tests. I disabled protocol consolidation in the test environment for Parser and ran just the Parser tests again. Bingo, the Parser tests took approximately 2 minutes and 10 seconds. I figured that not consolidating the protocols was the cause of the test slowness and started preparing a merge request that enabled protocol consolidation in the test environment for all of the subprojects and in the umbrella root. I ran mix test again in the umbrella root, and it took approximately 15 seconds to run the Parser tests. Great—problem solved, right?

Wrong. The Parser tests in the CI pipeline still took approximately 6 minutes to run. But I made the merge request and called it good enough that running mix test in the umbrella root no longer took approximately 8 minutes to run.

Addressing Parser tests in CI pipeline

Two weeks later our CI system became overloaded. Our tests were slowing down at a rate that was causing them to time out, which resulted in failing builds. I could have increased the timeout, but then I would need to merge that, and rebase all the other merge requests that were in progress. But that still wouldn’t fix the root problem; it would only push it back.

I decided that I could now spend some more time digging into why the Parser tests were so slow in the CI pipeline, since it was making it impossible for us to merge branches (without just ignoring the test build failure). I suspected there was something wrong with the CI runner, but I didn’t have direct access to verify.

After some effort, I was able to locally reproduce the slow tests by running them with code coverage like in the CI. It took approximately 6 minutes to run the Parser tests with code coverage. That’s good; I didn’t need to run everything through Docker anymore to test if it was fixed. I googled around, I stared at the stack trace of where it timed out, I googled around, I stared at the stack trace.

stacktrace:
  (kernel 7.1) code_server.erl:139: :code_server.call/1
  (kernel 7.1) error_handler.erl:41: :error_handler.undefined_function/3
  (parser 0.1.0) lib/util/validate.ex:23: Parser.Validate.impl_for/1
  (parser 0.1.0) lib/util/validate.ex:33: Parser.Validate.validate/1
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:43: 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (stdlib 3.13.2) maps.erl:233: :maps.fold_1/3
  (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (stdlib 3.13.2) maps.erl:233: :maps.fold_1/3
  (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (stdlib 3.13.2) maps.erl:233: :maps.fold_1/3
  (elixir 1.11.2) lib/enum.ex:2181: Enum."-reduce/3-lists^foldl/2-0-"/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (stdlib 3.13.2) maps.erl:233: :maps.fold_1/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2
  (stdlib 3.13.2) maps.erl:233: :maps.fold_1/3
  (parser 0.1.0) lib/parsers/openapi3/stage/stage_validate.ex:56: anonymous fn/2 in 
Parser.Parsers.OpenAPI3.Stage.Validate.do_validate/2

Figure 4. Stack trace of the CI failure

The lightbulb moment

The last bit about :error_handler.undefined_function/3 and Parser.Validate.validate/1 got me thinking about when I profiled the code, and how it was spending the most time in erlang:raise/3. I did some more googling and eventually came across a discussion on GitHub. Most of the discussion involved consolidating protocols, which I had already done, but it got me thinking that perhaps @fallback_to_any was the cause of the slowness.  I started looking into how protocols are implemented in Elixir. I looked through the source code of Elixir and eventually came across this snippet.

Kernel.defp struct_impl_for(struct) do
  target = Module.concat(__MODULE__, struct)

  try do
    target.__impl__(:target)
  rescue
    UndefinedFunctionError ->
      unquote(any_impl_for)
  end
end

Figure 5. Snippet from https://github.com/elixir-lang/elixir/blob/1b13706fefe2caec9093c84ca01164756617a373/lib/elixir/lib/protocol.ex#L775-L784

It looked like Elixir tried to call the protocol function; if it doesn’t have one defined, it errors, catches that error, and then runs the fallback implementation. That could explain why there was so much time spent in erlang:raise/3. We were calling the validate function a lot, and most of the time it would use the fallback implementation. It seems like this could be the problem.

Defining Elixir protocol implementations

I coded some debugging hooks that would track how often each protocol implementation was called.

defmodule TypeCounter do
  use Agent

  def start_link() do
    Agent.start(fn -> %{} end, name: __MODULE__)
  end

  def value do
    Agent.get(__MODULE__, fn x ->
      x
      |> Map.to_list()
      |> List.keysort(1) # sort by total number of hits
      |> Enum.reverse()
    end)
  end

  def increment(key) do
    Agent.update(__MODULE__, fn state ->
      Map.update(state, key, @start_value, fn existing_value -> existing_value + 1 end)
    end)
  end
end

defprotocol Parser.Validate do
  @fallback_to_any true

  @spec validate(map()) :: :ok | {:error, [binary()]}
  def validate(item)
end

Figure 6. Code to track how many times, per type, the protocol falls back to Any

I ran the tests.

[
  {"Parser.Specification.OpenAPI3.Schema", "161,248"},
  {"Parser.Specification.OpenAPI2.Parameter", "3,603"},
  {"Parser.Specification.OpenAPI2.Response", "2,562"},
  {"Parser.Specification.OpenAPI3.Parameter", "2,034"},
  {"Parser.Specification.OpenAPI3.MediaType", "2,030"},
  {"Parser.Specification.OpenAPI3.Response", "1,793"},
  {"Parser.Specification.OpenAPI3.Operation", "924"},
  {"Parser.Specification.OpenAPI2.Operation", "876"},
  {"Parser.Specification.CircularRefPtr", "583"},
  {"Parser.Specification.OpenAPI2.Path", "567"},
  {"Parser.Specification.OpenAPI3.Path", "560"},
  {"Parser.Specification.OpenAPI2.Extension", "301"},
  {"Parser.Specification.OpenAPI3.RequestBody", "229"},
  {"Parser.Specification.OpenAPI3.Extension", "11"},
  {"Parser.Specification.OpenAPI3.Callback", "3"}
]

Figure 7. Number of types called for validate

That’s a lot of schema calls. I defined a validate implementation for OpenAPI3.Schema. It was the same as the fallback protocol; the only difference was that it didn’t have to go through the fallback process.

defmodule Parser.Specification.OpenAPI3.Schema do
  use TypedStruct
  use Parser.Specification.Builder

  alias __MODULE__

  typedstruct do
    field(:name, String.t(), enforce: true, default: nil)
  end

  builder()

  defimpl Parser.Validate, for: __MODULE__ do
    @spec validate(Schema.t()) :: :ok
    def validate(_) do
      :ok
    end
  end
end

Figure 8. Defining a validate implementation

I ran the tests. It finished the Parser tests in approximately 36 seconds. It worked; that fixed our test slowness.

I defined more validation implementations for the other modules. I don’t like hardcoding all the default implementations in the validate file since we override it in the specific module file. There were also other modules that might have validate called on them, since this is only what gets called in our tests. I made a macro to generate the validate method, which resulted in just one line in each module file to generate the default implementation.

defmodule Parser.Macros do
  defmacro __using__(_opts) do
    quote do
      import Parser.Macros
    end
  end

  defmacro empty_validate(mod) do
    quote do
      defimpl Parser.Validate, for: unquote(mod) do
        @spec validate(unquote(__CALLER__.module).t()) :: :ok
        def validate(%unquote(__CALLER__.module){}), do: :ok
      end
    end
  end
end

defprotocol Parser.Validate do
  @fallback_to_any true

  @spec validate(map()) :: :ok | {:error, [binary()]}
  def validate(item)
end

defimpl Parser.Validate, for: Any do
  def validate(_) do
    :ok
  end
end

Figure 9. Macro results

I ran the tests again. It took approximately 20 seconds to run the Parser tests. I could live with it being 4 seconds slower with code coverage. I declared victory and made my merge request, giving a summary of my work, as well as the new test times so people could see how huge of an improvement this was.

Coming up with a theory for what happened

The merge request was completed and my coworkers rebased their branches and we all agreed how bizarre this was, since this was only a problem when run with code coverage. My theory is that when the project is compiled with code coverage, for some reason it doesn’t consolidate the protocols. I made a new repository to test this theory (https://github.com/baseballlover723/cover_protocol). I ran the tests with MIX_ENV=test mix coveralls.json –trace.

SlowProtocolTest
  * test Elixir.SlowProtocol (1670.2ms)
FastProtocolTest
  * test Elixir.FastProtocol (2131.6ms)

Figure 10. Test times of my reproduction repo

Not only was there no significant difference in test times, but using explicit protocols was actually slower than using the fallback. That doesn’t make any sense. It solved our problem though, since it got significantly faster after not using the protocol fallback. My new theory is that compiling with code coverage adds some level of complexity such that without code coverage, Elixir can inline/consolidate the protocols, but with code coverage, it can’t and thus is slow. A few weeks later, we extracted out the Parser app into its own repo (for independent reasons), and I tried to repoduce the original issue, but was unable to. This makes me think that the fact that it was in an umbrella project was also a contributing factor for the test slowness. But since our initial problem was fixed, there was no need to figure out exactly how this happened anymore. I’ll leave it as an exercise for the reader.

Final thoughts

Some takeaways from this experience:

  • Don’t spend too much time fixing something that’s not a problem yet.
  • Find a partial solution. It’s better than no solution.
  • Stripping away more and more parts of a project can help you identify where the root cause of a problem is located.
  • Use a profiler and/or a flame graph to identify significant bottlenecks.
  • Pay close attention to stack traces when something times out since they are likely to be in the slow section of code.

During the process of writing this blog post, one of my coworkers discovered that the macro that I had written was essentially doing something that was already built into Elixir, protocol deriving (https://hexdocs.pm/elixir/Protocol.html#derive/3). Derived protocols are normally used for more dynamic default protocol implementations, where you may want to pass arguments to the default protocol implementation, or generate different implementations based on the type. This lets us define the default protocol implementation by calling @derive Parser.Validate in each module file instead of empty_validate(__MODULE__). When @derive is called with 1 parameter, the protocol name, it looks for a __deriving__/3 callback in the Any protocol implementation. If a parameter is not defined, like in our case, then it defaults to using the Any implementation of the protocol. See https://github.com/elixir-lang/elixir/blob/01c4ccff77fb09ef0559cde39820ab3325fc572c/lib/elixir/lib/kernel.ex#L4623-L4628 and https://github.com/elixir-lang/elixir/blob/v1.11.3/lib/elixir/lib/protocol.ex#L883-L915 for more information. This is perfect for our use case, since like the macro, it defines the default protocol implementations at compile time, still runs the Parser tests with code coverage in ~20 seconds and additionally, requires less boilerplate code.

Here’s what our final version of the code looks like now.

defprotocol Parser.Validate do
  @fallback_to_any true

  @spec validate(map()) :: :ok | {:error, [binary()]}
  def validate(item)
end

defimpl Parser.Validate, for: Any do
  def validate(_) do
    :ok
  end
end

defmodule Parser.Specification.OpenAPI3.Schema do
  use TypedStruct
  use Parser.Specification.Builder
  @derive Parser.Validate

  typedstruct do
    field :name, String.t(), enforce: true, default: nil
  end

  builder()
end


After this blog was posted, José Valim (creator of Elixir), found and fixed this bug here. I verified that this fix would have eliminated our test slowness after we enabled protocol consolidation in test. This fix will be in Elixir 1.12.0+.

Learn how we build Tinfoil API Scanner