Utf8JsonReader with Stream Input
The Utf8JsonReader struct
was introduced years ago with .NET Core as a built-in way to read
JSON documents one token at a time, as opposed to using external libraries or
reading entire documents into memory at once. It was designed as a forward-only
API that outperforms similar readers such as the one in the Newtonsoft.Json
library. That performance, however, comes at the cost of convenience and
ease of use, at least in some situations.
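To ground the discussion, here is a minimal sketch of that token-by-token API over a document that already sits entirely in memory (no streaming involved yet):

```csharp
using System;
using System.Text;
using System.Text.Json;

// Minimal illustration of the forward-only, token-by-token API when the whole
// document is already in memory.
byte[] json = Encoding.UTF8.GetBytes("{\"name\":\"example\",\"values\":[1,2,3]}");
var reader = new Utf8JsonReader(json);

while (reader.Read())
{
    Console.WriteLine($"{reader.TokenType} at depth {reader.CurrentDepth}");
}
```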
Such situations include input byte streams that cannot always be read entirely into memory, for example because their length is unknown or because they simply won’t fit. In situations where you can in good conscience read all input into memory, don’t bother applying the approach discussed here; it will just make your life harder for no good reason. Of course, feel free to satisfy your curiosity regardless.
Note: You can find the example code on GitHub.
Continuous reading from a byte stream
Microsoft has documented reading from a stream using the Utf8JsonReader in its
public documentation (see Read from a stream using Utf8JsonReader). The example given
on that page, however, is very basic and suffers from a few issues. Granted, these
may not always cause noticeable problems, so the example can often be good enough,
but I’m going to address them here anyway. For example, it is not necessary to grow
the buffer by doubling the size of the array, much like it is not necessary to copy
the unconsumed remainder of a buffer to a different location in memory.
At this point, it’s important to note that Utf8JsonReader can easily read across
multiple disjointed blocks of memory using .NET’s built-in
ReadOnlySequence<byte> struct,
and we’re going to take advantage of that for our scenario, as it
helps us avoid both issues mentioned before.
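As a small, self-contained demonstration of that capability, the following sketch links two disjoint memory blocks into a ReadOnlySequence<byte> and lets a single Utf8JsonReader read across the boundary; the MemorySegment helper is illustrative and not part of the BCL:

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

// The document is split mid-token ("tr" + "ue") on purpose to show that the
// reader handles the segment boundary transparently.
var first = new MemorySegment(Encoding.UTF8.GetBytes("{\"ok\":tr"));
var last = first.Append(Encoding.UTF8.GetBytes("ue}"));
var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

var reader = new Utf8JsonReader(sequence, isFinalBlock: true, state: default);
while (reader.Read())
{
    Console.WriteLine(reader.TokenType); // StartObject, PropertyName, True, EndObject
}

// Small illustrative helper for linking memory blocks into a sequence.
internal sealed class MemorySegment : ReadOnlySequenceSegment<byte>
{
    public MemorySegment(ReadOnlyMemory<byte> memory) => Memory = memory;

    public MemorySegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new MemorySegment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```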
That being said, every byte stream can obviously be broken down into chunks of
whatever size we wish, and consecutive chunks can be fed to the
Utf8JsonReader as needed. There are some important details relating to the
size of the sliding window of bytes we feed the reader. For example, if the end
of a chunk falls in the middle of a JSON token (a primitive value such as a number,
a string, a boolean, null, or a property name), the reader will obviously need
more input before it can continue. As long as the input byte stream has more
data, we can read that into a new chunk of memory and pass it to the reader.
Once a single chunk has been consumed completely by the Utf8JsonReader, we can
discard that chunk, thus allowing us to read through an input byte stream of
arbitrary length with a memory footprint that in the worst case is of size \(2N\),
where \(N\) is the size (in bytes) of the largest JSON token (including whitespace) found in the input stream, or the size of the buffers we allocate, whichever is higher. For example: in a byte stream that holds a bounded JSON array of millions of JSON number values, each at most 10-14 digits with an optional leading negative sign, making for JSON tokens at most 15 bytes long, and using a chunk size of 4 KiB, we need at most 8 KiB of buffers at any given time while reading through the stream (assuming no unnecessary whitespace), regardless of how long the stream itself is.
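The core of that sliding window could look roughly like the sketch below. BufferManager, CurrentWindow, FillNextChunk, and ReleaseConsumed are hypothetical placeholders for the bookkeeping discussed in the rest of this article; only the Utf8JsonReader parts (the constructor taking a JsonReaderState, CurrentState, and Position) are the actual API.

```csharp
using System.IO;
using System.Text.Json;

internal static class SlidingWindowReader
{
    // Sketch: drive a Utf8JsonReader over a sliding window of chunks.
    // BufferManager (hypothetical) exposes the current window as a
    // ReadOnlySequence<byte> plus the two helpers used below.
    public static void ReadAll(Stream stream, BufferManager buffers)
    {
        JsonReaderState state = default;
        bool streamExhausted = buffers.FillNextChunk(stream) == 0; // prime the first chunk

        while (true)
        {
            // A fresh reader per pass: Utf8JsonReader is a ref struct and cannot be stored.
            var reader = new Utf8JsonReader(buffers.CurrentWindow, isFinalBlock: streamExhausted, state);

            while (reader.Read())
            {
                // ... process reader.TokenType / value here ...
            }

            state = reader.CurrentState;              // carry partial-token state forward
            buffers.ReleaseConsumed(reader.Position); // drop fully consumed chunks, slide the window

            if (streamExhausted)
                break; // the last pass ran with isFinalBlock: true and read the final token

            streamExhausted = buffers.FillNextChunk(stream) == 0; // append the next chunk
        }
    }
}
```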
Contrary to the example given by Microsoft in the documentation linked above,
we also want to take advantage of the ArrayPool<byte> class
to re-use buffers as much as possible and to reduce both the number of heap
allocations and the stress on the garbage collector.
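In its simplest form, renting and returning chunk buffers might look like this sketch (the names here are mine, not the repo’s):

```csharp
using System.Buffers;
using System.IO;

internal static class ChunkBuffers
{
    // Rent a reusable chunk buffer from the shared pool and fill it from the stream.
    public static byte[] RentAndFill(Stream stream, int chunkSize, out int bytesRead)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(chunkSize); // may return a larger array
        bytesRead = stream.Read(buffer, 0, buffer.Length);      // 0 means end of stream
        return buffer;
    }

    // Once the reader has fully consumed a chunk, hand the array back to the pool.
    public static void Release(byte[] buffer) => ArrayPool<byte>.Shared.Return(buffer);
}
```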
We’ll call the class managing the chunks / buffers BufferManager, and we’ll
call the class managing individual chunks BufferManager.Segment, following the
naming of the ReadOnlySequenceSegment<T> type that backs the ReadOnlySequence<T> struct.
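A stripped-down sketch of such a segment could look as follows; member names loosely follow the description in this article and are not necessarily identical to the repo’s implementation:

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical sketch of BufferManager.Segment: each chunk wraps a pooled byte[]
// and links to the next chunk so the whole window can be exposed as a
// ReadOnlySequence<byte>.
internal sealed class Segment : ReadOnlySequenceSegment<byte>
{
    private readonly byte[] _buffer;

    public Segment(int chunkSize) => _buffer = ArrayPool<byte>.Shared.Rent(chunkSize);

    // First byte not yet consumed by the reader (BufferHead in this article).
    public int BufferHead { get; set; }

    // Number of valid bytes read from the stream into this chunk (BufferTail in this article).
    public int BufferTail { get; private set; }

    // Fill the remaining space of this chunk from the stream; returns 0 at end of stream.
    public int Fill(Stream stream)
    {
        int read = stream.Read(_buffer, BufferTail, _buffer.Length - BufferTail);
        BufferTail += read;
        Memory = _buffer.AsMemory(0, BufferTail); // expose only the filled portion
        return read;
    }

    // Link a follow-up chunk so both appear in one ReadOnlySequence<byte>
    // (assumes this chunk is full before the next one is appended).
    public Segment Append(Segment next)
    {
        next.RunningIndex = RunningIndex + Memory.Length;
        Next = next;
        return next;
    }

    // Hand the underlying array back to the pool once the chunk is fully consumed.
    public void Release() => ArrayPool<byte>.Shared.Return(_buffer);
}
```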
Wrapping Utf8JsonReader
One more important thing to mention is that the Read method
of the Utf8JsonReader struct gives the caller some hints regarding its
internal state. Specifically, it will return false if the read operation on
the input has not resulted in a complete JSON token being read. That can mean
one of two things:
- No more tokens are available, i.e., we’re at the end of the input.
- The buffers we provided to the reader have been exhausted and more data needs to be read from the stream.

We can recognize the former because it implies that we must also have read the stream to its end; conversely, if we have not yet read the stream to the end, we know it must be the latter.
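A tiny hypothetical helper makes that distinction explicit (streamExhausted is assumed to be tracked by the caller, i.e. whether the last byte of the input stream has been buffered):

```csharp
using System.Text.Json;

// Classifies a false return from Read() based on whether the input stream
// has already been buffered to its end.
internal enum ReadOutcome { TokenRead, NeedMoreData, EndOfInput }

internal static class ReaderHelpers
{
    public static ReadOutcome TryReadToken(ref Utf8JsonReader reader, bool streamExhausted)
    {
        if (reader.Read())
            return ReadOutcome.TokenRead;

        // Read() returned false: either the window ends mid-token, or we are truly done.
        return streamExhausted ? ReadOutcome.EndOfInput : ReadOutcome.NeedMoreData;
    }
}
```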
This, by the way, creates some interesting situations for the Skip method
of the Utf8JsonReader struct. The Skip method throws in every case where
the buffer the reader is working on is not yet the final block, that is,
when we have not yet read the remainder of the stream fully into memory and
passed it to the reader.
Looking at the definition of the Utf8JsonReader struct, we see that it is declared
as a ref struct type, which comes with a few limitations.
Specifically, Utf8JsonReader values cannot be allocated on the heap and
can only be kept on the stack. In earlier versions of C#, they could
also not be used in async methods, though this restriction has since
been eased a bit. All in all, for our use case it seems best not to fully wrap
the interface of Utf8JsonReader with its many methods and properties, and
instead to provide a helper class that works in tandem with the reader. Let’s call
this helper class JsonStreamProvider.
That means the following responsibilities for our classes:
- BufferManager
  - Keep track of all segments that are currently in use.
  - Provide a ReadOnlySequence<byte> describing the current window for use with Utf8JsonReader values.
- BufferManager.Segment
  - Rent and return buffers (byte[] objects) from the ArrayPool<byte>.Shared pool as needed.
  - Fill those buffers with data from the input stream and track the start and end position of each buffer in terms of data already consumed by the reader (tracked with BufferHead) and the last byte read from the stream (tracked with BufferTail).
- JsonStreamProvider
  - Keep track of reader state (see the enum sketch after this list):
    - Initialized to indicate that the state machine was initialized but no data has been made available to a Utf8JsonReader value yet.
    - Reading to indicate that we’re in normal read mode.
    - BufferedToEnd to indicate that we’ve read the last byte from the input stream into memory.
    - BufferMore to indicate that the current buffer(s) are not sufficient to read the next JSON token yet and we therefore need to read more from the input stream.
    - Finished to indicate that the input stream has been read to the end and all JSON tokens have been read too.
  - Create Utf8JsonReader values for the current window and advance (slide) the window as new chunks are read from the input stream.
  - Provide the wrapping Read() method, using its return values to change the internal state as necessary.
  - Provide a reliable Skip() method that does not have the same limitations as the Skip() method of the Utf8JsonReader.
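As a sketch, those states could be captured in a simple enum (the names follow the descriptions above; the repo may differ in detail):

```csharp
// The reader states tracked by JsonStreamProvider, as described above.
internal enum JsonStreamProviderState
{
    Initialized,   // state machine created, no data handed to a Utf8JsonReader yet
    Reading,       // normal read mode
    BufferedToEnd, // the last byte of the input stream has been buffered
    BufferMore,    // the current buffers end mid-token; more input must be buffered
    Finished       // the stream is fully read and all JSON tokens have been consumed
}
```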
This means the following state machine with transitions for the
JsonStreamProvider class:
[State machine diagram of the JsonStreamProvider]

As you can see, while we haven’t read the input stream to the end yet, we keep
going through the loop with the BufferMore state until we are at EOF of the
input stream. This guarantees that, as long as the input stream is all valid
JSON, we read it to the very end and finish only after we’ve read the last JSON
token from the buffered data.
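To sketch the reliable Skip() promised above: assuming the wrapper exposes TokenType and CurrentDepth mirroring the underlying reader, plus a Read() that transparently buffers more input, it can simply count depth instead of delegating to Utf8JsonReader.Skip(), which would throw on a non-final block:

```csharp
// Hypothetical JsonStreamProvider.Skip(): because this builds on the wrapper's own
// Read(), which refills buffers as needed, it never requires the final block.
public void Skip()
{
    if (TokenType == JsonTokenType.PropertyName)
    {
        Read(); // move from the property name to its value first
    }

    if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
    {
        int depth = CurrentDepth;
        // Read until the matching EndObject/EndArray brings us back to the same depth.
        while (Read() && CurrentDepth > depth)
        {
        }
    }
}
```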
Why so complicated?
Now, the solution presented above is obviously more complicated than the one in the official documentation from Microsoft (see link above). In my opinion, though, it’s well worth it in situations where it’s impossible to know upfront how large the stream is, or when you know that you can’t or don’t want to fit it all into memory at once. In such situations, separating the tedious details of buffer management and sliding windows from the logic that actually handles and processes the JSON tokens read from the stream pays off, and the bookkeeping logic becomes re-usable.
You can find all the code with all the details in the example GitHub repo I created. The repo also has an example producer of a stream that produces a JSON array of 10 million consecutive int32 values with a random starting number, plus a connected consumer of this stream that verifies that all the values are received in the correct order. Typically, this produces a stream of about 90-110 MiB, which you could still fit in memory, but you can easily change the number of produced values to a size where fitting it into memory really won’t make much sense anymore.