Utf8JsonReader with Stream Input
The Utf8JsonReader struct
was introduced years ago with .NET Core as a built-in way to read
JSON documents one token at a time, as opposed to using external libraries or
reading entire documents into memory at once. It was designed as a forward-only
API that outperforms similar readers such as the one in the Newtonsoft.Json
library. That performance, however, comes at the cost of convenience and
ease of use, at least in some situations.
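To ground the discussion, here is a minimal sketch of that token-by-token API over a document that already sits entirely in memory (no streaming involved yet):

```csharp
using System;
using System.Text;
using System.Text.Json;

// Minimal illustration of the forward-only, token-by-token API when the whole
// document is already in memory.
byte[] json = Encoding.UTF8.GetBytes("{\"name\":\"example\",\"values\":[1,2,3]}");
var reader = new Utf8JsonReader(json);

while (reader.Read())
{
    Console.WriteLine($"{reader.TokenType} at depth {reader.CurrentDepth}");
}
```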
Such situations include input byte streams that cannot always be read entirely into memory, for example because their length is unknown or because they simply won’t fit. In situations where you can in good conscience read all input into memory, don’t bother applying the approach discussed here; it will just make your life harder for no good reason. Of course, feel free to satisfy your curiosity regardless.
Note: You can find the example code on GitHub.
Continuous reading from a byte stream
Microsoft has documented reading from a stream using the Utf8JsonReader in its
public documentation (see Read from a stream using Utf8JsonReader). The example given
on that page, however, is very basic and suffers from a few issues. Granted, these
may not always cause noticeable problems, so the example can often be good enough,
but I’m going to address them here anyway. For example, it is not necessary to grow
the buffer by doubling the size of the array, much like it is not necessary to copy
the unconsumed remainder of a buffer to a different location in memory.
At this point, it’s important to note that Utf8JsonReader can easily read across
multiple disjointed blocks of memory using .NET’s built-in
ReadOnlySequence<byte> struct,
and we’re going to take advantage of that for our scenario, as it
helps us avoid both issues mentioned before.
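As a small, self-contained demonstration of that capability, the following sketch links two disjoint memory blocks into a ReadOnlySequence<byte> and lets a single Utf8JsonReader read across the boundary; the MemorySegment helper is illustrative and not part of the BCL:

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

// The document is split mid-token ("tr" + "ue") on purpose to show that the
// reader handles the segment boundary transparently.
var first = new MemorySegment(Encoding.UTF8.GetBytes("{\"ok\":tr"));
var last = first.Append(Encoding.UTF8.GetBytes("ue}"));
var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

var reader = new Utf8JsonReader(sequence, isFinalBlock: true, state: default);
while (reader.Read())
{
    Console.WriteLine(reader.TokenType); // StartObject, PropertyName, True, EndObject
}

// Small illustrative helper for linking memory blocks into a sequence.
internal sealed class MemorySegment : ReadOnlySequenceSegment<byte>
{
    public MemorySegment(ReadOnlyMemory<byte> memory) => Memory = memory;

    public MemorySegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new MemorySegment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```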
That being said, every byte stream can obviously be broken down into chunks of
whatever size we wish, and consecutive chunks can be fed to the
Utf8JsonReader as needed. There are some important details relating to the
size of the sliding window of bytes we feed the reader. For example, if the end
of a chunk falls in the middle of a JSON token (a primitive value such as a number,
a string, a boolean, null, or a property name), the reader will obviously need
more input before it can continue. As long as the input byte stream has more
data, we can read that into a new chunk of memory and pass it to the reader.
Once a single chunk has been consumed completely by the Utf8JsonReader, we can
discard that chunk, thus allowing us to read through an input byte stream of
arbitrary length with a memory footprint that in the worst case is of size \(2N\),
where \(N\) is the size (in bytes) of the largest JSON token (including whitespace) found in the input stream, or the size of the buffers we allocate, whichever is higher. For example: in a byte stream that holds a bounded JSON array of millions of JSON number values, each at most 10-14 digits with an optional leading negative sign, making for JSON tokens at most 15 bytes long, and using a chunk size of 4 KiB, we need at most 8 KiB of buffers at any given time while reading through the stream (assuming no unnecessary whitespace), regardless of how long the stream itself is.
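The core of that sliding window could look roughly like the sketch below. BufferManager, CurrentWindow, FillNextChunk, and ReleaseConsumed are hypothetical placeholders for the bookkeeping discussed in the rest of this article; only the Utf8JsonReader parts (the constructor taking a JsonReaderState, CurrentState, and Position) are the actual API.

```csharp
using System.IO;
using System.Text.Json;

internal static class SlidingWindowReader
{
    // Sketch: drive a Utf8JsonReader over a sliding window of chunks.
    // BufferManager (hypothetical) exposes the current window as a
    // ReadOnlySequence<byte> plus the two helpers used below.
    public static void ReadAll(Stream stream, BufferManager buffers)
    {
        JsonReaderState state = default;
        bool streamExhausted = buffers.FillNextChunk(stream) == 0; // prime the first chunk

        while (true)
        {
            // A fresh reader per pass: Utf8JsonReader is a ref struct and cannot be stored.
            var reader = new Utf8JsonReader(buffers.CurrentWindow, isFinalBlock: streamExhausted, state);

            while (reader.Read())
            {
                // ... process reader.TokenType / value here ...
            }

            state = reader.CurrentState;              // carry partial-token state forward
            buffers.ReleaseConsumed(reader.Position); // drop fully consumed chunks, slide the window

            if (streamExhausted)
                break; // the last pass ran with isFinalBlock: true and read the final token

            streamExhausted = buffers.FillNextChunk(stream) == 0; // append the next chunk
        }
    }
}
```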
Contrary to the example given by Microsoft in the documentation linked above,
we also want to take advantage of the ArrayPool<byte> class
to re-use buffers as much as possible and to reduce both the number of heap
allocations and the stress on the garbage collector.
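In its simplest form, renting and returning chunk buffers might look like this sketch (the names here are mine, not the repo’s):

```csharp
using System.Buffers;
using System.IO;

internal static class ChunkBuffers
{
    // Rent a reusable chunk buffer from the shared pool and fill it from the stream.
    public static byte[] RentAndFill(Stream stream, int chunkSize, out int bytesRead)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(chunkSize); // may return a larger array
        bytesRead = stream.Read(buffer, 0, buffer.Length);      // 0 means end of stream
        return buffer;
    }

    // Once the reader has fully consumed a chunk, hand the array back to the pool.
    public static void Release(byte[] buffer) => ArrayPool<byte>.Shared.Return(buffer);
}
```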
We’ll call the class managing the chunks / buffers BufferManager, and we’ll
call the class managing individual chunks BufferManager.Segment, following the
naming of the ReadOnlySequenceSegment<T> type that backs the ReadOnlySequence<T> struct.
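A stripped-down sketch of such a segment could look as follows; member names loosely follow the description in this article and are not necessarily identical to the repo’s implementation:

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical sketch of BufferManager.Segment: each chunk wraps a pooled byte[]
// and links to the next chunk so the whole window can be exposed as a
// ReadOnlySequence<byte>.
internal sealed class Segment : ReadOnlySequenceSegment<byte>
{
    private readonly byte[] _buffer;

    public Segment(int chunkSize) => _buffer = ArrayPool<byte>.Shared.Rent(chunkSize);

    // First byte not yet consumed by the reader (BufferHead in this article).
    public int BufferHead { get; set; }

    // Number of valid bytes read from the stream into this chunk (BufferTail in this article).
    public int BufferTail { get; private set; }

    // Fill the remaining space of this chunk from the stream; returns 0 at end of stream.
    public int Fill(Stream stream)
    {
        int read = stream.Read(_buffer, BufferTail, _buffer.Length - BufferTail);
        BufferTail += read;
        Memory = _buffer.AsMemory(0, BufferTail); // expose only the filled portion
        return read;
    }

    // Link a follow-up chunk so both appear in one ReadOnlySequence<byte>
    // (assumes this chunk is full before the next one is appended).
    public Segment Append(Segment next)
    {
        next.RunningIndex = RunningIndex + Memory.Length;
        Next = next;
        return next;
    }

    // Hand the underlying array back to the pool once the chunk is fully consumed.
    public void Release() => ArrayPool<byte>.Shared.Return(_buffer);
}
```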
Wrapping Utf8JsonReader
One more important thing to mention is that the Read method
of the Utf8JsonReader struct gives the caller some hints regarding its
internal state. Specifically, it will return false if the read operation on
the input has not resulted in a complete JSON token being read. That can mean
one of two things:
- No more tokens are available, i.e., we’re at the end of the input.
- The buffers we provided to the reader have been exhausted and more data needs to be read from the stream.

We can recognize the former because it implies that we must also have read the stream to its end; conversely, if we have not yet read the stream to the end, we know it must be the latter.
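A tiny hypothetical helper makes that distinction explicit (streamExhausted is assumed to be tracked by the caller, i.e. whether the last byte of the input stream has been buffered):

```csharp
using System.Text.Json;

// Classifies a false return from Read() based on whether the input stream
// has already been buffered to its end.
internal enum ReadOutcome { TokenRead, NeedMoreData, EndOfInput }

internal static class ReaderHelpers
{
    public static ReadOutcome TryReadToken(ref Utf8JsonReader reader, bool streamExhausted)
    {
        if (reader.Read())
            return ReadOutcome.TokenRead;

        // Read() returned false: either the window ends mid-token, or we are truly done.
        return streamExhausted ? ReadOutcome.EndOfInput : ReadOutcome.NeedMoreData;
    }
}
```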
This, by the way, creates some interesting situations for the Skip method
of the Utf8JsonReader struct. The Skip method throws in every case where
the buffer the reader is working on is not yet the final block, that is,
when we have not yet read the remainder of the stream fully into memory and
passed it to the reader.
Looking at the definition of the Utf8JsonReader struct, we see that it is declared
as a ref struct type, which comes with a few limitations.
Specifically, Utf8JsonReader values cannot be allocated on the heap and
can only be kept on the stack. In earlier versions of C#, they could
also not be used in async methods, though this restriction has since
been eased a bit. All in all, for our use case it seems best not to fully wrap
the interface of Utf8JsonReader with its many methods and properties, and
instead to provide a helper class that works in tandem with the reader. Let’s call
this helper class JsonStreamProvider.
That means the following responsibilities for our classes:
- BufferManager
  - Keep track of all segments that are currently in use.
  - Provide a ReadOnlySequence<byte> describing the current window for use with Utf8JsonReader values.
- BufferManager.Segment
  - Rent and return buffers (byte[] objects) from the ArrayPool<byte>.Shared pool as needed.
  - Fill those buffers with data from the input stream and track the start and end position of each buffer in terms of data already consumed by the reader (tracked with BufferHead) and the last byte read from the stream (tracked with BufferTail).
- JsonStreamProvider
  - Keep track of reader state (see the enum sketch after this list):
    - Initialized to indicate that the state machine was initialized but no data has been made available to a Utf8JsonReader value yet.
    - Reading to indicate that we’re in normal read mode.
    - BufferedToEnd to indicate that we’ve read the last byte from the input stream into memory.
    - BufferMore to indicate that the current buffer(s) are not sufficient to read the next JSON token yet and we therefore need to read more from the input stream.
    - Finished to indicate that the input stream has been read to the end and all JSON tokens have been read too.
  - Create Utf8JsonReader values for the current window and advance (slide) the window as new chunks are read from the input stream.
  - Provide the wrapping Read() method, using its return values to change the internal state as necessary.
  - Provide a reliable Skip() method that does not have the same limitations as the Skip() method of the Utf8JsonReader.
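As a sketch, those states could be captured in a simple enum (the names follow the descriptions above; the repo may differ in detail):

```csharp
// The reader states tracked by JsonStreamProvider, as described above.
internal enum JsonStreamProviderState
{
    Initialized,   // state machine created, no data handed to a Utf8JsonReader yet
    Reading,       // normal read mode
    BufferedToEnd, // the last byte of the input stream has been buffered
    BufferMore,    // the current buffers end mid-token; more input must be buffered
    Finished       // the stream is fully read and all JSON tokens have been consumed
}
```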
This means the following state machine with transitions for the
JsonStreamProvider class:
[State machine diagram of the JsonStreamProvider]

As you can see, while we haven’t read the input stream to the end yet, we keep
going through the loop with the BufferMore state until we are at EOF of the
input stream. This guarantees that, as long as the input stream is all valid
JSON, we read it to the very end and finish only after we’ve read the last JSON
token from the buffered data.
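To sketch the reliable Skip() promised above: assuming the wrapper exposes TokenType and CurrentDepth mirroring the underlying reader, plus a Read() that transparently buffers more input, it can simply count depth instead of delegating to Utf8JsonReader.Skip(), which would throw on a non-final block:

```csharp
// Hypothetical JsonStreamProvider.Skip(): because this builds on the wrapper's own
// Read(), which refills buffers as needed, it never requires the final block.
public void Skip()
{
    if (TokenType == JsonTokenType.PropertyName)
    {
        Read(); // move from the property name to its value first
    }

    if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
    {
        int depth = CurrentDepth;
        // Read until the matching EndObject/EndArray brings us back to the same depth.
        while (Read() && CurrentDepth > depth)
        {
        }
    }
}
```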
Why so complicated?
Now, the solution presented above is obviously more complicated than the one in the official documentation from Microsoft (see link above). In my opinion, though, it’s well worth it in situations where it’s impossible to know upfront how large the stream is, or when you know that you can’t or don’t want to fit it all into memory at once. In such situations, separating the tedious details of buffer management and sliding windows from the logic that actually handles and processes the JSON tokens read from the stream pays off, and the bookkeeping logic becomes re-usable.
You can find all the code with all the details in the example GitHub repo I created. The repo also has an example producer of a stream that produces a JSON array of 10 million consecutive int32 values with a random starting number, plus a connected consumer of this stream that verifies that all the values are received in the correct order. Typically, this produces a stream of about 90-110 MiB, which you could still fit in memory, but you can easily change the number of produced values to a size where fitting it into memory really won’t make much sense anymore.