Prompting GPT Models to Return JSON

July 24 2023 · tech software-engineering ai

August 2024 Update: Now a solved problem. Use Structured Outputs.

Large language models (LLMs) return unstructured output. When we prompt them they respond with one large string. This is fine for applications such as ChatGPT, but in applications where we want the LLM to return structured data, such as lists or key-value pairs, we need a parseable response. In Building A ChatGPT-enhanced Python REPL I used a technique to prompt the LLM to return output in a text format I could parse. Here I explore several ways to get OpenAI LLMs to respond in JSON:

  1. Prompting
  2. OpenAI Function Calling (recommended)
  3. Techniques from 3rd party libraries

My use case is a simple code generator. Given a prompt, I want the LLM to generate code and a description and return them as keyed fields in a JSON object, e.g.

>>> generate_code("generate a recursive fibonacci function and explain it")
{
  "code": "def fibonacci(n): ...",
  "description": "To implement a recursive Fibonacci function..."
}

Prompting

The least reliable and most conceptually mind-bending way to get an LLM to return JSON is to instruct it to do so in a system prompt. From the billions of strings it has been trained on, the LLM has presumably picked up some semblance of what JSON is, though not without its faults. Below is the system prompt I feed into every API call to the LLM.

You are a coding assistant. If I ask you a question you should return concise descriptions and code.

If you don’t know the answer to a question say you do not know the answer to it and generate no code.

You are an assistant that only responds in JSON. Do not write normal text.

[no prose][Output only valid JSON]

Return responses as a valid JSON object with two keys:
The first key is description, a string, which must contain all natural language.
The second key is code, a string, which must contain all code samples.

Strings in the JSON response must be single line. Use the \n newline character within a string to indicate new lines

And the code to call it.

import json
from dataclasses import dataclass

import openai  # pre-1.0 openai SDK, as used when this post was written

@dataclass
class CodeGenerationResult:
    code: str
    description: str

def generate_code(prompt: str) -> CodeGenerationResult:
    # SYSTEM_PROMPT holds the JSON-only system prompt shown above
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1000,
        n=1,
    )

    # Parse the raw completion text as JSON; this raises if the model strayed from JSON
    decoded_response = json.loads(response.choices[0].message.content.strip())

    return CodeGenerationResult(code=decoded_response["code"], description=decoded_response["description"])

It’s not reliable. Most of the time the LLM returns data in valid JSON structure, but every so often, with no discernible pattern, it responds with a single block of plain text. Sometimes it responds with JSON-ish data: for example, without the last clause of the system prompt it would return multi-line strings, which the JSON spec does not allow.
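In practice I also guard the parsing step so a malformed response fails loudly rather than surfacing as a confusing KeyError later. Below is a minimal sketch, using a hypothetical parse_code_response helper; the name and error messages are my own, not part of the code above.

import json

def parse_code_response(raw: str) -> dict:
    # Fail loudly if the model strayed from the JSON-only instruction.
    try:
        decoded = json.loads(raw.strip())
    except json.JSONDecodeError as err:
        raise ValueError(f"model did not return valid JSON: {err}") from err

    if not isinstance(decoded, dict):
        raise ValueError(f"expected a JSON object, got {type(decoded).__name__}")

    missing = {"code", "description"} - decoded.keys()
    if missing:
        raise ValueError(f"model response is missing keys: {missing}")

    return decoded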

On the plus side, this is fine for local development and exploration, and it should work on most LLMs with varying degrees of success, unlike the more reliable OpenAI Function Calling we look at next, which is proprietary to OpenAI’s APIs.

OpenAI Function Calling

The most reliable way to get JSON outputs from OpenAI’s GPT models is to use the function calling API.

First, specify a JSON schema for the return type. Use verbose property names and descriptions, as the LLM reads these.

generate_code_schema = {
  "type": "object",
  "properties": {
    "code": {
      "type": "string",
      "description": "Generated code sample. No natural language"
    },
    "description": {
      "type": "string",
      "description": "English language explanation or description of the generated code"
    }
  },
  "required": ["code", "description"]
}

Then wire it into the call.

def generate_code(prompt: str) -> CodeGenerationResult:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1000,
        n=1,
        functions=[
            {
                "name": "generate_code",
                "description": "generates code and description in JSON format. used by code assistants",
                "parameters": generate_code_schema,
            }
        ],
        # Force the model to call our function rather than replying with free text.
        function_call={"name": "generate_code"},
    )

    # The structured result arrives as a JSON string in the function call arguments.
    decoded_response = json.loads(response.choices[0].message.function_call.arguments.strip())

    return CodeGenerationResult(code=decoded_response["code"], description=decoded_response["description"])

I understand that behind the scenes the JSON schema and function declaration are passed into the LLM as raw text. Be verbose when naming symbols and writing descriptions, and make sure they line up with the system prompt: the system prompt instructs the LLM to generate code, and we give it a function named generate_code.

I’ve found this approach more reliable than raw prompting, but it is coupled to the OpenAI API; other providers may do this differently. As we’ll see below, 3rd party libraries have already mirrored the approach.

Techniques From 3rd Party Libraries

Finally, we look at some of the techniques used by 3rd party libraries in this area. There are a lot of them, each specific to its own tech stack.

Verify and Retry

If it doesn’t work, try it again. TypeChat, ZodGPT, and no doubt countless others have a basic loop of:

  1. Prompt the LLM to respond in a structured format (like the first method of this post)
  2. Parse the response and verify it is as expected
  3. If it fails verification, augment the prompt with the error information and prompt the LLM again

It’s a crude but effective software engineering pattern that you should apply to all interactions with LLMs, though watch out for the added cost and latency of each retry. ZodGPT runs on any LLM whereas TypeChat is currently coupled to the OpenAI models. A sketch of the loop is below.
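This is only a minimal sketch of the pattern, reusing SYSTEM_PROMPT, CodeGenerationResult, and the parse_code_response helper from earlier in this post; the retry limit and the wording of the error feedback are arbitrary choices of mine, not anything TypeChat or ZodGPT prescribes.

MAX_ATTEMPTS = 3  # arbitrary; tune for your cost and latency budget

def generate_code_with_retry(prompt: str) -> CodeGenerationResult:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

    for _ in range(MAX_ATTEMPTS):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=messages, max_tokens=1000, n=1
        )
        raw = response.choices[0].message.content

        try:
            decoded = parse_code_response(raw)
            return CodeGenerationResult(code=decoded["code"], description=decoded["description"])
        except ValueError as err:
            # Feed the failure back to the model and ask it to try again.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That response was invalid: {err}. "
                           "Respond again with only a valid single-line JSON object.",
            })

    raise RuntimeError(f"no valid JSON response after {MAX_ATTEMPTS} attempts")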

GPT Knows TypeScript?

ZodGPT and the approaches above require the LLM to understand JSON. TypeChat takes it a step further and prompts the LLM with TypeScript type definitions, then runs the LLM’s response through an actual TypeScript type checker.

This seems more thorough than the ZodGPT approach, but it’s not clear it yields any practical benefit: either way the data is typed and verified.

Formal Grammars

A recent pull request to llama.cpp gives us insight into what might be the future of LLMs returning structured data. It allows the caller to provide a formal grammar which the LLM adheres to when responding to a prompt. What does this mean?

GPTs generate the next token in a sequence; conceptually, think of this as generating the next character in a string. Behind the scenes the model proposes a set of possible next tokens and assigns each a probability reflecting its suitability. We don’t want the highest-rated token every time, or we get repetitive, ‘uncreative’ output, so a selection process chooses one of those tokens and the model moves on to generating the next.

Consider what happens when we ask a model to return data in JSON format. The model is generating and is up to here: { "description". We know that for valid JSON the next character has to be a :, and perhaps that is the token assigned the highest probability, but the model may still choose a different one. And here is the tension: we want the model to be ‘creative’ and novel at a high level, in the words and code it generates, but absolutely formal when it comes to the JSON syntax that wraps it. To the model there is no difference.

The pull request above introduces grammars at the token selection stage: all of the candidate next tokens are run through the grammar, those which violate it are discarded, one of the remaining tokens is selected, and the model moves on.
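To make the selection step concrete, here is a toy sketch in Python. It is not how llama.cpp implements grammars; it only illustrates candidate tokens being filtered by a validity check before one is sampled, using the { "description" example above as the lone hard-coded rule.

import random

def grammar_allows(partial_output: str, token: str) -> bool:
    # Toy stand-in for a real grammar: immediately after the "description" key,
    # the only legal continuation in JSON is the ':' separator.
    if partial_output.rstrip().endswith('"description"'):
        return token.lstrip().startswith(":")
    return True

def pick_next_token(partial_output: str, candidates: dict[str, float]) -> str:
    # candidates maps each proposed next token to the probability the model assigned it.
    allowed = {t: p for t, p in candidates.items() if grammar_allows(partial_output, t)}
    if not allowed:
        raise RuntimeError("grammar rejected every candidate token")

    # Sample from the surviving tokens, weighted by the model's probabilities,
    # so the output stays 'creative' everywhere the grammar permits it.
    tokens, weights = zip(*allowed.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# e.g. pick_next_token('{ "description"', {": ": 0.6, ",": 0.3, "}": 0.1}) can only return ": "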

Conclusion

The current state of LLMs responding with structured data is usable but not fully reliable. Whether you use a simple prompt or OpenAI Function Calling, apply software engineering techniques such as retry logic, and add enough observability that failures don’t go unnoticed. The formal grammar support recently added to llama.cpp is promising; if a similar approach is adopted by LLM providers, it could offer a reliable future without the need for language-specific client libraries to patch over the shortcomings of LLMs.


