GPT-React

Dec 20, 2023

Introduction

The full code for this is available here for reference.
A while ago, I saw a demo video of Vercel's V0 and was blown away by what it could produce. It could take in user prompts and feedback, then iteratively generate new and improved UI code using the popular @shadcn/ui library. This was soon followed by the open-v0 project by raidendotai. Since I didn't have access to v0 via Vercel, I figured I would clone the project and try to figure out how it worked.

One eventful Friday evening later, I ended up putting together a small prototype which uses context-aware RAG and pydantic to generate valid NextJS code based on a user prompt, which you can see below.

The GIF renders pretty slowly for some reason, so if you want to see the original clip, you can check it out here.

Overview

Let's break down how it works under the hood. At a high level, whenever a user submits a prompt we do the following:

  1. First, we extract a list of subtasks which might be relevant to solving this problem. We also pass a list of components, with a short description of each, into the prompt for added context.

  2. We then embed these subtasks and perform a lookup against code chunks in our vector db. This vector db contains an embedding map from task (which was generated by GPT-4) to a code chunk from the @shadcn/ui library.
  3. We then deduplicate the extracted code chunks so that the model only sees each example once.

  4. We also pass in a set of Rules. These are conditions which the model has to conform to.

  5. Finally, we get the model to generate the code.

Note that the entire codebase uses GPT-4 by default. I didn't have the time to do fine-tuning.

Data Preparation

As anyone working with LLMs will tell you, good quality data matters most, and this project is no exception.

In my case, I chose to process the data as follows:

  1. First, extract all of the code chunks in the @shadcn/ui library. I used a dump.json file from the open-v0 repository to do this.
  2. Next, for each code chunk, I got GPT-4 to generate a task that might have led to this code chunk being written.

Let's see a quick example.

import { Input } from "@/components/ui/input";

export function InputDemo() {
  return <Input type="email" placeholder="Email" />;
}

What are some tasks that could have been assigned to a developer who submits a code chunk like this? In this case, GPT-4 came up with

  • Create a new component named InputDemo that uses the Input component from the @shadcn/ui library
  • Set the type of the Input component to email
  • Set the placeholder of the Input component to Email
  • Ensure the InputDemo component is exported as the default export

So we perform this task generation for each and every code example given in the @shadcn/ui library. Ideally we want multiple potential tasks per code chunk, so that there are more options to match against when a user makes a query.

Generating the Tasks

The original dump.json file doesn't store code chunks in a convenient format and contains a lot of extra metadata, so we first need to massage the data. This is done using the extract_examples_from_data function in the source code:

def extract_examples_from_data(doc_data):
    return [i["code"] for i in doc_data["docs"]["examples"]]

Once we've extracted the source code out, we now have a collection of code chunks. Now let's think a bit about the kind of data structure we expect back. This is where pydantic shines.

class Task(BaseModel):
    """
    This is a class which represents a potential task that could have resulted in the code snippet provided

    eg. I want a button that generates a toast when it's clicked
    eg. I want a login form which allows users to key in their email and validates that it belongs to the facebook.com domain.
    """
    task: str = Field(description="This is a task which might have resulted in the component")

It's useful here to point out that the more examples you provide and the more descriptive your class definition is, the better your eventual outputs will be. Since we want multiple tasks rather than just one per code chunk, we can take advantage of the MultiTask functionality provided by the instructor library.

Don't forget to run instructor.patch() before you run your functions so you get the nice functionality it provides.
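
For reference, patching is just a couple of lines. A minimal sketch, assuming the pre-v1 openai SDK and the instructor version this post was written against:

import openai
import instructor

# Patch the openai module so that ChatCompletion.create understands the extra
# keyword arguments instructor provides (function-based response models, max_retries, etc.)
instructor.patch()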
MultiTask allows us to query GPT-4 for multiple instances of the Task object by creating a new pydantic class. In our case, we'll call it MultiTaskType so that we don't get a naming conflict.
MultiTaskType = instructor.MultiTask(Task)

We can then use this in our code as follows

completion = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.3,
    stream=False,
    max_retries=2,
    functions=[MultiTaskType.openai_schema],
    function_call={"name": MultiTaskType.openai_schema["name"]},
    messages=[
        {
            "role": "system",
            "content": "As an experienced programmer using the NextJS Framework with the @shadcn/ui and tailwindcss library, you are tasked with brainstorming some tasks that could have resulted in the following code chunk being produced.",
        },
        {
            "role": "assistant",
            "content": "Examples of such tasks could be adding a toast component to display temporary messages, using a specific variant available in the @shadcn/ui library or configuring a component that toggles between two display states",
        },
        {
            "role": "assistant",
            "content": "Tasks should be as diverse as possible while staying relevant. Generate at most 4 Tasks.",
        },
        {
            "role": "user",
            "content": f"{chunk}",
        },
    ],
    max_tokens=1000,
)
res = MultiTaskType.from_response(completion)
return [i.task for i in list(res)[0][1]]

Notice here that we get automatic retries with the new max_retries parameter thanks to patching openai with instructor. While running this, since we're firing off a lot of API calls to GPT-4, we might also get rate-limited.

At the time of this article, GPT-4 has a rate limit of approximately 200 requests per minute, so when running this script over a large dataset it's very likely that you will be rate-limited. PLEASE BUILD RATE-LIMITING AND CHECKPOINTS INTO YOUR CODE.

Therefore, I found it useful to run this function asynchronously with the following retry logic:

async def generate_task_given_chunk(chunk: str, retries=3) -> List[str]:
    for _ in range(retries):
        try:
            # Execute the completion code from above here and return the extracted tasks
            ...
        except Exception as e:
            print("----Encountered an exception")
            print(e)
            await asyncio.sleep(60)
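
To stay within the rate limit even with these retries, you could additionally cap the number of in-flight requests with a semaphore. A minimal sketch, where the concurrency limit of 10 is an assumption you would tune for your own quota:

semaphore = asyncio.Semaphore(10)  # assumed cap on concurrent GPT-4 calls

async def generate_task_with_limit(chunk: str) -> List[str]:
    # Only a handful of requests are allowed to be in flight at any one time
    async with semaphore:
        return await generate_task_given_chunk(chunk)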

We can then map over all the examples in a single async call.

code_examples:List[str] = extract_examples_from_data(data[mapping[key]])
loop = asyncio.get_event_loop()
tasks = [generate_task_given_chunk(code_example) for code_example in code_examples]
results = loop.run_until_complete(asyncio.gather(*tasks))

We then save the results at each step by using

for code_example, potential_tasks in zip(code_examples, results):
    if key not in queries:
        queries[key] = {}
    queries[key][code_example] = potential_tasks

with open(data_dir,"w+") as f:
    json.dump(queries, f)

Running it on the entire set of code examples took me about 10-20 minutes, so just let it run and go get a cup of coffee.

Chromadb

While there are a lot of potential options out there, I wanted a simple-to-deploy vector db that I could run locally. After some research, I ended up settling on chromadb, which provides a Python integration and persistent storage.

pip install chromadb

We can then initialise a persistent Chromadb instance by running

chroma_client = chromadb.PersistentClient(path="./chromadb")  # path to the on-disk database

Once we've generated the tasks and saved them in a giant JSON file, we can generate some embeddings for easy lookup. In my case, I'm storing my data in a collection called task_to_chunk. I delete the collection on each run of my generate script so that I end up with a fresh embedding database.

try:
  collection = chroma_client.get_collection(collection_name)
  chroma_client.delete_collection(collection_name)
  print(f"Deleted {collection_name}.")
except Exception as e:
  print(f"Collection {collection_name} does not exist...creating now")
# We generate the task list, and we have the code. The next step is to then embed each of the individual tasks into a chromadb database
collection:chromadb.Collection = chroma_client.get_or_create_collection(collection_name)

We can then embed the entire list of tasks using the default settings in chromadb

# Then we embed the individual queries
for component in queries.keys():
    for chunk in queries[component].keys():
        for task in queries[component][chunk]:
            task_id = uuid.uuid4()
            collection.add(
                documents=[task],
                metadatas=[{
                    "chunk": chunk,
                    "component": component,
                }],
                ids=[str(task_id)]
            )

Note here that we embed the task and then store the chunk and component as metadata for easy lookup. We also generate a unique uuid for each task we add into the database.

Server

Scaffolding

We can now build a small Python server using FastAPI. We can initialise it by creating a simple main.py script, as seen below:

from typing import Union
from pydantic import BaseModel
from fastapi import FastAPI
import chromadb
import dotenv
import openai
import os

dotenv.load_dotenv()
openai.api_key = os.environ["OPENAI_KEY"]

app = FastAPI()
chroma_client = chromadb.PersistentClient(path="./chromadb")
collection = chroma_client.get_collection("task_to_chunk")

class UserGenerationRequest(BaseModel):
    prompt:str

@app.post("/")
def read_root(Prompt:UserGenerationRequest):
  return "OK"

In our example above, we

  • Loaded an OpenAI API key from the system env
  • Started a chroma instance that we can query against
  • Got a reference to our chromadb collection
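
You can then start the server locally with uvicorn main:app --reload, assuming uvicorn is installed and the file is saved as main.py.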

Generating Sub Tasks

Now that we can take in a user prompt with our endpoint, let's generate a bunch of subtasks.

Similar to our earlier example, we'll start by defining our pydantic model

class SubTask(BaseModel):
  """
  This is a class representing a sub-task that must be completed in order for the user's request to be fulfilled.

  Eg. I want to have a login form that users can use to log in with their email and password. If it's a successful login, then we should display a small toast stating Congratulations! You've successfully logged in.

  We can decompose our task into sub-tasks such as
  - Create an input form that takes in an email
  - Create a primary button that has an onClick event
  - Display a toast using the useToast hook

  and so on.
  """
  task: str = Field(description="This is an instance of a sub-task which is relevant to the user's designated task")

We can then use the MultiTask helper from the instructor library again (defining something like MultiTaskObjects = instructor.MultiTask(SubTask), analogous to MultiTaskType above) to get a variety of different sub-tasks using the function below.
def generate_sub_tasks(query):
  completion = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.3,
    stream=False,
    functions=[MultiTaskObjects.openai_schema],
    function_call={"name": MultiTaskObjects.openai_schema["name"]},
    messages=[
        {
          "role": "system",
          "content": f"You are a senior software engineer who is very familiar with NextJS and the @shadcn/ui libary. You are about to be given a task by a user to create a new NextJS component."
        },
        {
          "role":"assistant",
          "content": "Before you start on your task, let's think step by step to come up with some sub-tasks that must be completed to achieve the task you are about to be given."
        },
        {
          "role":"assistant",
          "content":f"Here are some of the components avaliable for use\n{COMPONENTS}"
        },
        {
            "role": "user",
            "content": f"{query}",
        },
    ],
    max_tokens=1000,
  )
  queries = MultiTaskObjects.from_response(completion)

  # General Syntax to get the individual tasks out from the MultiTaskObjects object
  return [i.task for i in list(queries)[0][1]]

Note here that we're also passing in a list of all the components that our LLM has access to. This looks like:

COMPONENTS = """
-Typography:Styles for headings, paragraphs, lists...etc
-Accordion:A vertically stacked set of interactive headings that each reveal a section of content.
-Alert:Displays a callout for user attention.
-Alert Dialog:A modal dialog that interrupts the user with important content and expects a response.
-Aspect Ratio:Displays content within a desired ratio.
# other components go here
"""

I found that this helped the LLM make more accurate decisions when it came to sub-tasks, since it was able to identify precisely which components it needed. Once we've extracted the sub-tasks, we can then proceed to get the relevant code chunks by computing embeddings.

Getting Embeddings

I found that chromadb has a slightly awkward way of returning results. Ideally I'd want a single object containing just the relevant code chunks, but it returns a map of parallel lists:

{
  ids: [id1, id2],
  metadatas: [metadata1, metadata2],
  embeddings: [embedding1, embedding2]
}

So, I wrote this little snippet to extract the top 3 most relevant code chunks for our example.

relevant_results = collection.query(
    query_texts=[user_prompt],
    n_results=3
)

ctx = []
uniq = set()
for i in range(len(relevant_results["metadatas"])):
    for code_sample,sample_query in zip(relevant_results["metadatas"][i],relevant_results["documents"][i]):
        if sample_query not in uniq:
            ctx.append(f"Eg.{sample_query}\n```{code_sample}```\n")
            uniq.add(sample_query)

ctx_string = "".join(ctx)

Note here that I'm using the sample_query instead of the code chunk as the unique key for deduplication. This is intentional. I thought that it would be useful to provide a few different ways that the same code chunk could be used.

This then produces a long context string that looks like this

Eg. How to code an input box

// code chunk goes here

These snippets are then compiled and fed in as context.

Prompting our Model

I found that by using a vanilla prompt, GPT-4 tended to return either

  1. Invalid code - it would hallucinate and create imports from other libraries such as @chakra-ui
  2. A short remark before giving me the actual code (Eg. Sure, I can give you the code ... (react code goes here))

In order to fix this, I implemented two solutions

  1. Rules - these are instructions that the model needs to follow. I thought it would be most useful to add them at the end of the prompt, because models tend to give additional weight to information placed at the start and end of their prompts.

  2. A Pydantic Model - This would force it to return code

The rules were not overly complex; I ended up using the following after some experimentation. They are useful because libraries tend to have library-specific implementation details, and having rules ensures that your model respects them.

Here are some rules that you must follow when generating react code

  1. Always add a title and description to a toast

    onClick={() => {
      toast({
        title: // title goes here,
        description: // description,
      })
    }}

  2. Make sure to only use imports that follow the following pattern

  • 'React'
  • '@/components/ui/componentName'
  • 'next/'

  3. No other libraries are allowed to be used
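
In the code these rules simply live in a plain string constant that later gets interpolated into the generation prompt as {RULES}. Roughly (the exact formatting is an assumption):

RULES = """
Here are some rules that you must follow when generating react code
1. Always add a title and description to a toast
2. Only use imports that match 'React', '@/components/ui/componentName' or 'next/'
3. No other libraries are allowed to be used
"""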

In terms of the pydantic model, I didn't use anything overly complex.

class CodeResult(BaseModel):
  """
  This is a class representing the generated code from a user's query. This should only include valid React code that uses the @shadcn/ui library. Please make sure to conform to the examples shown.
  """
  code: str

I found that the above class definition was good enough to get consistent React code generated throughout all my individual tests. I then utilised GPT-4 to generate the code:

def generate_code(ctx_string,user_prompt):
  gen_code: CodeResult = openai.ChatCompletion.create(
    model="gpt-4",
    response_model=CodeResult,
    max_retries=2,
    messages=[
        {
          "role": "system",
          "content": f"You are a NextJS expert programmer. You are about to be given a task by a user to fulfil an objective using only components and methods found in the examples below."
        },
        {
            "role":"assistant",
            "content":f"Here are some relevant examples that you should refer to. Only use information from the example. Do not invent or make anything up.\n {ctx_string}"
        },
        {
          "role":"assistant",
          "content":f"Please adhere to the following rules when generating components\n {RULES}"
        },
        {
            "role": "user",
            "content": f"{user_prompt}",
        },
    ]
  )
  return gen_code.code
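
Putting the server pieces together, the endpoint ends up looking roughly like the sketch below. This is a simplified reconstruction rather than the exact route from the repo; the /generate path and the JSON response shape are assumptions (the frontend later expects an object with a code field):

@app.post("/generate")
def generate_component(request: UserGenerationRequest):
    # 1. Break the user's prompt down into sub-tasks
    sub_tasks = generate_sub_tasks(request.prompt)

    # 2. Look up relevant code chunks for each sub-task and deduplicate them
    relevant_results = collection.query(query_texts=sub_tasks, n_results=3)
    ctx, uniq = [], set()
    for i in range(len(relevant_results["metadatas"])):
        for code_sample, sample_query in zip(
            relevant_results["metadatas"][i], relevant_results["documents"][i]
        ):
            if sample_query not in uniq:
                ctx.append(f"Eg.{sample_query}\n```{code_sample['chunk']}```\n")
                uniq.add(sample_query)
    ctx_string = "".join(ctx)

    # 3. Generate the component and hand it back to the frontend
    generated_code = generate_code(ctx_string, request.prompt)
    return {"code": generated_code}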

Frontend

Rendering the Generated Code

The frontend proved to be a little bit difficult because NextJS would not allow me to inject the generated React code directly. This makes sense since, normally, we write React code which is then bundled into a source map and JavaScript files before it's rendered in the browser.

So, as a workaround, I realised that I could use a client-side dynamic import to render the component. On the backend, we support this by adding the following snippet to our route:

with open('../src/generated/component.tsx', 'w') as f:
    f.write(generated_code)

We can then import this component on the frontend by defining a dynamically imported component in NextJS:

import dynamic from "next/dynamic";

const DynamicHeader = dynamic(() => import("../generated/component"), {
  loading: () => <p>Loading...</p>,
  ssr: false, // add this line to disable server-side rendering
});

Once this was done, we could just import it like any other component and use it. I wrote up a quick and dirty client component using server actions to achieve this result.

"use client";
import { Button } from "@/components/ui/button";
import { Input } from "@/components/ui/input";
import { Label } from "@/components/ui/label";
import { Textarea } from "@/components/ui/textarea";
import { useToast } from "@/components/ui/use-toast";
import { ClearComponent, SubmitPrompt } from "@/lib/prompt";
import dynamic from "next/dynamic";
import React, { useState, useTransition } from "react";
import { ClipLoader } from "react-spinners";
import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs";

const DynamicHeader = dynamic(() => import("../generated/component"), {
  loading: () => <p>Loading...</p>,
  ssr: false, // add this line to disable server-side rendering
});

const UserInput = () => {
  const [isPending, startTransition] = useTransition();
  const [userInput, setuserInput] = useState("");
  const [generatedCode, setGeneratedCode] = useState("");

  return (
    <div className="max-w-xl mx-auto">
      <h1>GPT-4 Powered React Components</h1>

      <form
        className="my-4 "
        onSubmit={(e)=> {
          e.preventDefault();
          startTransition(()=> {
            SubmitPrompt(userInput).then((res)=> {
              setGeneratedCode(res["code"]);
            });
          });
        }}
      >
        <Label>Prompt</Label>
        <Textarea
          value={userInput}
          onChange={(e)=> setuserInput(e.target.value)}
          placeholder="I want to create a login form that..."
        />
        <div className="flex items-center justify-end mt-6 space-x-4">
          <Button type="submit">Submit</Button>
        </div>
      </form>

      {isPending ? (
        <>
          {" "}
          <ClipLoader speedMultiplier={0.4} size={30} /> <span>Generating Component...</span>
        </>
      ) : generatedCode.length === 0 ? (
        <p className="text-center">No Component Generated yet</p>
      ) : (
        <Tabs defaultValue="code">
          <TabsList className="grid w-full grid-cols-2">
            <TabsTrigger value="code">Code</TabsTrigger>
            <TabsTrigger value="component">Component</TabsTrigger>
          </TabsList>
          <TabsContent value="code">
            <code className="relative rounded px-[0.3rem] py-[0.2rem] font-mono text-sm">
              {generatedCode.split("\n").map((item, index) => (
                <div key={index} className="py-1 px-4 bg-muted">{item}</div>
              ))}
            </code>
          </TabsContent>
          <TabsContent value="component">
            <div className="mt-6 border px-6 py-6">
              <DynamicHeader />
            </div>
          </TabsContent>
        </Tabs>
      )}
    </div>
  );
};

export default UserInput;

Takeaways

This was a very fun project to implement as a quick weekend hack and, again, it is not production-level code at all, not even close.

Main Problems

I think there are a few main things that plague this project

  1. A lack of bundler validation
  2. Insufficient Data
  3. Hacky way of rendering components

One of the best ways to validate the react code is to have access to a javascript bundler which can deterministically tell you if a given chunk of code is valid or not.

However, I wasn't too sure how to do that using Python. As a result, we would sometimes end up with code that was almost correct but was missing a few imports. This would have been easy to catch if we had access to a bundler.
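
One possible approach, which I haven't tried, is to shell out to the TypeScript compiler (or a bundler such as esbuild) from the Python server and reject generations that fail to compile. A rough sketch, assuming Node and TypeScript are installed in the NextJS project and that we type-check against the project's tsconfig so that path aliases like @/components resolve:

import subprocess

def generated_code_compiles(project_dir: str = "../") -> bool:
    # Type-check the whole NextJS project (including the freshly written
    # src/generated/component.tsx) without emitting any build output.
    result = subprocess.run(
        ["npx", "tsc", "--noEmit"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the compiler errors, e.g. missing imports or unknown symbols
        print(result.stdout)
    return result.returncode == 0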

Another thing the model suffers from is a simple dataset. The embedding search and lookup is only as good as the quality of the examples we feed into it. However, since the tasks are being generated by GPT-4 and we have no easy way of evolving the given examples into more complex ones without spending more time on it, this limits the ability of our model to generate complex examples.

I personally found that when we gave it more complex examples, it started making references to objects it didn't know how to mock. Code like the following would be generated

<span>{user.email}</span> // user is not mocked at all, throwing an error

Lastly, rendering components using a dynamic import is very hacky. I personally don't think it's the best way, but I was unable to think of an easy way to bundle the generated code and inject it as HTML from my endpoint.

Future Improvements

If I had more time to work on this, I'd probably work on the following things

  1. Fine Tuning: A large chunk of these tasks could probably rely on a well-tuned GPT-3.5 model. Examples could be generating small and specific task files (Eg. helper functions, simple input and label forms).

  2. Using an LLM Validator: I recently learnt that instructor also provides an llm_validator helper that you can validate outputs against. That might have been a better place to put the rules validation that we defined (see the sketch after this list).
  3. Using multiple agents: I think this project would be best served by more specialised agents for specific tasks (Eg. one for design, one for planning, another for writing tests etc). Lindy does something similar with calendars and emails.

  4. Dataset Improvements: It would be interesting to use something like evol-instruct to come up with either more complex prompts in our database or more complex code examples that show how to build increasingly difficult schemas. I strongly believe this would have improved the quality of the generated output.
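
For reference, here's a rough sketch of what point 2 might look like. It follows the llm_validator usage from the instructor docs at the time; treat the exact wiring as an assumption rather than tested code:

from typing import Annotated

from instructor import llm_validator
from pydantic import BaseModel, BeforeValidator


class ValidatedCodeResult(BaseModel):
    # Hypothetical variant of CodeResult where an LLM checks our rules on every generation
    code: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                "Only imports from 'React', '@/components/ui/*' and 'next/*' are allowed, "
                "and every toast must include a title and description."
            )
        ),
    ]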