> [!hint] Did I actually save time using AI?
> I'm tempted to think, actually, no. This is an unfortunate conclusion, but it's also not precisely apples to apples, because this project simply wouldn't exist if I couldn't ask the AI to do it for me.
The code-for-me adventure continues. Other posts in this impromptu series:
- [[A review of o1 Pro so far]]
- [[Building side projects with AI? Time to procrastinate]]
- [[Gemini Experimental Advanced vs O1 Pro]]
My latest side project, a speech-to-text app called Heyo, is essentially feature complete at this point. One of the last remaining todos was to add a server mode so that I could also use it as an API. The idea is to be able to use this app as a piece of any transcription pipeline.
This seemed pretty straightforward to do with the web technologies I'm used to (TypeScript, Go, Python), but a bit less straightforward with Swift. Swift, after all, is not typically used to write web servers.
Of course, it is a general-purpose programming language, so it can be done. When I had o1 Pro take on this task it _almost_ solved it straight away. However, it took a rather strange approach: it implemented manual HTTP parsing logic, and in a very simplistic way. It felt like code the intern might write[^1].
Since this is both a personal project and an "AI project"[^2] I wouldn't have minded, but it failed in an odd way.
## The server
The server has essentially a single endpoint which lets the user post a file to be transcribed. This mimics the OpenAI Whisper API.
```http
POST /v1/audio/transcriptions
Content-Type: multipart/form-data; boundary=form-data-boundary-abc

--form-data-boundary-abc
Content-Disposition: form-data; name="model"

whisper-1
--form-data-boundary-abc
Content-Disposition: form-data; name="file"; filename="recording.mp3"
Content-Type: application/octet-stream

[the binary mp3 data]
--form-data-boundary-abc--
```
In the Swift code the AI wrote a custom "parser" (I'm using the term quite loosely) to handle multipart form-encoded HTTP bodies. This is all very standard stuff, but I suppose the AI hasn't seen much training data on implementing it directly, since it only half succeeded.
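I didn't keep the AI's original Swift, but the shape of the approach was roughly this, sketched in JS to match the other examples in this post (the function name and structure are my reconstruction, not the actual code):

```js
// A string-splitting multipart "parser", roughly the level of rigor
// the AI's Swift code had. It treats the entire body as one string,
// splits on the boundary, then splits each part at the blank line
// separating headers from content.
function parseMultipart(rawBody, boundary) {
  return rawBody
    .split(`--${boundary}`)
    .map((part) => part.trim())
    .filter((part) => part && part !== "--")
    .map((part) => {
      const [rawHeaders, ...content] = part.split("\r\n\r\n");
      return { headers: rawHeaders, content: content.join("\r\n\r\n") };
    });
}
```

This style of code works right up until some assumption about the input breaks. One did.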
The problem manifested as some requests working fine while others failed. Initially I was testing the endpoint using `curl`, which worked great. Perhaps unsurprisingly, since I had given an example `curl` request to the AI and said "this should work". It did work. However, things got odder when I tried making the request using Node.js.
The first oddness was simply a Node.js quirk, and was not the fault of the AI-written server:
- Lots of examples of uploading a file with Node use `fs.createReadStream` with `fetch`. _However_, that is actually a behavior of the third-party `node-fetch` library, not of the built-in `fetch` function which now comes standard with Node.
- The Node `fetch` docs simply say it's a "web compatible" implementation, which is thoroughly unhelpful when it comes to file uploads, since a web browser cannot itself read files from disk and prepare them for upload. This is something specific to Node.js, so... what the hell, Node?
- Luckily someone wrote a blog post explaining that it can be done using `fs.openAsBlob`, an API I had never heard of until this point, and one that is actually still experimental.
- This seems to mean that while Node has `fetch` support, its ability to upload files is actually experimental. Not to mention all that talk of "web compatible" seems to ignore the obvious fact that Node.js processes have needs outside the scope of browsers, and `fetch` should support them. This is the kind of gotcha that crops up all the time in the JS world, unfortunately.
The end result of all this was that the Node process was not actually sending a file at all, but was sending the infamous JS string `[object Object]`, which is what you get when you coerce a non-string object to a string. Using `fs.openAsBlob` was a step in the right direction, but then resulted in `[object Promise]` being sent in place of an actual audio file.
Easy enough to resolve (`await` the promise), but I'm left wondering how `fs.openAsBlob` actually works. Why do we need to `await` it? ~~Does it load the whole file into memory?~~ It does not load the whole file into memory[^3]. I'm not sure why it must be awaited, though, when `fs.createReadStream` is a synchronous call and only reading chunks is async.
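In hindsight the progression of failures is easy to reproduce. A minimal sketch, assuming Node 19.8+ where `fs.openAsBlob` exists:

```js
import fs from "node:fs";

const formData = new FormData();

// Broken: a ReadStream is not a Blob, so FormData coerces it to the
// string "[object Object]" instead of attaching the file bytes.
// formData.append("file", fs.createReadStream("test-recording.mp3"));

// Still broken: fs.openAsBlob returns a Promise<Blob>, and an
// un-awaited Promise gets coerced to the string "[object Promise]".
// formData.append("file", fs.openAsBlob("test-recording.mp3"));

// Works: await the Promise to get an actual Blob.
formData.append("file", await fs.openAsBlob("test-recording.mp3"));
```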
The next bit of oddness was that I was getting different results when uploading with Node.js versus Bun. Bun is not always 100% Node-compatible, but looking at the code and the requests it seemed like things should still work.
The first bit of code, uploading directly with the built-in `fetch` function, worked with both Bun and Node:
```js
import fs from "node:fs";
const filePath = new URL("test-recording.mp3", import.meta.url);
const fileBlob = await fs.openAsBlob(filePath);
const formData = new FormData();
formData.append("file", fileBlob);
const response = await fetch("http://localhost:18932/v1/audio/transcriptions", {
  method: "POST",
  body: formData,
});
const transcription = await response.json();
console.log(transcription);
```
However, when I tried using the OpenAI SDK Node succeeded while Bun failed:
```js
import fs from "node:fs";
import OpenAI from "openai";
const openai = new OpenAI({ baseURL: "http://localhost:18932/v1" });
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream(new URL("test-recording.mp3", import.meta.url)),
});
console.log(transcription);
```
I had started out this project wanting to simply get it done quickly, but my interest was piqued so I decided to go deeper.
I opened up Wireshark to see what the requests actually looked like. What was the difference? Why would one succeed while the other fail?
> [!hint] I made sure Bun wasn't the issue
> I hit the official OpenAI API using Bun and everything worked as expected, so I could rule out Bun as the issue. Something about my server didn't like the request coming from Bun but it handled the request coming from Node just fine.
Glancing at the requests, nothing stood out immediately as different. The content lengths were not the same, though, so clearly there were some differences.
After some digging around I finally found the issue: quoted form boundaries. Bun[^4] was defining its form boundary like so:
```
Content-Type: multipart/form-data; boundary="-WebkitFormBoundaryb28"
```
Both Node.js and `curl` defined theirs like so, without quotes:
```
Content-Type: multipart/form-data; boundary=form-data-boundary-abc
```
The end result was that the AI-written server was treating the quotes as part of the boundary, and naturally that caused the whole parsing operation to fail: it couldn't find the relevant audio data nestled in amongst all the other bits.
Once the issue was apparent the AI was quickly able to rewrite the code to handle this.
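For what it's worth, the fix boils down to stripping optional quotes when reading the `boundary` parameter out of the `Content-Type` header, since MIME allows parameter values to be quoted. The actual fix lives in the Swift server; here's a minimal sketch of the idea in JS (`extractBoundary` is a hypothetical name):

```js
// Extract the multipart boundary from a Content-Type header value,
// accepting both the quoted (Bun/WebKit) and unquoted (Node, curl) forms.
function extractBoundary(contentType) {
  const match = /boundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  return match ? (match[1] ?? match[2]) : null;
}

extractBoundary('multipart/form-data; boundary="-WebkitFormBoundaryb28"');
// => "-WebkitFormBoundaryb28"
extractBoundary("multipart/form-data; boundary=form-data-boundary-abc");
// => "form-data-boundary-abc"
```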
## Conclusion
Was this actually any faster than writing the code myself? If I had written the code myself I would have been primed to see the differences in the requests and might not have made the mistake in the first place.
Of course, I don't know Swift, so the DIY approach would have involved learning a new programming language. That's not a big ask when there's a lot at stake, like when working on a new long-term codebase. But this was a side project, and Swift is only good for writing Apple apps, so I was disinclined to learn the language itself.
It's safe to say that I would not have gotten this project done without the help of AI because I simply wouldn't have started it. But it also took much more time than I had expected.
[^1]: Or code that my past self might write. Intern software engineers generally come from a CS background and would, I assume, be more prone to come up with a robust parsing solution rather than the string-splitting approach the AI used.
[^2]: As in, a project where you don't care about the code quality and only care that it works. The point of using the AI is to save time to get to a solution, not to craft an artisanal solution.
[^3]: https://github.com/nodejs/node/issues/45188
[^4]: Bun uses the same internals as WebKit (Safari), so this way of encoding form boundaries is likely not Bun-specific but rather WebKit-specific.