A few decisions I'm least settled on, and would love some pushback/feedback on:
- single arm vs. bimanual (I went single for cost/space, knowing it rules out things like folding cloth)
- not calibrating camera extrinsics/intrinsics for now
- RGB vs. RGB-D for from-scratch policies (ACT / Diffusion Policy)
And one I'm more confident about but expect disagreement on: not building on ROS 2 / LeRobot, and writing my own stack instead. Happy to get into the reasoning.
>I do not intend to calibrate the camera’s extrinsics or intrinsics for now.
Sensible choice, although I'd suggest that in the long run it pays to do it at an early stage of your setup, especially if you intend to collect data for policy learning.
Debugging trained policies for visual manipulation tasks can be a headache, and keeping as much context as possible on variables that can change is good practice.
My previous setup was in Japan, an earthquake-prone place, and I wasted some time after realizing the camera had been misaligned by an earthquake. A simple solution is to place an ArUco marker on the table that tracks the relative extrinsic position of the camera, and add it as metadata to the collected teleoperation dataset.
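For what it's worth, the drift check itself can be tiny once you have the marker pose from whatever detector you use. A minimal sketch, assuming you already get a 3x3 rotation matrix and a translation (in metres) for the marker in the camera frame. The names `pose_drift`, `camera_metadata`, `rot_z` and the thresholds are hypothetical, not from any library:

```python
import math


def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def pose_drift(ref_R, ref_t, cur_R, cur_t):
    """Return (rotation drift in degrees, translation drift in metres)
    between a reference marker pose and the current marker pose,
    both expressed in the camera frame."""
    rel = matmul(transpose(ref_R), cur_R)           # relative rotation
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for acos
    return math.degrees(math.acos(cos_angle)), math.dist(ref_t, cur_t)


def camera_metadata(ref_R, ref_t, cur_R, cur_t, max_deg=1.0, max_m=0.005):
    """Build a per-episode metadata dict; thresholds are assumed values."""
    angle_deg, dist_m = pose_drift(ref_R, ref_t, cur_R, cur_t)
    return {
        "marker_rotation_drift_deg": angle_deg,
        "marker_translation_drift_m": dist_m,
        "camera_moved": angle_deg > max_deg or dist_m > max_m,
    }


def rot_z(deg):
    """Rotation about the z axis, only used here to fake a bumped camera."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


meta = camera_metadata(rot_z(0), [0.0, 0.0, 1.0], rot_z(5), [0.01, 0.0, 1.0])
print(meta["camera_moved"])
```

Storing the raw drift numbers next to each episode, rather than only the boolean, lets you pick the threshold later when debugging a policy.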
Right now the static camera is probably really bad: it's mounted on my desk, so it's very easy to bump into it and move it. So yeah, its position will for sure change over time. I think I need a better solution, maybe a rail system that's more rigidly attached to the robot arm so that at least the camera stays fixed relative to that point of reference.
My project is https://github.com/colinator/Ariel - basically, no VLAs - instead, "just write code". Or have the agents do it.
I don't have a writeup yet about applying Ariel to _this_ robot, but this is for a previous one: https://colinator.github.io/Ariel/post1.html.
Excited to follow your progress!
And yes, having a nice robot makes life so much easier. Very happy with my choice so far!
- Calibration is not required for VLA models.
- RGB or Stereo RGB inputs are sufficient for ACT, DP, and PI0/PI05.
- ROS2 is not strictly required, but it can be useful for sharing/co-developing code. For instance, the Stanford team built a custom framework for diffusion policy instead. I also developed a similar framework because ROS2 is not optimized for bi-manual manipulation or VLA workloads.
Also any thoughts on action space representation? Seems to me people are settling on flow matching mostly, but pi still uses discrete tokens to supervise the upstream backbone VLM. I also like the simplicity of discrete bins and used that successfully in the past.
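To make the "discrete bins" option concrete, here is a minimal sketch of uniform binning for one action dimension, assuming a known range `[low, high]`. The function names and the choice of 256 bins are illustrative assumptions, not how pi or any particular codebase does it:

```python
def action_to_bin(value, low, high, num_bins):
    """Map a continuous action to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)           # clip out-of-range actions
    frac = (clipped - low) / (high - low)          # normalise to [0, 1]
    return min(int(frac * num_bins), num_bins - 1) # top edge goes in last bin


def bin_to_action(index, low, high, num_bins):
    """Map a bin index back to the centre of that bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


token = action_to_bin(0.3, -1.0, 1.0, 256)
recovered = bin_to_action(token, -1.0, 1.0, 256)
print(token, recovered)
```

The round trip loses at most half a bin width, so the bin count directly sets the action resolution the policy can express.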
PI smartly combined discretized tokens with flow-matching for efficient training, and it works well in most cases. Still, end-effector representation may be better for teleop with devices like a SpaceMouse, VR, or Vive Tracker. PI-07 also supports EEF, but I am not sure how much data is needed to fine-tune PI-05 for that.
I'd suggest starting with the default pi05 model. Data strategy is probably more important than model improvements, since VLA performance is highly dependent on the data/action distribution, which is easy to modify. After that, you can add high-level reasoning like PI05. I visited a Chinese VLA company that already adopted the PI-05 approach, and it works quite well in practice.
For depth I agree on the VLA route but for ACT / DP-style imitation learning from scratch it seems more feasible (since you’re not fighting a pretrained model that was not trained on this modality). Might also increase robustness since you naturally end up with an input that’s invariant to colors / textures. Plan is to try both paths: the from scratch (and then ablate RGB vs RGB-D) and the VLA + fine-tuning one.
There is already plenty of research around multimodal diffusion policies. While DP typically doesn't require pre-training, you can boost data size by combining a depth estimation model with open data.
There should exist a minimal, clear, robotics library like what you’ve built. The Flask of the robotics world.
I also agree on the need for a simple, easy, extensible open source framework. LeRobot IMO is some of this but also contains the dataset + ML code. I think Flask is nice because it's so singularly focused on just one thing with extensibility if you need extras.
So I really like the idea. But being an OSS maintainer these days seems... intense.
Would like to know your reasoning on not going with LeRobot.
Re why not SO-101: the article has a footnote about this; I actually bought the SO-101 as well! I want to integrate it into the same setup so I can switch depending on task.
Somewhat surprisingly the xarm was actually much faster to arrive; I got it within 2 days of ordering. I don’t have a 3D printer, and getting the SO-101 from the vendor I ordered it from took almost 4 weeks. So partially it just came down to what I had access to more quickly.
Second point is reliability: I think the SO-101 is cool but I’d be surprised if it doesn’t break more quickly than the xarm. I wanted something that’s going to last a long time without headaches. And these industrial arms are really mature hardware wise now.
Hope this helps!
Will email you to compare notes.
How have you found it?
(The author does explain his reasons for not using LeRobot in the post - although "I also use LeRobot for training and running baseline policies, and the vendor SDKs for the hardware.")
Ah! That is exactly what I use it for as well :-)
I am liking the SO101 - teleop and robot both work great. For sure, it is very easy to get started with. I was able to collect around 50 demos with them and train my first ACT policy within days of setting up the robots. Happy to share more detailed learnings from this if/when you get started with it. https://github.com/avilay/learn-robotics
I wanted to build something making bows automatically and damn, pretty complex and expensive to buy the parts.
I was like, I can probably just buy a 3D printer, print all the parts and buy some motors, but it seems it's way more complicated than that.
Like, just to play with a single-hand robot it looks like you need $10k+. I wanted to spend max 1k, 3D print parts, buy motorized parts on Alibaba, and code on my Mac + spare GPUs I have access to. I'll have to save a little :(
And hardware is very fun! It’s also very frustrating. But to me worth it.
Reminds me of https://rodneybrooks.com/why-todays-humanoids-wont-learn-dex... which is basically a stark warning against the hype.
And yeah I feel you re humanoid. I worked on the Rubik's cube project at OpenAI, which used a humanoid hand, and it was insanely painful and hard. Also fun anecdote: it was completely impossible to teleop the shadow hand. We had a data glove to capture hand movements but as soon as contact / haptics come in, you're lost. We could never even get a single rotation on the Rubik's cube via teleop.
I do think simpler hardware like the one described in my post works really well though, and it's so much easier to do something with it.
I would have liked to explore a bit more, but doing innovation readiness on technologies related to XR I basically had to move on to the next project. This though happened mostly because, as pointed out by the essay by Brooks, dexterity is hard, on both ends. Namely yes one can get a basic robot "arm" for cheap... but doing a robotic arm with a hand is something else, one with fine motor control is ... well I'm not an expert in the field but basically it doesn't exist yet IMHO. Sure we have grippers but that has basically nothing to do with a human hand. It's amazing how much flexibility we have at our disposal in such a compact and efficient form: sensors for touch obviously but also heat, proprioception of course, parts that are smooth and flexible while others are hard. The range of craftsmanship we can do is... mindblowing. If you don't believe me just look at a basic magician, not even a good one (like me, I confess), doing sleight of hand, it's just amazing.
So that was on the robotic part, discovering what it could do, amazing, but more importantly for my work what it could NOT do.
A seemingly simple project was how to remove a 3D print from our 3D printer in order to free it up and move to the next job. This sounds trivial ... until you try to actually do it. I won't get into details but we didn't manage. It's of course feasible in the ideal scenario, e.g. a successful print that is mostly rigid, with attachment and support to the plate that requires just the right amount of force. It can be done. Now doing that in the realistic set of scenarios that a 3D printing house would face... well maybe it's feasible but I professionally didn't know (and still don't know) how to do it within a realistic set of constraints (time wise, economically speaking too).
Moving on then to the other end, or hand (sorry for the pun): tracking in VR is good, honestly. It's quite fun for games... but try to do so in a professional scenario, and a 1mm difference or an occlusion for 0.1s is not acceptable anymore.
TL;DR (sorry I have to run and it might be longer than what you even asked for!) : the concept itself is obviously good, especially from a programmer standpoint. We are experts at automating, in fact I'd argue that's the 1 thing we excel at. The implementation though, in real life, is much harder than we naively consider, even with a LOT more computing power.
TL;DR (short): quick wins, yes, harder wins... not intractable but at least beyond my own ability.
On attachments: during this project I really wanted a 3D printer several times. So that's probably next on the shopping list.
The Ufactory arm is actually quite extensible: it exposes digital input/output and you have a standard wrist mount where you can mount different end effectors or attachments.
Something I’m working on is a hardware CLI for agents to run experiments, with a “CICD” pipeline that validates everything and means I can delegate more of the experiments to the agents. I wonder if you have any thoughts on this?
The idea is to allow the coding agent to run the full loop of experiments and validations, with vision, audio, button pressing, speaking etc to interact in place of the human
Have you seen the recent nvidia thing? They do this at scale for robotics manipulation: https://research.nvidia.com/labs/gear/enpire/
I'm finding a gap just before running those experiments.
The process of updating firmware, doing basic smoke tests on each device and validating it is live, and can function correctly.
Basically the pre-deployment green light that you get on github, but for hardware.
Have you seen or thought about that at all?
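To illustrate what I mean by a green light for hardware, here is a rough sketch of a preflight gate: a list of named checks that must all pass before experiments are allowed to start. Everything here is hypothetical; `run_preflight` and the stub checks stand in for real firmware-version, connectivity, and camera probes:

```python
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_preflight(checks: List[Check]) -> Tuple[bool, Dict[str, str]]:
    """Run every check, record pass/fail/error per check, and return
    (all_green, report). A check that raises counts as a failure."""
    report: Dict[str, str] = {}
    for name, check in checks:
        try:
            report[name] = "pass" if check() else "fail"
        except Exception as exc:  # a crashing probe must not abort the run
            report[name] = f"error: {exc}"
    all_green = all(status == "pass" for status in report.values())
    return all_green, report


def camera_probe() -> bool:
    """Stub for a device probe; here it simulates an unplugged camera."""
    raise RuntimeError("no frames received")


checks: List[Check] = [
    ("firmware_version_matches", lambda: True),  # stub: compare to pinned version
    ("arm_responds_to_ping", lambda: True),      # stub: query the controller
    ("camera_streams_frames", camera_probe),
]

ok, report = run_preflight(checks)
print(ok, report)
```

Running all checks instead of stopping at the first failure gives the agent (or you) the full picture in one pass.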
Of course, it's impossible to know for sure what was LLM processed or not, but some of your posts (like this one) are getting classified that way.
I think pushing the sota is quite hard to do solo but we'll see. Mostly I want to get back up to speed after having not done much robotics during the last 6 years. Best way for me to learn is to just do it, so here we are. We'll see how far I get (I suspect at some point compute will be the main bottleneck)
- I've heard the advantage of ROS besides the architecture is the ecosystem (driver integrations, etc). Is that not an issue because the arm supports a Python SDK OOTB?
- Any issues you've been running into with this setup?
- How do you determine if a session recording is good enough for training? Is 50/100 samples really all you need?
Re your questions:
- The driver situation turned out totally fine; I intentionally picked HW with good python sdk support so that was very painless.
- The static camera (the C920) is not super great; it drops frames and sometimes cuts out. We’ll see how that goes but it’s probably the first thing I want to swap right now. Another issue is the reach of the arm when forcing the wrist to be axis-parallel with the table; you cannot get very far away. The chess setup demo in the video gives an example: I can just reach the row of pawns and beyond that it’s out of reach.
- I don’t know yet! The 50-100 figure comes from the ACT and diffusion policy papers but it depends on the type of task. For fine tuning my sense is that you only need a few hours' worth of demos to get good results with pi0.5 etc. A big reason I’m doing this project is that I want to try all of this myself, so the next posts definitely will talk about that.
That being said, I might switch to a realsense for the static tabletop camera as well; the realsense wrist is clearly much more reliable than the cheap Logitech C920 that I currently use.
Time is one of the hard problems in robots, because they are inevitably but non-obviously distributed systems.
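A small example of where this bites: the camera and the arm each stamp data on their own schedule, so building a training sample means pairing them by time. A minimal sketch, assuming both streams are already on a common clock and the state timestamps are sorted. `match_nearest` and the 10 ms tolerance are hypothetical choices:

```python
import bisect


def match_nearest(frame_ts, state_ts, max_gap):
    """For each camera frame timestamp, return the index of the nearest
    robot-state timestamp, or None if the nearest one is further away
    than max_gap seconds. state_ts must be sorted ascending."""
    matches = []
    for t in frame_ts:
        i = bisect.bisect_left(state_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(state_ts)]
        best = min(candidates, key=lambda j: abs(state_ts[j] - t), default=None)
        if best is not None and abs(state_ts[best] - t) > max_gap:
            best = None  # too stale: treat the pairing as missing
        matches.append(best)
    return matches


state_ts = [0.00, 0.01, 0.02, 0.03]   # robot state stream
frame_ts = [0.004, 0.019, 0.5]        # camera stream with a late frame
print(match_nearest(frame_ts, state_ts, max_gap=0.01))
```

The `None` entries are worth logging rather than silently dropping, since a run of them usually points to a stalled sensor or a clock problem.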
Robots are annoyingly, wonderfully difficult.
It would be interesting to explore how RL can be applied on top of my (flawed) human demos to optimize beyond what I’m able to do.
Tell me more! I am slightly biased in that direction. But can’t fully justify it at this point.
I am not an official supporter of the library but am asking out of curiosity.
On control: LeRobot will change all the time and I’ll be unaware of what changed. If something suddenly doesn’t work anymore, it’s a pain to find out. I can of course fork or pin but that defeats the purpose a bit.
At the end it’s also partially just preference: I wanted to write this layer myself, and I have opinions about how it should be architected, so I did.
The app did a decent job at surfacing problematic comments that a mod can do something about.
It was cool to optimize llama-cpp arguments for throughput. During the slightly off-peak hours the post processing was pretty much real time. I suspect a second 3090 would’ve been enough for peak posting hours too.