Building a robotics research setup that lives next to my desk

177 points by mplappert 7 days ago|60 comments

Quick framing, since the post is long: I did robotic manipulation research at OpenAI from 2017–2020, and the tabletop setup back then cost roughly 10x this one and took a team to run. This project is me testing whether a single person can now do meaningful work on the same class of problems, starting with the physical and software setup.

A few decisions I'm least settled on, and would love some pushback/feedback on:

- single arm vs. bimanual (I went single for cost/space, knowing it rules out things like folding cloth)

- not calibrating camera extrinsics/intrinsics for now

- RGB vs. RGB-D for from-scratch policies (ACT / Diffusion Policy)

And one I'm more confident about but expect disagreement on: not building on ROS 2 / LeRobot, and writing my own stack instead. Happy to get into the reasoning.

•

NalNezumi 6 days ago

Cool stuff. At my previous (startup / research) job I had to set up a similar system (but with a Franka arm and multi-view cameras) alone, because I was the only one with a robotics background.

>I do not intend to calibrate the camera’s extrinsics or intrinsics for now.

Sensible choice, although in the long run I'd suggest doing it at an early stage of your setup, especially if you intend to collect data for policy learning.

Debugging trained policies for visual manipulation tasks can be a headache, and having as much context as possible on variables that can change is good practice.

My previous setup was in Japan, an earthquake-prone place, and I wasted some time after realizing the camera had become misaligned due to an earthquake. A simple solution is to place an ArUco marker on the table to track the camera's extrinsic pose relative to it, and add that as metadata to the collected teleoperation dataset.
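A rough sketch of the marker idea, simplified to a pixel-level drift check rather than a full extrinsic pose estimate. It assumes the marker corners have already been detected elsewhere (for example with an ArUco detector) and are available as plain (x, y) pixel tuples. The function names, the metadata keys, and the 2-pixel threshold are all made up for illustration.

```python
import math


def marker_drift_px(reference_corners, current_corners):
    """Mean pixel distance between matching marker corners."""
    if len(reference_corners) != len(current_corners):
        raise ValueError("corner lists must have the same length")
    distances = [
        math.dist(ref, cur)
        for ref, cur in zip(reference_corners, current_corners)
    ]
    return sum(distances) / len(distances)


def camera_metadata(reference_corners, current_corners, threshold_px=2.0):
    """Build a metadata record to store next to a teleoperation episode."""
    drift = marker_drift_px(reference_corners, current_corners)
    return {
        "marker_corners_px": [list(c) for c in current_corners],
        "marker_drift_px": drift,
        # The threshold is an arbitrary assumption; tune it for the camera.
        "camera_moved": drift > threshold_px,
    }


# Corners recorded when the setup was known to be good (hypothetical values).
reference = [(100.0, 100.0), (200.0, 100.0), (200.0, 200.0), (100.0, 200.0)]
# Simulate a bumped camera: every corner shifts by (3, 4) pixels.
shifted = [(x + 3.0, y + 4.0) for x, y in reference]
meta = camera_metadata(reference, shifted)
```

Storing the raw corners alongside the flag means a later, proper pose estimate can still be computed from the same metadata.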

•

mplappert 6 days ago

Great points and I very much appreciate the input!

Right now the static camera is probably really bad: It's mounted on my desk, so it's very easy to bump into it and move it. So yeah, its position for sure will change over time. I think I need a better solution, maybe a rail system that's more rigidly attached to the robot arm so that at least the camera stays fixed relative to that point of reference.

•

colinator 6 days ago

Good move getting a nice robot! I'm doing something similar to you, but I went with a cheap robot, the "HIWONDER 6DOF Robotic Arm Kit". It was only $600, but wow is it bad. The precision and repeatability are both "are you drunk?" level. I can hear the gears grind when it moves. I suppose I should upgrade. But if my system can work with a terrible robot, I assume it would work even better with a nice one!

My project is https://github.com/colinator/Ariel - basically, no VLAs - instead, "just write code". Or have the agents do it.

I don't have a writeup yet about applying Ariel to _this_ robot, but this is for a previous one: https://colinator.github.io/Ariel/post1.html.

Excited to follow your progress!

•

mplappert 6 days ago

Wow very cool project and thanks for the Ariel pointer! Added to my reading list.

And yes, having a nice robot makes life so much easier. Very happy with my choice so far!

•

b89kim 5 days ago

- A single arm is sufficient for validating basic Pick/Place tasks, but more complex scenarios require Bi-arm

- Calibration is not required for VLA models.

- RGB or Stereo RGB inputs are sufficient for ACT, DP, and PI0/PI05.

- ROS2 is not strictly required, but it can be useful for sharing/co-developing code. For instance, the Stanford team built a custom framework for diffusion policy instead. I also developed a similar framework because ROS2 is not optimized for bi-manual manipulation or VLA workloads.

•

mplappert 5 days ago

Did you ever ablate RGB vs RGB-D? I haven’t seen that much in the literature.

Also any thoughts on action space representation? Seems to me people are settling on flow matching mostly, but pi still uses discrete tokens to supervise the upstream backbone VLM. I also like the simplicity of discrete bins and used that successfully in the past.
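For anyone unfamiliar with the discrete-bins option, here is a minimal sketch of one way to do it for a single action dimension. The range of [-1, 1] and the 11 bins are assumptions for illustration, not values from any particular paper or from my stack.

```python
def action_to_bin(value, low, high, num_bins):
    """Map a continuous action value to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper edge (fraction == 1.0) would land in bin num_bins, so clamp it.
    return min(int(fraction * num_bins), num_bins - 1)


def bin_to_action(index, low, high, num_bins):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Example: one action dimension in [-1, 1] split into 11 bins (both assumptions).
idx = action_to_bin(0.0, -1.0, 1.0, 11)
recovered = bin_to_action(idx, -1.0, 1.0, 11)
```

The policy then predicts a categorical distribution over bin indices per dimension, and the round trip loses at most half a bin width of precision.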

•

b89kim 5 days ago

Adding a depth channel rarely yields a massive performance gain, likely due to data scarcity and the fact that modern VLAs are good at guessing distance directly from RGB. I have used multiple RGB-D cameras, but it is hard to get stable images without jitter. Depth can still be useful for high-level reasoning. PI also uses bounding-box or segmentation data from PI-05 for that.

PI smartly combined discretized tokens with flow-matching for efficient training, and it works well in most cases. Still, end-effector representation may be better for teleop with devices like a SpaceMouse, VR, or Vive Tracker. PI-07 also supports EEF, but I am not sure how much data is needed to fine-tune PI-05 for that.

I'd suggest starting with the default pi05 model. Data strategy is probably more important than model improvements, since VLA performance is highly dependent on the data/action distribution and it's easy to modify. After that, you can add high-level reasoning like PI05. I visited a Chinese VLA company that already adopted the PI-05 approach, and it works quite well in practice.

•

mplappert 5 days ago

This all makes a lot of sense, thanks for sharing!

For depth I agree on the VLA route, but for ACT / DP-style imitation learning from scratch it seems more feasible (since you’re not fighting a pretrained model that was not trained on this modality). Might also increase robustness since you naturally end up with an input that’s invariant to colors / textures. Plan is to try both paths: the from-scratch one (and then ablate RGB vs RGB-D) and the VLA + fine-tuning one.

•

b89kim 5 days ago

If you're using depth, you're better off starting with a diffusion policy (DP). We benchmarked ACT, DP, pi0, and pi05 on the same task; ACT underperformed in most cases.

There is already plenty of research around multimodal diffusion policies. While DP typically doesn't require pre-training, you can boost the dataset size by running a depth estimation model on open data.

•

delbronski 5 days ago

Really cool. Good move on writing your own stack. I was weighing using ros2 or building our own, but ended up going the ros2 route to “save time”. We are working with autonomous mobility robots in nature, so we figured the ecosystem around ros2 would be worth the compromise. It was not. All the time we saved setting things up initially, we are paying for now.

There should exist a minimal, clear, robotics library like what you’ve built. The Flask of the robotics world.

•

mplappert 5 days ago

Yeah I tried to use ROS / Gazebo about 10 years ago (it was still ROS 1) and getting it set up was an immense pain. I remember that creating new modules required writing CMake files. Maybe this is better now, but I decided to skip that.

I also agree on the need for a simple, easy, extensible open source framework. LeRobot IMO is some of this but also contains the dataset + ML code. I think Flask is nice because it's so singularly focused on just one thing with extensibility if you need extras.

So I really like the idea. But being an OSS maintainer these days seems... intense.

•

avilay 6 days ago

Hey this is cool! I am doing something similar myself with the SO101 arm robot from Robot Studio using a patchwork of my own code and LeRobot. Would love to collaborate with you if you are open to it. You can find me on Discord as `.avilay`. https://www.linkedin.com/posts/avilay_lerobot-huggingface-ro...

Would like to know your reasoning on not going with LeRobot.

•

mplappert 6 days ago

Looks very cool! I’m not a huge discord user but how about you shoot me an email and we can figure out how to share notes? (I don’t want to post it directly here but it’s easy to find on my personal website, just google my name)

Re why not SO-101: the article has a footnote about this; I actually bought the SO-101 as well! I want to integrate it into the same setup so I can switch depending on task.

Somewhat surprisingly the xarm was actually much faster to arrive; I got it within 2 days of ordering. I don’t have a 3D printer, and getting the SO-101 from the vendor I ordered it from took almost 4 weeks. So partially it just came down to what I had access to more quickly.

Second point is reliability: I think the SO-101 is cool but I’d be surprised if it doesn’t break more quickly than the xarm. I wanted something that’s going to last a long time without headaches. And these industrial arms are really mature hardware wise now.

Hope this helps!

•

avilay 6 days ago

Thanks for your response! I totally get your point about the delay in getting the robot, I ordered mine from PartaBot and they did take a couple of weeks to get here. But when they did, they worked great out-of-the-box :-)

Will email you to compare notes.

•

nl 6 days ago

I'm very interested in the SO101. I've never done any robotics and that seems a palatable entry level thing to try things out.

How have you found it?

(The author does explain his reasons for not using LeRobot in the post - although "I also use LeRobot for training and running baseline policies, and the vendor SDKs for the hardware.")

•

avilay 6 days ago

> The author does explain his reasons for not using LeRobot in the post - although "I also use LeRobot for training and running baseline policies, and the vendor SDKs for the hardware.")

Ah! That is exactly what I use it for as well :-)

I am liking the SO101 - teleop and robot both work great. For sure, it is very easy to get started with. I was able to collect around 50 demos with them and train my first ACT policy within days of setting up the robots. Happy to share more detailed learnings from this if/when you get started with it. https://github.com/avilay/learn-robotics

•

Frannky 5 days ago

This is so cool and looks so fun. I want to play with robots. I just play with software :(

I wanted to build something making bows automatically and damn, pretty complex and expensive to buy the parts.

I was like, I can probably just buy a 3D printer, print all the parts and buy some motors, but it seems it's way more complicated than that.

Like, to play with a single-hand robot it looks like you need $10k+. I wanted to spend max 1k, 3D print parts, buy motorized parts on Alibaba, and code on my Mac + spare GPUs I have access to. I'll have to save a little :(

•

mplappert 5 days ago

I’d recommend looking at the SO-101 then. Much much cheaper.

And hardware is very fun! It’s also very frustrating. But to me worth it.

•

utopiah 6 days ago

Ah nice, I tinkered a bit with robotics a few years ago, e.g. https://twitter-archive.benetou.fr/utopiah/status/1760260544... and it's pretty empowering... but also showcases how amazing the human body is. Our dexterity, ability to sense, etc.

Reminds me of https://rodneybrooks.com/why-todays-humanoids-wont-learn-dex... which is basically a stark warning against the hype.

•

mplappert 6 days ago

Cool video! What made you stop if I can ask?

And yeah I feel you re humanoid. I worked on the Rubik's cube project at OpenAI, which used a humanoid hand, and it was insanely painful and hard. Also fun anecdote: it was completely impossible to teleop the shadow hand. We had a data glove to capture hand movements but as soon as contact / haptics come in, you're lost. We could never even get a single rotation on the Rubik's cube via teleop.

I do think simpler hardware like the one described in my post works really well though, and it's so much easier to do something with it.

•

utopiah 6 days ago

Thanks! I did have fun tinkering at the intersection of both media.

I would have liked to explore a bit more, but doing innovation readiness on technologies related to XR I basically had to move on to the next project. This though happened mostly because, as pointed out by the essay by Brooks, dexterity is hard, on both ends. Namely yes one can get a basic robot "arm" for cheap... but doing a robotic arm with a hand is something else, one with fine motor control is ... well I'm not an expert in the field but basically it doesn't exist yet IMHO. Sure we have grippers but that has basically nothing to do with a human hand. It's amazing how much flexibility we have at our disposal in such a compact and efficient form, sensors for touch obviously but also heat, proprioception of course, parts that are smooth and flexible while others are hard. The range of craftsmanship we can do is... mindblowing. If you don't believe me just look at a basic magician, not even a good one (like me, I confess), doing sleight of hand, it's just amazing.

So that was on the robotic part, discovering what it could do, amazing, but more importantly for my work what it could NOT do.

A seemingly simple project was how to remove a 3D print from our 3D printer in order to free it up and move to the next job. This sounds trivial ... until you try to actually do it. I won't get into details but we didn't manage. It's of course feasible in the ideal scenario, e.g. successful print that is mostly rigid with attachment and support to the plate that requires just the right amount of force. It can be done. Now doing that in a realistic set of scenarii that a 3D printing house would do... well maybe it's feasible but I professionally didn't know (and still don't know) how to do it with a realistic set of constraints (time wise, economically speaking too).

Moving on then to the other end: hand (sorry for the pun) tracking in VR is good, honestly. It's quite fun for games... but when trying to do so in a professional scenario, a 1mm difference or an occlusion for 0.1s is not acceptable anymore.

TL;DR (sorry I have to run and it might be longer than what you even asked for!) : the concept itself is obviously good, especially from a programmer standpoint. We are experts at automating, in fact I'd argue that's the 1 thing we excel at. The implementation though, in real life, is much harder than we naively consider, even with a LOT more computing power.

TL;DR (short): quick wins, yes, harder wins... not intractable but at least beyond my own ability.

•

dracotomes 6 days ago

Very interesting article. I'm looking to get my first robot arm soon-ish, probably something in the SO-101 category. Can someone get reasonably far using recorded sessions compared to training in a simulated environment¹? Do you have any experience with third-party or DIY attachments for robot arms? I assume it's going to be more difficult for something like the Ufactory arm vs. the open models.

¹https://blog.comma.ai/mlsim/

•

mplappert 6 days ago

I'm only starting down this road but my sense is that ACT and Diffusion Policy both make it pretty feasible to start on real data only. LeRobot also makes it easier to train these. But that's the next step that I'm working on, so I don't know yet.

On attachments: during this project I really wanted a 3D printer several times. So that's probably next on the shopping list.

The Ufactory arm is actually quite extensible: it exposes digital input/output and you have a standard wrist mount where you can mount different end effectors or attachments.

•

sails 6 days ago

Love this, I’m playing around with the cheapo esp32+servos version of this, super fun.

Something I’m working on is a hardware CLI for agents to run experiments, with a “CICD” pipeline that validates everything and means I can delegate more of the experiments to the agents. I wonder if you have any thoughts on this?

The idea is to allow the coding agent to run the full loop of experiments and validations, with vision, audio, button pressing, speaking etc to interact in place of the human

•

mplappert 6 days ago

Very cool!

Have you seen the recent nvidia thing? They do this at scale for robotics manipulation: https://research.nvidia.com/labs/gear/enpire/

•

sails 6 days ago

Ah no I hadn't seen that, very interesting.

I'm finding a gap just before running those experiments.

The process of updating firmware, doing basic smoke tests on each device and validating it is live, and can function correctly.

Basically the pre-deployment green light that you get on github, but for hardware.

Have you seen or thought about that at all?
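To make the "green light" idea concrete, a tiny sketch of what I mean: a set of named checks per device, where the overall status is green only if every check passes. The check names and the stand-in checks are hypothetical; real ones would flash firmware and talk to the device.

```python
def run_smoke_checks(checks):
    """Run named checks; a check passes if it returns True without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # Any error (timeout, lost connection, ...) counts as a failed check.
            results[name] = False
    return results


def green_light(results):
    """Overall status: green only if every check passed."""
    return all(results.values())


# Stand-in checks for illustration only.
def failing_check():
    raise TimeoutError("device did not answer")


checks = {
    "firmware_version_matches": lambda: "1.2.0" == "1.2.0",
    "device_responds": failing_check,
}
results = run_smoke_checks(checks)
```

An agent can then read `results` to see which check failed instead of only getting a red/green bit.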

•

mplappert 6 days ago

This seems indeed useful. I haven't seen this for robotics but I'm sure people need this for larger deployments (either for a distributed fleet or for an "arm farm"-like setup where there's many robots in one location for data collection / eval). Interesting idea!

•

sails 5 days ago

Great thanks for giving it some thought, I’ll share a video as it’s all quite an early concept but saving me a lot of time on the bench!

•

killix 6 days ago

[flagged]

•

dang 5 days ago

Can you please not post AI-generated or AI-edited comments to HN? It's not allowed here - see https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079.

Of course, it's impossible to know for sure what was LLM processed or not, but some of your posts (like this one) are getting classified that way.

•

modeless 5 days ago

VR controllers are significantly better than a SpaceMouse for teleoperation. Even if I already had a SpaceMouse I would buy a Quest 3s and switch to it, if I was doing more than a trivial amount of teleoperation. (As is typical with robot teleop setups, I would not wear the headset on my head. It's merely a dongle to track the controllers).

•

mplappert 5 days ago

Yeah I can see that. Especially the rotation part is pretty awkward (hence why I currently do the axis parallel to table surface trick and only use the yaw angle to rotate the gripper). The space mouse was super easy to get started with though, so in terms of getting something up and running quickly it has an advantage.

•

thomasikzelf 6 days ago

Nice, I will be following your posts! I just bought a robot arm myself, the seeed studio B601DM (€1500 6+1 axis), it works great and is open source hardware as well and a bit more solid than the so101. I also opted to not use ros, I don't want to give up control by putting another framework in between. Is your plan to see what's possible right now or do you also have ideas on how to improve sota?

•

mplappert 6 days ago

Oh very cool! Looks a bit like the TRLC-DK1 (I was looking at this one for a bit).

I think pushing the sota is quite hard to do solo but we'll see. Mostly I want to get back up to speed after having not done much robotics during the last 6 years. Best way for me to learn is to just do it, so here we are. We'll see how far I get (I suspect at some point compute will be the main bottleneck)

•

thomasikzelf 6 days ago

It looks like one stole the design from the other, I don't know which one, haha.

•

wxw 6 days ago

Great article. I'll be following along. Would like to learn more about the robotics space.

- I've heard the advantage of ROS besides the architecture is the ecosystem (driver integrations, etc). Is that not an issue because the arm supports a Python SDK OOTB?

- Any issues you've been running into with this setup?

- How do you determine if a session recording is good enough for training? Is 50/100 samples really all you need?

•

mplappert 6 days ago

Glad you like it!

Re your questions:

- The driver situation turned out totally fine; I intentionally picked HW with good python sdk support so that was very painless.

- The static camera (the C920) is not super great; it drops frames and sometimes cuts out. We’ll see how that goes but it’s probably the first thing I want to swap right now. Another issue is the reach of the arm when forcing the wrist to be axis-parallel with the table; you cannot get very far away. The chess setup demo in the video gives an example: I can just reach the row of pawns and beyond that it’s out of reach.

- I don’t know yet! The 50-100 figure comes from the ACT and diffusion policy papers but it depends on the type of task. For fine tuning my sense is that you only need a few hours worth of demos to get good results with pi0.5 etc. A big reason I’m doing this project is that I want to try all of this myself, so the next posts definitely will talk about that.

•

b89kim 5 days ago

I can confirm that 50-100 demonstrations are enough for fine-tuning pi0/pi05. I did research with ALOHA and a humanoid. It works from 20~40 episodes (5~10 min), but the success rate would be 70~80%. The pi0 tech paper suggests using over 1~4 hours of data. I could get a 95% success rate for pick & place with 1 hour of humanoid data. Anyway, the hours required for a good success rate depend on the generality of the data. Long-horizon tasks over 5 min do not work as in the paper because PI removed the high-level (subtask) reasoning part in the released pi05.

•

andned 6 days ago

I had a very similar setup. Really happy with the xarm 6 lite. I played around with the diffusion policy paper experiments and was thinking to buy a webcam as a top camera as well but I ended up buying two intel realsense ones because of the timestamp drift issues. How did you solve that? Or is camera feed syncing not necessary for your intended projects?

•

mplappert 6 days ago

I timestamp everything twice: once with the hardware clock (if available, like for the realsense camera) and once within my robot stack once it gets read from the device (using `time.monotonic_ns()`). Both are stored and alignment can happen with either timestamp. I think the 2nd timestamp is actually more meaningful since ultimately I want to reconstruct the state that the policy would've seen; so if one modality is delayed I should actually include that effect during training.

That being said, I might switch to a realsense for the static tabletop camera as well; the realsense wrist is clearly much more reliable than the cheap Logitech C920 that I currently use.
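A minimal sketch of the double-timestamp scheme described above, not the actual code from my stack. The `Stamped` class and the `stamp` / `latest_at` helpers are names invented for this example; only `time.monotonic_ns()` is what I really use.

```python
import bisect
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Stamped:
    """One sensor reading with both timestamps kept side by side."""
    payload: Any
    host_ns: int                # taken when the stack reads the sample
    device_ns: Optional[int]    # hardware clock, if the device provides one


def stamp(payload, device_ns=None):
    """Attach the host-side monotonic timestamp at read time."""
    return Stamped(payload=payload, host_ns=time.monotonic_ns(), device_ns=device_ns)


def latest_at(samples, query_ns):
    """Return the newest sample whose host timestamp is <= query_ns.

    This mimics what a policy would have seen at query time: a delayed
    modality simply shows up as an older sample. `samples` must be sorted
    by host_ns.
    """
    times = [s.host_ns for s in samples]
    i = bisect.bisect_right(times, query_ns)
    return samples[i - 1] if i > 0 else None


# Hand-written timestamps so the example is deterministic.
frames = [Stamped("frame0", 1_000, 900), Stamped("frame1", 2_000, 1_850)]
seen = latest_at(frames, 1_500)
```

Aligning on `device_ns` instead would use the same lookup with the other field, skipping samples where it is `None`.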

•

robotresearcher 6 days ago

Both timestamps are useful in different ways. The early-as-possible hardware stamp is best for reasoning about causality, while the later-and-full-o-jitter middleware stamps are good for compensating for that inevitable jitter.

Time is one of the hard problems in robots, because they are inevitably but non-obviously distributed systems.

Robots are annoyingly, wonderfully difficult.

•

dlt713705 6 days ago

As impressive as this setup may be, I'm still amazed at how slow this type of robot is, whether amateur or professional grade. I have no expertise in this field, but as an observer, the apparent progress in this area seems very limited. I guess my expectations are too high and my understanding of the problems to solve is too low.

•

mplappert 6 days ago

It’s partially my fault: I currently clip the max speed _and_ I only input soft control changes when teleoperating to avoid crashing into things. The robot itself could definitely move more quickly than what you see in the video.

It would be interesting to explore how RL can be applied on top of my (flawed) human demos to optimize beyond what I’m able to do.
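To illustrate the two limits mentioned above (a speed cap plus gentle changes), here is a sketch for a single axis. In my case the soft changes come from how I move the SpaceMouse, but the same effect can be enforced in software with a rate limit. The function name and the numeric limits are assumptions for the example.

```python
def limit_command(target, previous, max_speed, max_delta):
    """Clip a velocity command, then limit how fast it may change per step."""
    # 1) Hard cap on speed.
    clipped = max(-max_speed, min(max_speed, target))
    # 2) Soft control: move at most max_delta away from the previous command.
    delta = max(-max_delta, min(max_delta, clipped - previous))
    return previous + delta


# The operator slams the stick to 1.0; the command ramps up, then saturates.
command = 0.0
history = []
for _ in range(5):
    command = limit_command(target=1.0, previous=command, max_speed=0.3, max_delta=0.1)
    history.append(round(command, 3))
```

Raising `max_speed` and `max_delta` is what would make the robot look faster in the video.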

•

forrestthewoods 5 days ago

> And one I'm more confident about but expect disagreement on: not building on ROS 2 / LeRobot,

Tell me more! I am slightly biased in that direction. But can’t fully justify it at this point.

•

whiplash451 6 days ago

How does Lerobot prevent « full control » and « understanding »? I thought this was an open source library.

I am not an official supporter of the library but am asking out of curiosity.

•

mplappert 6 days ago

For understanding: I think the level is much deeper if I write the code vs reading someone else’s. Same applies to coding agents of course, which is why I wrote most of it myself and only delegate some tasks (for example codex was a great help at setting up telemetry dashboards or writing the custom glfw renderer).

On control: LeRobot will change all the time and I’ll be unaware of what changed. If something suddenly doesn’t work anymore, it’s a pain to find out. I can of course fork or pin but that defeats the purpose a bit.

At the end it’s also partially just preference: I wanted to write this layer myself, and I have opinions about how it should be architected, so I did.

•

bjt12345 5 days ago

I hate whinging but why isn't stuff like this moved higher on HN's front page? This is a great article yet I keep seeing world politics and other matters rated higher - stuff that (unlike this article) will age like milk.

•

modeless 5 days ago

HN doesn't seem interested in robotics generally. You'll see it from time to time but the vast majority of great stuff never makes it to the front page. It's a shame, especially considering (as you point out) the constant political stuff (both overt and subtle) that does make it on here.

•

gessha 5 days ago

It could be that polarizing threads end up more actively voted on compared to “cool stuff” threads. We’re not immune to social network effects after all.

•

modeless 5 days ago

Yeah, that's an issue only admins can address. I would prefer much more aggressive enforcement of the rules about politics, and maybe some rule changes.

•

gessha 4 days ago

I built a very crude app sourcing comments from the HN firehose, classifying them using Gemma 4 and a prompt made from the comment guidelines. It had some amusing results.

The app did a decent job at surfacing problematic comments that a mod can do something about.

It was cool to optimize llama-cpp arguments for throughput. During the slightly off-peak hours the post processing was pretty much real time. I suspect a second 3090 would’ve been enough for peak posting hours too.

•

modeless 4 days ago

Yes I think there's huge potential in this direction. Forum moderation could be made far more scalable this way. If I didn't have another project I was already working on I'd be trying it right now.

•

timsuchanek 6 days ago

This is really exciting. Incredible that you can do this for this budget at home. Unthinkable a couple years ago.

•

mplappert 6 days ago

Thanks Tim and fun seeing you here :)

•

MrRobotics 6 days ago

Fascinating article. Keep up the work, Matthias!

•

blt 6 days ago

ROS sucks, good move. Too complicated

•

mplappert 6 days ago

That was my take 8 years ago, glad to hear it’s still that.

•

laxpri 6 days ago

whats the good alternative

•

laxpri 5 days ago

wow good post. Thanks for explaining everything in dam simple words .

•

gaolei8888 5 days ago

This is so cool.