What Humanoid Robots Can't Grasp
Robotic start-ups still need to overcome fundamental challenges that they're not being honest about.
We've seen a surge in humanoid robotic technologies. Many start-ups have popped up in this area, and VC investments have climbed into the billions. It feels like not a week goes by without a start-up releasing a demo video of a robot doing crazy flips or performing an impressive task. If you're not a roboticist, it would be fair for you to think that the dawn of humanoid robots was upon us.
However, robotics start-ups are not being candid about the very real technical challenges that still exist in this space. There are fundamental challenges that robotics researchers need to overcome before these humanoid robots can go from the lab to the factory floor, or even your home. Furthermore, humanoid robotics suffers from the same dilemma that large language model chatbots do: is there actually a demand for these products at a price that makes them a sensible business? In this piece I’ll explain the recent rise of humanoid robots, and why they’ve made seemingly exponential progress. Then we’ll discuss the technical limits of current approaches and whether there is actually market demand for humanoid robots.
Why Humanoids & Why Now?
Humanoid robotics has been making impressive strides since Honda’s humanoid robot, Asimo, fell down the stairs. Since that blunder, roboticists have nearly perfected robotic locomotion, demonstrated robots that can make your bed, and created robots that can respond to human instruction in real time. These massive advancements in robotics were not accidental. They represent a shift in how robotics problems are approached and in the research specializations that now dominate the field.
If you were a roboticist before 2015, you likely entered the field from a mechanical engineering background. This engineering discipline was very successful in bringing us industrial robotics, the subset that focuses on applying robots in manufacturing and warehouse environments. These engineers and researchers applied forward and constrained inverse kinematics, the mathematics that relates a robot’s joint angles to the position of its hand or tool, to successfully and efficiently control robotic arms for repetitive and well-defined operations. However, these same approaches translated less smoothly to dynamic and haphazard environments and objects. If the object is unknown or the area of operation needs to be explored and understood, the methods used by mechanical engineers tend to be computationally intensive beyond practicality. This is why demo videos from this era are typically played at 4x or 8x speed: multi-segment robotic arms take a long time to calculate kinematic solutions in unmapped 3D spaces.
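To make forward and inverse kinematics concrete, here is a minimal sketch for a flat, two-segment arm. The function names, the segment lengths, and the two-joint setup are illustrative assumptions, not a description of any particular industrial robot. Forward kinematics turns joint angles into a hand position; inverse kinematics goes the other way and may find that no solution exists.

```python
import math


def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """Return the (x, y) position of the arm tip for joint angles in radians."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """Return one (theta1, theta2) pair that reaches (x, y), or None if unreachable."""
    dist_sq = x * x + y * y
    # Law of cosines gives the elbow angle from the distance to the target.
    cos_theta2 = (dist_sq - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_theta2) > 1.0:
        return None  # target is outside the arm's reach
    theta2 = math.acos(cos_theta2)  # one of the two mirror-image elbow poses
    theta1 = math.atan2(y, x) - math.atan2(
        l2 * math.sin(theta2), l1 + l2 * math.cos(theta2)
    )
    return theta1, theta2


angles = inverse_kinematics(1.2, 0.5)
print(angles, forward_kinematics(*angles))
```

Two joints in a plane have a closed-form answer. Each added joint, obstacle, or unknown in the scene removes that luxury and pushes the problem toward numerical search, which is where the slow demos come from.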
Two shifts began in the early 2010s that would free robotics research from the limitations of classical methods. First was the collaborative development of open-source technologies. Much like the early web application space, the robotic software ecosystem started to see several open-source projects become mainstream in the 2010s. The Robot Operating System (ROS), a framework widely used for building, simulating, and testing robotic systems, saw its first stable release in 2010. In the locomotion (i.e., walking) space, ETH Zurich's Legged Robotics group has open-sourced software to control quadruped robots. More recently in imitation learning, the field of robotics that attempts to "teach" robots behavior through mimicking human actions, labs at Stanford have open-sourced projects like Universal Manipulation Interface (UMI).

All of these technologies have served to make the field of robotics more accessible to researchers. But they have also helped launch several start-ups. The work by ETH Zurich’s Robotics group is the bedrock for ANYbotics, the Swiss robotics company behind the ANYmal quadruped. UMI has inspired the solutions being developed by companies like Generalist AI and Sunday Robotics.
The second major shift in the robotics space has been the increased presence of researchers from non-mechanical engineering backgrounds. Open-source technology made robotics more accessible to specialists from machine learning, computer vision, and even motion capture disciplines. Reinforcement learning, a subset of machine learning that involves training algorithms through a reward function, has been particularly useful in robotic locomotion. Computer vision coupled with imitation learning has been a common technique for helping robots understand objects and select pre-trained grasps for those objects. Yet, as with many alternative approaches in engineering, the new solutions introduced by AI researchers represent a set of trade-offs rather than a universal and optimal solution.
The Existing Gaps
It can be easy to look at the current crop of start-ups that are showing off dancing and acrobatic robots and extend their success to all robotic problems. Boston Dynamics has been successful in robotic locomotion for over a decade, and today bipedal locomotion seems to be a problem solved by many robotics companies including Figure, Unitree, and Agility Robotics. Even Tesla’s Optimus was walking within a year of the product’s announcement. But if you look closely at many of these robots, you’ll notice that their hands are often just rubber balls or go completely unused during the demo.
While the locomotion problem seems to have been mostly solved, other areas of robotics research have lagged. But to understand why, we should start by learning why locomotion has been successful. One reason is that researchers have found that proprioception is a useful mechanism for controlling a robot’s balance. Proprioception is the body’s “sixth sense” for understanding its own position and orientation in space. It’s why when you’re balancing on a curb, your upper body instinctively tilts away from the street. Your body subconsciously knows which muscles to flex and adjust to avoid a fall. Similarly, roboticists have found that fusing readings from a robot’s internal orientation sensors and training the robot to adjust its motors in response produces a robot that is excellent at balancing and walking.
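The balancing loop can be illustrated with a toy simulation. This is a minimal sketch, not how any of the companies above do it: modern walking robots typically learn this mapping through training, whereas the example below swaps in a hand-tuned proportional-derivative rule, and the function name, gains, and pendulum model are all assumptions. The point is only that the controller needs nothing beyond the robot’s own sensed tilt and tilt rate.

```python
import math


def simulate_balance(initial_tilt=0.2, kp=40.0, kd=8.0, steps=2000, dt=0.005,
                     gravity=9.81, length=1.0):
    """Simulate an upright pendulum steadied by feedback on its own sensed tilt.

    Returns the final tilt angle in radians (0.0 means perfectly upright).
    """
    tilt, tilt_rate = initial_tilt, 0.0
    for _ in range(steps):
        # Internal readings only: how far the body leans and how fast it is tipping.
        correction = -kp * tilt - kd * tilt_rate
        # Gravity pulls the body further over; the motor correction pushes it back.
        accel = (gravity / length) * math.sin(tilt) + correction
        tilt_rate += accel * dt
        tilt += tilt_rate * dt
    return tilt


print("with feedback:", simulate_balance())
print("without feedback after 1 s:", simulate_balance(kp=0.0, kd=0.0, steps=200))
```

With the correction switched off, the same starting lean grows until the body topples. Nothing in the loop refers to anything outside the robot, which is exactly why the approach does not carry over to grasping.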
Ironically, robotic locomotion is a simpler problem to solve because walking is influenced by fewer external variables. Proprioception is useful for locomotion because walking depends mainly on internal variables of the robot. Robots can assume that the ground will stay still and unchanged regardless of how they act. For grasping, by contrast, you need to understand both the robot’s hand and the object being manipulated. Grasping and manipulation involve a constant feedback loop that calculates the next position of a robot’s fingers, all the ways the object will respond to such a position, and in turn, how the fingers should respond to the object’s behavior.
In classical mechanical approaches this involved traversing an exponentially growing decision tree that had to be recalculated every time something didn’t go according to plan. Those methods also required having a strong understanding of the properties of the object being manipulated. The material of the object affects friction, a variable that locomotion systems can largely treat as constant across environments but that grasping systems cannot. If the manipulated object contains internal moving components, such as items in a package or water in a bottle, the robot has to adjust for a shifting center of gravity. The ideal scenario would involve having as much internal data about the manipulated object as we do about the robot itself. Unfortunately, the objects in the world don’t come with internal sensors that feed data back to a robot.
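A back-of-the-envelope calculation shows how quickly such a tree grows. The numbers here are hypothetical, chosen only to illustrate the shape of the curve: assume the planner considers 10 candidate finger moves at each step and looks a fixed number of steps ahead.

```python
def plans_to_evaluate(moves_per_step, horizon):
    """Count the distinct move sequences in a full decision tree of the given depth."""
    return moves_per_step ** horizon


# Hypothetical planner: 10 candidate finger moves per step.
for horizon in (2, 4, 8):
    print(horizon, "steps ahead:", plans_to_evaluate(10, horizon), "sequences")
```

Looking eight steps ahead already means a hundred million sequences, and the whole tree has to be rebuilt whenever the object slips or shifts unexpectedly.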
The complexity associated with humanoid grasping is why robotics companies love to show impressive demos of robots dancing and walking in public, but they rarely do the same with grasping and manipulation. There are many impressive demo videos of robots operating in controlled environments and manipulating simple objects. Figure has published several demo videos of their robots interacting with variations of cylinders, spheres, and rectangular prisms. To their credit, these are impressive videos, but we’ve never seen the same robot enter a random house that it has never been in and brew a cup of coffee (to quote Steve Wozniak).
Fundamentally, efficient and generalizable human-like grasping is an unsolved problem. It’s unsolved in much the same way that nuclear fusion and fully autonomous driving are unsolved. In both of these examples, we have demonstrations of these technologies working on small scales or in specific conditions. The National Ignition Facility achieved fusion ignition in 2022, but the technology has not been added to any power grid. Self-driving cars have achieved impressive results, yet robo-taxi companies have had to scale back ambitions to focus on achieving high success rates within specific cities. Similarly, there are robotic grasping solutions that make good demos and prove concepts. But there is yet to be a humanoid robot that can adapt (i.e., generalize) and perform at the speed (i.e., efficiently) that a human can.
As mentioned earlier, grasping research led by mechanical engineers tended to produce fairly adaptable and comprehensive solutions that were ultimately slow. One of the roboticists I interviewed in preparation for this essay referred to this as the “worst case scenario solution”. Theoretically, the robotic arms from this era could be placed in a random setting and given a random object, and they could devise a solution for picking and placing the object. These approaches did not assume that the robots had a pre-trained understanding of their environment or the objects they were interacting with.
This is the main shift we’ve seen with the new approaches devised by machine learning and computer vision researchers. The robots use AI training methods to develop an understanding of objects and how to interact with them. This involves quite literally preparing thousands of simulations that demonstrate to the robot how to (and how not to) pick up objects. With these approaches the robots have a better understanding of the objects. For example, with the mechanical approach, a robot that is attempting to pick up a cup would just view the object as an exotic donut. In the machine learning world, the robot recognizes that a cup is a cup and that there are certain ways you do and do not pick up a cup.

The same roboticist described this approach as the “best case scenario solution”, because the approach hopes that the robot’s understanding of an object can be generalized to other similar objects. After all, how different is one cup from another? Well, this is a real limitation of the approach. AI solutions require constructing vast data sets that hope to capture the breadth of possibilities that a robot may encounter. However, they don’t guarantee that a robot will be able to form a grasp on every object. This is an optimistic approach that hopes for “the best case scenario”. But fortunately, the best case scenario may cover about 90% of cases.
The AI approach also results in simple grasps. If you observe most operations by Figure’s humanoid robots, the hand is usually performing a simple enclosed grasp around the object. A recent Figure demo video demonstrates this simple yet unrealistic grasp. The robot opens a door by pressing the door handle down with the thumb joint. A human would actually open this door by grabbing the handle and twisting their wrist. The simple grasp is a safe bet for 90% of objects. But surrounding the object with your hand will not work with many tools, such as scissors or chopsticks. Ultimately this is due to the robot having a limited feedback loop with the manipulated object.
Moving past grasping, roboticists I spoke with expressed even greater concern over the “brain gap”. Robots are limited in their understanding and interpretation of the world. A truly useful humanoid robot needs to not only recognize objects but also understand their broader contexts. How does a robot handling a knife know that this is a dangerous tool and not just a triangle attached to a rectangular handle? How does a robot that learns through imitation know that what it sees on the TV is not something to imitate? How does a robot that is handling a knife, with a mystery thriller playing on the TV and a hallucination-prone model in its head, know not to act out the television scene on its owner?
I’m being a bit of an alarmist, but the point stands. The way many robots are trained today is akin to providing a video game character with a limited set of animations. The character can wave, pick up a cup, or sweep the floor. But it doesn’t understand why people greet each other, how humans use cups, or that a floor is swept to remove dust. The roboticists I spoke with expressed doubt that any research group has even come close to solving this brain gap.
Is There Even A Market?
Ultimately, for a humanoid robot to succeed as a product, it needs to fill a demand at a cost that consumers find reasonable. Most humanoid robotics companies advertise their product as serving either the manufacturing space or the domestic space. There is good reason to be skeptical that these robots will find demand in either market. Let’s take a look at both applications.
In manufacturing today, robots are deployed in three different environments. Robots are used in parts of the manufacturing line where the object is so large and heavy that a human cannot manipulate it. They are used where the object is so small, and the work demands such precision and speed, that a human cannot do it with their bare hands. Finally, robots are used in environments that are hazardous to human health. In all of these settings today, robots expect a highly constrained and predictable environment.
As you might be able to guess, humans are still needed for the area between large and microscopic, and where the air is still breathable. For example, while I was at Tesla we had a section of the car manufacturing line called “general assembly”. This is the part of the assembly line where wiring is installed through the car, panel lining is clipped into place, and decals are stuck to the car. This is done by human hands wearing PPE gloves. Part of the reason humans are used is that the car body is already painted, and a human is less likely to scrape it. But more importantly, this is a fairly active and dexterous process. It’s not simply a pick-and-place operation. Technicians have to duck through the car’s door frame, turn their torso 30 degrees, look up at an angle, and guide the wire harness through a hole. And several techs are doing different parts of this assembly at once. Each one has to do this fast, typically in 20 to 30 seconds. Getting a robot to do such an operation repeatedly, without damaging the remainder of the car, would truly be an engineering marvel. However, it would also require operations that simply no humanoid robot has even tried to perform.

People advocating for humanoid robots in manufacturing settings greatly underestimate the amount of strain current industrial robots experience and how often they need to be repaired. Industrial robots are responsible for performing the same operation over a thousand times per day. This has a real impact on their motors and hardware. When this hardware malfunctions, as it routinely does, repair should be quick and easy. Industrial robots, much like consumer cars, are designed to be serviceable by dedicated onsite engineers. The more complex your robot’s form factor is, the more expensive these repairs will be. A humanoid form factor introduces an order-of-magnitude increase in complexity.
The common argument I hear in favor of humanoid robots is adaptability: humanoid robots can rapidly respond to changes in the manufacturing process. This is not the strong argument people think it is. Assembly lines have been intentionally designed to be rigid since Ford popularized them. That rigidity is what keeps them simple and easy to repair or replace. Adaptable thinking machines might be more useful in an office setting where processes are more loosely defined, and judgement often has to be used to know when to break from them.
The domestic application for robots will come down to two things: functionality and cost. I’m going to focus on cost because I’ve already expressed enough skepticism towards their functionality. Humanoid robots will be expensive. We don’t have a domestic consumer-grade humanoid robot today to know exactly how expensive. The closest example is Unitree’s G1, which costs $13,500. But it’s a fairly dumb robot and can do little more than walk and loop through animations. It also has a limited battery life of under two hours. The smartest robot available on the consumer market is probably Boston Dynamics’ robotic dog, “Spot”. Spot’s base model costs $75,000. However, this stand-alone kit can do little more than go for a walk while being remote controlled. If you want to attach a robotic arm or a LIDAR to Spot, you need to cough up $150k, or the price of a souped-up Porsche 911.
Perhaps economies of scale and various financing schemes will be able to lower the price of such a humanoid robot. But I still suspect that this will always be a luxury product for tech enthusiasts. Much like a car, I doubt such a humanoid robot will come at a flat fee. The robot will require tune-ups. Motors will decay, sensors will need to be recalibrated, and parts will have to be replaced. Just you wait for all the DLCs and subscriptions the robot companies will want you to purchase. “Sorry, doing the dishes is actually part of the ‘Kitchen & Cleaning’ expansion pack.”
Okay, so the average Joe Schmoe won’t be liberated from the drudgery of loading the dishwasher. But maybe this will be a hit in the luxury market. This is an argument I could see. I’m not in the tax bracket where I hire domestic cleaning services, but I could see a world where this robot is more cost effective in the long run than hiring a cleaning staff.
After all that, I return to my original thesis: the form factor of a humanoid robot will fail to compete with the human body, the marvel that evolution produced over the course of tens of millions of years. For most manufacturing tasks that require a human today, we will continue requiring a human tomorrow. In cases where a robot is successful, we will find that a single robotic arm and a camera do the job better and more cheaply than a humanoid form factor. In domestic environments humanoid robots will only be able to clean the “best case scenario”. They will successfully clean 90% of your apartment, but you will still find a dust bunny rolled up in a corner, your dishes will be covered in small smudges, and for some reason only half your table will be cleaned. The humanoid robots are not coming for your job.



