Project description
Robotics in the agricultural sector has great potential to simplify tasks, reduce fatigue for laborers, and perform repetitive tasks persistently. Robotic fruit harvesting involves tasks such as fruit detection, grasping, and packaging. This project aims to add value to robotic fruit harvesting by developing and testing novel, smart vision-based solutions. Deploying robots for fruit picking remains challenging because of the uncertainties in the environment. The scope of the work is narrowed down to fruit localization, grasping, and placing. The project targets validation of the fruit grasping approach in simulation and in experiments; however, the scope is limited to vision-based robotic manipulation. State-of-the-art methodologies in robotic fruit harvesting follow a two-stage process: first the fruit location is identified, and then further processing is applied, such as fitting a sphere to the fruit point cloud, estimating the centroid, estimating the fruit axis, and so on. The existing challenges include occlusion, fruit shape estimation, interference from leaves and branches, and slippage. The project is carried out in collaboration with the Institute of Mechanism Theory, Machine Dynamics and Robotics (IGMR), RWTH Aachen, as part of a thesis. The next step is to combine the autonomous manipulation with autonomous navigation to perform fully autonomous fruit harvesting. Subsequent sections outline the objectives and their descriptions.
Related terms
A robot structure is a combination of electrical, mechanical, and electronic parts. In general, a robot structure has four basic systems: sensors, actuators, motors, and controllers [Ref.]. Kinematic models in robotics are used to study the motion of the robot system and the interactions between its parts. A few essential terms are used in robotics to define such models, and understanding them is important.
- Links, joints, and end effector: A robot structure is a multi-body system modeled as rigid bodies, called links, connected by joints, which enable movement. The end effector is the outermost point of the last link in the robotic link structure [Ref.], [Ref.], [Ref.].
- Operational and joint space: The operational space describes the pose of the end effector, whereas the joint space describes the joint values that bring the robotic structure to a goal position [Ref.].
- Forward Kinematics and Inverse Kinematics (IK): Forward kinematics computes the end effector pose from the joint values, whereas inverse kinematics computes the joint values required for the robotic structure to reach a desired end effector pose [Ref.], [Ref.].
- Image and colorspace: In the vision domain, an image is one or more matrices of numbers representing color [Ref.]. In the RGB colorspace, all colors are represented by a combination of red, green, and blue matrices, referred to as channels. An image with a single channel is referred to as a grayscale image. Each channel of RGB is a 2D matrix of encoded numbers, called pixels, so in the RGB colorspace an image is a combination of red, green, and blue matrices of pixel values. Mathematical operations on these matrices are used to perform tasks like edge detection, noise removal, and color detection.
- Convolution: Convolution is a mathematical operation, the integral of the product of two functions with one of them reversed, which describes how the two functions are related. In the vision domain, an image matrix is multiplied element-wise with another, smaller matrix called the kernel and the products are summed. The kernel traverses the whole image to perform the calculations and generate a new matrix as the result of the operation [Ref.] (a worked sketch follows this list of terms).
- Deep Learning (DL), Neural Network (NN) and Convolutional Neural Network (CNN): Deep Learning (DL) is the deployment of artificial neurons to learn the features within data. In simple terms, DL stacks multiple layers of Neural Networks (NNs) to learn the key features of the data that help predict the output for a given input. Machine Learning models also learn the mapping between input and output data but are not restricted to NN algorithms, so DL is a specialized field of Machine Learning. The standard working unit of a Neural Network is the Perceptron, which has inputs, weights, and a bias. It computes its output by multiplying the weights with the inputs and adding the bias, and it learns the input-output mapping through backpropagation of gradients, which updates the weights and bias over iterations (see the sketch after this list). Several Perceptrons arranged in a layer form a perceptron layer. A standard NN consists of an input layer, hidden layers, connections between the layers, activation functions, and an output layer, and it learns the mapping between input and output by optimizing the weights and biases of its Perceptrons. For image data, Convolutional Neural Networks (CNNs) have been quite successful at learning the mapping between an input image and computer vision tasks like classification, object detection, or segmentation [Ref.].
- Object classification, detection, and segmentation: In the object classification task, the output layer of a Neural Network (NN) generates a single encoding or number, which is mapped to a class like dog, apple, etc. The object detection task additionally requires the location of the object: besides the classification encoding, the NN is trained to output two pairs of (x, y) coordinates defining a box around the object for anchor-based detection, or a single (x, y) pair for the object center in anchor-free detection. The segmentation task classifies each pixel of the image as belonging to a class type; the image undergoes convolutions and upscaling to generate a matrix output, and during training the NN adjusts its weights and biases so that pixels of the same class receive the same value. Instance segmentation additionally distinguishes how many instances of a class are present, whereas standard (semantic) segmentation only requires identifying the class.
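As a minimal illustration of two of the terms above, the sketch below implements the sliding-kernel convolution and a single Perceptron in NumPy; the image, kernel, weight, and bias values are only examples, not values used in this project.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a (flipped) kernel over a single-channel image; valid region only."""
    kernel = np.flipud(np.fliplr(kernel))            # reversal from the convolution definition
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed by a sigmoid activation."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Toy grayscale image with a vertical intensity edge and a Laplacian-like kernel
img = np.array([[0, 0, 255, 255]] * 4, dtype=float)
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]], dtype=float)
print(convolve2d(img, kernel))                       # strong responses along the edge

# One Perceptron with example weights and bias (learned via backpropagation in practice)
print(perceptron(np.array([0.5, 0.2, 0.1]), np.array([0.4, -0.6, 0.9]), 0.1))
```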
Related Literature
A wide range of approaches has emerged for the vision-based robotic fruit harvesting task. To get a holistic view of the trend, the state-of-the-art works are categorized based on the problems they aim to solve during grasping and harvesting with robots:
- Occlusion-related works [Ref.] [Ref.] [Ref.] [Ref.]
- Segmentation improvement-related works [Ref.] [Ref.] [Ref.] [Ref.] [Ref.] [Ref.]
- Localization improvement-related works [Ref.] [Ref.] [Ref.]
- Novel architectures or approaches [Ref.] [Ref.]
Author, year | Methodology | Data Type | Key Innovations | Pros | Cons | Harvest / Grasp fruits |
Li et al. (2022) [Ref.] | Occlusion workaround | RGB-D | Frustum point cloud fitting | Robust against occlusion | Structured farm testing | Yes, Apples |
Gong et al. (2022) [Ref.] | Occlusion workaround | RGB-D, Infrared | Reconstruction with CNNs | Restoration of shape | Collision | Yes, Tomatoes |
Menon et al. (2022) [Ref.] | Occlusion workaround | RGB-D | Reconstruction with software | Less manual touch | Complicated | Yes, Sweet peppers |
Liu et al. (2022) [Ref.] | Occlusion workaround | RGB | Key point estimation | Circular bounding boxes | Not tested on robot | Yes, Tomatoes |
Yan et al. (2023) [Ref.] | Segmentation improvement | RGB-D | Transformer segmentation | Stem & Grasping key points | Not tested on robot | Yes, Pumpkin |
Kang et al. (2020)[Ref.] | Segmentation improvement | RGB-D | Dasnet | Fruit & branches segmentation | Obstruction from leaves | Yes, Apples |
Kang et al. (2020) [Ref.] | Segmentation improvement | RGB-D | Mobile-Dasnet and PointNet | Robust fruit point cloud | Obstruction from other fruits | Yes, Apples |
Kang et al. (2021) [Ref.] | Segmentation improvement | RGB-D | YOLACT (You Only Look at Coefficients) & PointNet | Robust fruit point cloud | Tested in structured farm | Yes, Apples |
Lin et al. (2019)[Ref.] | Segmentation improvement | RGB-D | Branch normals prediction | Fruit axis estimation | Occlusion affected results | Yes, Guava |
Lin et al. (2019) [Ref.] | Segmentation improvement | RGB-D | Gaussian Mixture Models | Adaptable for multi fruits | Not tested on robot | Yes, Citrus fruits |
Yu et al. (2020) [Ref.] | Object detection | RGB-D | Oriented Bounding Boxes | Stem orientation | False detections | Yes, Strawberries |
Onishi et al. (2019)[Ref.] | Object detection | RGB-D | Underside grasping | Damage free grasping | Vertical orientation only | Yes, Apples |
Chen et al. (2022)[Ref.] | Object detection | RGB | Vision based impedance | Damage free grasping | Planar surface grasping | No, Apples, Oranges |
Lin et al. (2023) [Ref.] | Grasping rectangle proposals | RGB | Shape approximation | Work for unseen objects | Planar surface grasping | No, Banana |
Chen et al. (2023) [Ref.] | Reinforcement learning | RGB-D | Soft Actor-Critic(SAC) algorithm | Learning in simulation | Planar surface grasping | No, Banana |
Developing the Approach
Following are the considerations and observations from the state-of-the-art works:
- Deep NN models have been used extensively for vision-based solutions.
- A two-stage approach is followed: a vision model working on depth-camera data, followed by motion planning with an IK solution for the joints.
- The first stage performs object detection or instance/semantic segmentation to isolate the fruit from the background in the image.
- The second stage involves point cloud processing or geometric techniques, such as fitting a sphere or frustum within the point cloud, to estimate the grasping pose. Processing of the complete scene point cloud is not done.
- Two or more NNs have to be trained, which requires considerable computation resources both for training and on the robot system.
- Considerable time is invested in data preparation for multiple networks. The data annotation formats usually differ, and objects have to be marked with bounding boxes or polygons on hundreds or thousands of images.
- Pose estimation relies on simplified geometric estimation such as sphere fitting, which is not applicable to cylindrical or other shapes.
- Some works rely on only one key feature, such as the stem for pumpkin grasping point estimation in [Ref.]; in occlusion scenarios, those estimations could not be computed properly.
Based on these observations, the proposed approach targets the following:
- Deployment of a single NN to obtain the grasping points, meaning fewer computations compared with two stages of computation.
- Less time and effort invested in dataset preparation, since annotations in different formats no longer have to be prepared for two different networks.
- Generalization of fruit harvesting with robots. Some approaches used axis information for grasping pose estimation; nevertheless, every fruit has unique points like center, stem, bottom, etc., so the idea can be expanded to multiple domains.
Research design
- Mask R-CNN
- CenterNet
- Detectron2
- Keypoint Detection and Feature Extraction for Point Cloud Registration (Kpsnet)
- OpenPose
- Simple Vision Transformer Baselines for Human Pose Estimation (ViT-Pose)
- Key.Net
- YOLOv8 (You Only Look Once version 8)
- Application feasibility: It tells whether the model could be used or adapted for fruits.
- Ease of use: It tells whether the model possesses desirable characteristics like high detection accuracy, ease of training, etc.
- Custom tuning feasibility: It informs whether training data for the model is available or whether the dataset would have to be large. Some works have not been shared publicly, which could impact the model selection.
- Novelty: It tells whether the model has already been used in similar works related to fruit grasping.
- A solid black ball represents that the parameter suits the requirement completely and scores one point.
- A half-black ball represents that the parameter might fit, as the information is not available or not comparable with the other models, and scores half a point.
- A solid white ball represents that the parameter does not fit and scores zero points.
Evaluation of models based on Harvey balls visualization and scoring
YOLOv8 Description and Training
YOLOv8 is a multipurpose NN model from Ultralytics [Ref.] that offers object detection, segmentation, pose estimation, tracking, etc. The YOLO series models are known for their good detection speed and user-friendliness [Ref.]. YOLOv8 pose detects the key points first and then associates them with human instances. Pose estimation models mainly fall into two categories [Ref.]:
- Top-down methods: These methods first detect the object and then estimate the pose of each object using key points or features. They have higher complexity but better accuracy.
- Bottom-up methods: These methods detect the key points in a single stage, mostly by estimating heat maps for the key points and relating them to the objects via grouping. They are faster than top-down methods but comparatively less accurate.
- The models are intended for human pose estimation with 17 key points and tiger pose estimation with 12 key points, covering features like eyes, nose, and joints. The skeletal structure output is enabled only for 17-key-point inputs [Ref.].
- Only one class of object can be detected at a time. The class can be changed by providing the encoding information in the YAML file.
- There are six architectures for YOLOv8 pose estimation models [Ref.] : YOLOv8n-pose (nano), YOLOv8s-pose (small), YOLOv8m-pose (medium), YOLOv8l-pose (large), YOLOv8x-pose (extra large) and YOLOv8x-pose-p6 (extra large with an additional layer).
- The model outputs a vector with the x, y coordinates and a confidence score or visibility value for each key point (see the parsing sketch below).
- The models are trained on the Common Objects in Context (COCO) pose dataset [Ref.], and the COCO dataset format and YOLO pose annotation format are compatible (a training sketch follows the dataset list below).
- Fruit Images for Object Detection: This dataset is available on Kaggle [Ref.] and is split into train and test sets. It contains images of apples, bananas, and oranges with XML annotation files for object detection, varying from oranges on a tree to oranges against a white background.
- Roboflow datasets: Two datasets from the Roboflow platform have been used [Ref.] [Ref.], with around 140 images in total. They contain images of oranges on trees and in the farm, damaged oranges with scabs, and some oranges against a white background.
- Images from camera: Images taken with the camera of the robot setup are also used for training, to tune and deploy the model effectively with the camera settings of the real harvesting task.
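A minimal training sketch with the Ultralytics API follows; the dataset configuration file name is a hypothetical placeholder that would list the image paths, the single orange class, and the five key points (top, center, bottom, left, right) used in this work.

```python
from ultralytics import YOLO

# "oranges-pose.yaml" is an assumed file name; it would contain the train/val image paths,
# names: {0: orange}, and kpt_shape: [5, 3] for the five key points with visibility flags.
model = YOLO("yolov8n-pose.pt")                       # start from the pre-trained nano pose weights
model.train(data="oranges-pose.yaml", epochs=200, imgsz=640)
metrics = model.val()                                 # evaluate precision/recall on the validation split
```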
Performance of the YOLOv8 pose nano model: the loss curves on the training and validation datasets show convergence, and the precision and recall curves come close to the annotated data.
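A hedged sketch of how the trained model's key point output described above can be read back; the weight and image paths are placeholders.

```python
from ultralytics import YOLO

model = YOLO("runs/pose/train/weights/best.pt")    # placeholder path to the custom-trained weights
results = model("orange_scene.jpg")                # placeholder test image

for r in results:
    boxes = r.boxes.xyxy.cpu().numpy()             # one bounding box per detected orange
    scores = r.boxes.conf.cpu().numpy()            # detection confidence per instance
    keypoints = r.keypoints.data.cpu().numpy()     # shape (instances, 5, 3): x, y, confidence
    for box, score, kpts in zip(boxes, scores, keypoints):
        print(box, score, kpts)
```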
Vision Models and Multiple Detections
The YOLOv8 object detection model is selected as the baseline for comparing the approach, since the system does not have an Nvidia Graphics Processing Unit (GPU) and the computations are performed directly on it. The object detection model outputs the center of the bounding box, while YOLOv8 pose outputs multiple key points along with the bounding box, making the two comparable. Two variants of the YOLOv8 object detection model are used: one with pre-trained weights and the other trained for 200 epochs on a dataset similar to the one used for YOLO pose. Three cases are compared:
- Fruit harvesting with YOLOv8 pose trained on custom dataset.
- Fruit harvesting with YOLOv8 pre-trained model.
- Fruit harvesting with YOLOv8 trained on custom dataset.
Comparison of goal position with baseline models.
- The instance with the highest bounding box probability is selected first as the target. Either the center or the bottom key point must be visible, and only these two key points are considered for the goal position.
- The key point is selected based on the confidence values of the bottom and center points with a threshold value of 0.5, so that the goal position in the proposed approach has a good confidence score. For the YOLOv8 object detection models, the center of the bounding box is selected.
- With the YOLOv8 pose model, the shape is estimated by averaging the distances between the key points as a width or diameter estimate (see the sketch below). With the YOLOv8 object detection models, the bounding box width governs the opening of the gripper in the methodology.
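A hedged sketch of the goal-point selection and shape estimation described in the list above, assuming the five key points arrive in a fixed top/center/bottom/left/right order (pixel coordinates; the conversion to 3D via the depth image is omitted). Instances would be processed in descending order of bounding box probability, and the first instance that yields a goal becomes the target.

```python
import numpy as np

CONF_THRESHOLD = 0.5      # confidence threshold for the center and bottom key points

def select_goal_and_width(keypoints):
    """keypoints: (5, 3) array of (x, y, conf); assumed order: top, center, bottom, left, right."""
    top, center, bottom, left, right = keypoints
    # Goal position: prefer the center key point, otherwise fall back to the bottom key point.
    if center[2] >= CONF_THRESHOLD:
        goal = center[:2]
    elif bottom[2] >= CONF_THRESHOLD:
        goal = bottom[:2]
    else:
        return None, None                              # no reliable goal for this instance
    # Shape approximation: average the key-point distances as a width/diameter estimate.
    width = np.mean([np.linalg.norm(right[:2] - left[:2]),
                     np.linalg.norm(bottom[:2] - top[:2])])
    return goal, width
```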
Multiple instances detection and goal position selection with YOLOv8 pose.
Methodology
Once the fruit location and width are available, IK and motion planning are performed with the MoveIt library in ROS. Considering safety and ensuring that the robot is ready and all systems are working, the following action sequence is executed (a sketch follows the list):
- Successful publishing of stabilized transforms for the orange key point or bounding box center from the key point detection node.
- Setting the robot arm to the harvesting-ready position, the picking pose, with a fixed set of joint values, which indicates that the robot is ready for the grasping task. The position of the target fruit or the robotic setup is changed if path planning fails to generate solutions.
- Turning the gripper on or opening the gripper.
- Traversing to the goal position and attempting a grasp by closing the gripper, or by suction with a vacuum gripper. In case of a failed grasping attempt, up to two more trials are made.
- Moving the robot arm to the fruit drop position, the placing pose, another fixed joint pose near the collection spot.
- Opening the gripper to put the fruit in a basket or collector box.
- Moving back to the picking pose and turning the gripper off, or closing it, once the task is completed.
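A hedged sketch of this action sequence with the MoveIt Python interface; the planning group name ("xarm5"), the named poses, and the gripper callbacks are assumptions that depend on the actual xArm5 MoveIt configuration.

```python
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("orange_harvest_sequence")
arm = moveit_commander.MoveGroupCommander("xarm5")     # assumed planning group name

def harvest_once(goal_pose, open_gripper, close_gripper):
    """One grasp-and-place cycle following the action sequence above."""
    arm.set_named_target("picking_pose")               # harvesting-ready joint values (assumed name)
    arm.go(wait=True)
    open_gripper()                                     # turn the gripper on / open it
    arm.set_pose_target(goal_pose)                     # fruit key-point or box-center transform
    if not arm.go(wait=True):
        return False                                   # planning failed: reposition target or setup
    for _ in range(3):                                 # initial grasp attempt plus two more trials
        if close_gripper():
            break
    else:
        return False
    arm.set_named_target("placing_pose")               # fixed joint pose near the collection box
    arm.go(wait=True)
    open_gripper()                                     # drop the fruit into the basket
    arm.set_named_target("picking_pose")               # return to the ready state
    arm.go(wait=True)
    return True
```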
Pipeline for robot grasping planning with YOLOv8 pose and object detection models.
Robot Setup
The following components have been used for testing:
- Ufactory xArm5 robotic arm
- Intel Realsense d405 depth camera
- Inspire Robotics right hand gripper
Usecase Simulation
The GitHub repository [Ref.] provides the URDF and the MoveIt packages, with the specifications and constraints already added. However, the geometric constraints limit its usage for autonomous operation. Trials were conducted with MoveIt and the xArm5 Software Development Kit in Gazebo and on the real arm; nevertheless, the IK solvers could not solve the configuration for all goal states. The following approaches were tried and tested:
- Adjusting solvers and planners: MoveIt provides a multitude of planners, such as Rapidly-exploring Random Trees (RRT) and Bi-directional Transition-based Rapidly-exploring Random Trees (Bi-TRRT), in the Open Motion Planning Library (OMPL). Switching between them did not solve the problem, as they did not generate solutions for the goal poses in most cases.
- Analytical solver plugin test: The IK-Fast plugin provided by MoveIt offers an option to integrate an analytical solver, which has to be generated on either Docker or an older version of ROS. The analytical solver failed in the Docker image and did not generate solutions due to the geometric constraints [Ref.].
- Approximation in joint space planning: A function was written and tested that excludes the first and last joints of the xArm5, since the other three joints move in one plane. The resulting joint states would then be used to perform the arm movement together with the other two joints. However, this assumption yields multiple IK solutions, and filtering them out is not a feasible option.
- Addition of joint tolerances for IK solvers: By adding some tolerance to the goal configuration and using position-only IK, the solvers generate good solutions for almost all goal poses [Ref.].
- Cartesian path planning: The Cartesian planning approach takes waypoints as input and plans the motion to reach the goal configuration. Without tolerances and position-only IK, it could not generate good solutions. By splitting the goal position into successive x, y, and z displacements with tolerances, the motion planning has worked fine (see the sketch below).
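A minimal sketch of the last two items, assuming the moveit_commander Python interface and the "xarm5" planning group name; the tolerance values and the success threshold are placeholders.

```python
import copy
from moveit_commander import MoveGroupCommander

arm = MoveGroupCommander("xarm5")                 # assumed planning group name
arm.set_goal_position_tolerance(0.01)             # placeholder tolerances enabling position-focused IK
arm.set_goal_orientation_tolerance(0.1)

def move_axis_by_axis(goal_x, goal_y, goal_z):
    """Split the goal into successive x, y and z displacements and plan a Cartesian path."""
    waypoints = []
    pose = copy.deepcopy(arm.get_current_pose().pose)
    pose.position.x = goal_x
    waypoints.append(copy.deepcopy(pose))
    pose.position.y = goal_y
    waypoints.append(copy.deepcopy(pose))
    pose.position.z = goal_z
    waypoints.append(copy.deepcopy(pose))
    plan, fraction = arm.compute_cartesian_path(waypoints, 0.01, 0.0)   # eef_step, jump_threshold
    if fraction > 0.9:                            # most of the path could be planned
        arm.execute(plan, wait=True)
    return fraction
```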
Usecase Hardware
- Support: The robot setup is planned to be mounted on a mobile platform for outdoor testing; for safety in indoor testing, the robot has been mounted on a fixed horizontal support.
- Robot start pose and fruit dropping pose: A set of predefined positions is defined for the fruit harvesting task. The picking pose is the ready state for fruit harvesting, with the palm open and the last joint of the arm horizontal to the ground. The placing pose is the fruit-dropping state, with the palm open and the last joint of the arm rotated by 90 degrees.
- Path planning method for goal position: Once the 3D transform frame of the fruit is published, the x, y, and z positions are split into three parts in the robot base coordinate system for the Cartesian path planning method: x-axis displacement, y-axis displacement, and z-axis displacement. For the pose planning method, the Bi-TRRT planner of MoveIt performed best during trials and is chosen for motion planning with tolerances to reach the goal position autonomously. The Cartesian planning method has been used to avoid the dependency on pose angles and the limited DOF due to the geometric constraints of the arm.
- Gripper closing control: A mathematical function converts the gripper opening into angles to control the closing of the gripper from the key point or bounding box width parameters (a sketch follows the list). In case of errors in the width transform publishing, the gripper is set to a half-closed configuration.
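A hedged sketch of the gripper-closing conversion described above; the width limits and angle range are illustrative placeholders, not the Inspire Robotics hand's actual values.

```python
def gripper_angle_from_width(width_m, min_width=0.02, max_width=0.09,
                             open_angle=0.0, closed_angle=1.2, half_closed=0.6):
    """Map the estimated fruit width (metres) to a gripper closing angle (radians)."""
    if width_m is None:                                          # width transform not published
        return half_closed                                       # fall back to a half-closed grip
    width_m = min(max(width_m, min_width), max_width)            # clamp to the gripper's range
    ratio = (width_m - min_width) / (max_width - min_width)      # 0 = smallest, 1 = widest fruit
    return closed_angle - ratio * (closed_angle - open_angle)    # wider fruit -> less closing
```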
Usecase Observations
The speed has been kept low during trials to avoid mishaps and damage to the robot arm and other components. The Cartesian planning method has been used for motion planning to avoid the dependency on pose angles and the limited DOF due to the geometric constraints of the arm. An average of 20 trials per model for each case, i.e. 60 trials per case across the three models, has been performed; in total, 180 trials have been conducted to evaluate the models, keeping the same challenge or environment for all models in each attempt. The performance of the approach has been evaluated in the following cases:
- Orange grasping and placing without any occlusion.
- Orange grasping and placing with occlusion.
- Orange grasping and placing from a small plant.
- The pre-trained YOLOv8 model has performed poorly in terms of detection accuracy and grasping attempts. The reason is that the Calamondin orange is smaller than other varieties of oranges. The custom-trained YOLOv8 object detection and pose models have performed better than the pre-trained model, with YOLOv8 pose having higher detection rates. The overall detection accuracy with YOLOv8 pose averages more than 50% across the test cases, including the occlusion cases, and is five to seven percent higher than that of the custom-trained YOLOv8 object detection model.
- The successful grasping attempts with YOLOv8 pose have been the highest in each of the cases, with the custom-trained YOLOv8 close behind. However, with the custom-trained YOLOv8, the fruit has most of the time fallen off prematurely before reaching the placing pose, thus causing damage to the fruit. The fruit shape approximation is more accurate with the YOLOv8 pose model.
- The custom-trained YOLOv8 object detection model has achieved more stable grasps in the occlusion cases than in the non-occlusion cases; however, it has also caused damage to the surrounding fruit. The higher grasp stability comes from the lower visibility of the orange: the smaller bounding box width governs the opening of the gripper and results in a tighter grip, which in turn increases the damage to fruits.
- Leaves and branches have obstructed the fruit grasping in some cases, and if the grasp is not stable, the fruit either falls while being plucked or gets damaged. Because of the low detection rates of the pre-trained YOLOv8 model, the detected fruit is most of the time located on the outer boundary of the plant; hence, there are no fruit damage cases for it.
- The depth estimation of the camera is not accurate at the boundaries of the field of view, i.e., near the rectangular boundaries of the image and just outside the specified range of the camera. This has caused variations in the goal position and has affected path planning and stable grasping for all models.
- In the non-occlusion cases with the YOLOv8 pose model near the rectangular boundary of the field of view, the width transforms have not been published consistently; the gripper has then switched to the predefined closing value, resulting in unstable grasps. In some instances the fruit has still been dropped into the box, slightly before reaching the placing pose, and this has been counted as an unstable grasp.
- The custom-trained YOLOv8 pose model is five to seven percent more accurate in terms of detection accuracy than the custom-trained YOLOv8 object detection model.
- The fruit grasping success on the small plant is almost double that of the object detection models.
- The fruit shape approximation is better than the bounding box-based approximation and has caused less damage to fruits.
- The harvesting success rate lies between 60% and 70% for the indoor test cases, whereas with the object detection models it falls between 30% and 60%.
Conclusion
Multiple key points, namely top, center, bottom, left, and right, have been used to estimate the shape of the fruit, and a provision has been made to harvest the fruit from either the center or the bottom key point based on the higher detection confidence, which enables fruit harvesting from any view. The YOLOv8 pose model has been trained on public and custom datasets and evaluated against pre-trained and custom-trained YOLOv8 object detection models. YOLOv8 pose has performed better than the baseline models in terms of detection accuracy and ensuring a stable grasp, and it has a higher harvesting success percentage when tested on similar targets and situations. Due to limited resources, small weights have been used to reduce the training and inference time for the harvesting task. The approach has been tested and evaluated successfully with the robot setup in the laboratory on an orange plant. The following are the benefits of the proposed approach:
- The approach detects the key points of the fruits in a single stage, requiring less effort than two-stage approaches, which first detect the fruit and then use the detections to isolate the fruit point cloud.
- In the proposed methodology, the goal position is derived from the bottom and center key points of the fruit, so there is no dependency on any fixed side or bottom view.
- The methodology has performed better than the object detection model-based approach.
The approach also has the following limitations:
- The YOLOv8 pose model detects only one class; it is not possible to train one model for multiple classes of fruit.
- The methodology has been tested with the setup only for small-region movements, due to the working range limits of the camera and the geometric constraints of the arm.
- The methodology has been tested on a small plant and requires testing on a tree or on a farm, where occlusion and obstruction from branches and leaves are higher.
Contribution
Key point-based methodologies have been present in fruit harvesting works, and most of them revolve around determining the mask and estimating the centroid, a fixed pixel distance, and so on. Multiple key point detection with computer vision models has been tested before for estimating fruit and vegetable shape, for instance with Detectron2 [Ref.]; however, it has not been deployed in the scope of fruit harvesting with robots. The latest version of the human pose estimation model from Ultralytics, YOLOv8 pose, has been tuned to detect key points on oranges directly, without any post-processing. The devised approach estimates the shape by averaging the distances between key points to approximate the fruit and ensures that, whether the fruit is visible in the side view, bottom view, or any other view, the robot arm can approach it and perform the harvesting task successfully.
Future works
The key point estimation model-based approach has been tested and compared with the object detection models. The methodology could further be evaluated against point cloud-based or segmentation model-based approaches. Testing on the farm and making the complete process autonomous is the next step, with the mobile base navigating in a prescribed area on a map or outdoors with the Global Positioning System (GPS), height adjustment of the robot arm with a prismatic joint setup, and performing the fruit harvesting. A possible improvement of the approach could be the use of larger NN weights, which would stabilize the detections but would require higher computation resources.