Autonomous Orchard harvesting

Autonomous Orchard harvesting - Discover the Future of Autonomous Orchard Harvesting: Revolutionizing Fruit Picking and Farming Efficiency
Explore the future of agriculture with ROS-enabled autonomous orchard farming. Maximize efficiency and sustainability using cutting-edge robotic systems and ROS integration for seamless orchard management. Discover how ROS automation is transforming fruit cultivation for a smarter, greener future. [4]

Project description

Robotics has great potential in the agricultural sector: robots can simplify tasks, reduce fatigue for laborers, and perform repetitive tasks persistently. Robotic fruit harvesting involves tasks such as fruit detection, grasping, and packaging. This project aims to add value to robotic fruit harvesting by simplifying and testing novel, smart vision-based solutions. Deploying robots for fruit picking is challenging because of the uncertainties in the environment. The scope of the work is therefore narrowed to fruit localization, grasping, and placing, that is, to robotic manipulation with vision, and the fruit grasping approach is validated in simulation and in experiments. State-of-the-art methodologies in robotic fruit harvesting follow a two-stage process: first the fruit location is identified, and then further processing is applied, such as sphere fitting on the fruit point cloud, centroid estimation, and axis estimation. The existing challenges include occlusion, fruit shape estimation, interference from leaves and branches, and slippage. The project is carried out in collaboration with the Institute of Mechanism Theory, Machine Dynamics and Robotics (IGMR), RWTH Aachen, as part of a thesis. The next step of the project is to combine autonomous manipulation with autonomous navigation to perform autonomous fruit harvesting. The following sections outline the objectives and describe them.

Figure: Inspire Robotics gripper testing
Figure: Vision model testing
Figure: xArm5 testing in simulation
Figure: xArm5 testing on the real arm

Related terms

A robot structure is a combination of electrical, mechanical, and electronic parts. In general, a robot structure has four basic systems: Sensors, Actuators, Motors, and Controllers [Ref.]. Kinematic models in robotics are used to study the motion of, and the interactions within, the robot system. Some essential terms are used to define robotic and vision models, and understanding them is important.

  1. Links, joints and end effector: A robot structure is a multi-body system modeled by rigid structures, called links, which are connected by joints that enable movement. An end effector is the outermost point of the last link in the robotic link structure [Ref.], [Ref.], [Ref.].
  2. Operational and joint space: The operational space defines the end effector pose. The joint space defines the joint values that the robotic structure needs to reach a goal position [Ref.].
  3. Forward Kinematics and Inverse Kinematics (IK): Forward kinematics computes the end effector pose in the operational space from given joint values. Inverse kinematics does the reverse and computes the joint values required for the end effector to reach a goal pose [Ref.], [Ref.].
  4. Image and colorspace: In the vision domain, an image is a single matrix or multiple matrices of numbers for color representation [Ref.]. In RGB colorspace, all colors are represented by a combination of red, green, and blue matrices, which are referred to as channels. An image with a single channel is referred to as a grayscale image. Each channel of an RGB image is a 2D matrix of encoded numbers, called pixels. So, in RGB colorspace, an image is a combination of red, green, and blue matrices with pixel values. Mathematical operations on these matrices are used to perform tasks like edge detection, noise removal, and color detection.
  5. Convolution: Convolution is a mathematical operation defined as the integral of the product of two functions, one of which is reversed and shifted; the result tells how the functions are related. In the vision domain, an image matrix is multiplied element-wise with another matrix, called the kernel, and the products are summed. The kernel traverses the whole image to perform these calculations and generates a new matrix [Ref.].
  6. Deep Learning (DL), Neural Network (NN) and Convolutional Neural Network (CNN): Deep Learning (DL) is the deployment of artificial neurons to learn the features within the data. In simple terms, DL uses multiple layers of Neural Networks (NNs) to learn the key features of the data that help in predicting the output for some input data. Learning from the data and making sense of it is performed by the algorithm. Machine Learning models in general also learn the mapping between input and output data, but they are not restricted to NN algorithms, so DL is a specialized field of Machine Learning. The standard working unit of a Neural Network is the Perceptron, which has inputs, weights, and a bias parameter. It calculates the output by multiplying the inputs by the weights, summing the products, and adding the bias. It learns the mapping between the input and output by backpropagation of gradients, which changes the weight and bias values over the iterations. When several Perceptrons are arranged in a layer, they form a multi-perceptron layer. A standard NN has an input layer, hidden layers, connections between the layers, activation functions, and an output layer. An NN learns the mapping between the input and output by optimizing the weights and biases of the Perceptrons. For image data, Convolutional Neural Networks (CNNs) have been quite successful in learning the mapping between the input image and the output of computer vision tasks like classification, object detection, or segmentation [Ref.].
  7. Object classification, detection and segmentation: In the object classification task, the output layer of a Neural Network (NN) generates a single output encoding or number, which is mapped to a class like dog, apple, etc. The object detection task requires the location as well as the classification of the object. In addition to the encoding for classification, the NN is trained to generate two pairs of (x,y) coordinates that define the box around the object for anchor-based detection. For anchor-free detection, one pair of (x,y) coordinates for the object center is required. The segmentation task classifies each pixel in the image as belonging to a class. The image undergoes convolutions and upscaling to generate a matrix output, and during training the NN adjusts the weights and biases to predict the same values for a class. The instance segmentation task additionally separates the individual instances of a class, so that it is known how many times the class is present, whereas the standard segmentation task only requires the identification of the class.
    Figure: Object detection and segmentation [Ref.]
    Figure: Forward and inverse kinematics [Ref.]
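To make the difference between forward and inverse kinematics concrete, the following is a minimal sketch for a planar two-link arm. It is not a model of the xArm5; the link lengths `l1` and `l2` and the function names are assumptions chosen only for illustration.

```python
import math


def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """Joint space -> operational space for a planar two-link arm (illustrative)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """Operational space -> joint space.

    Returns one of the two possible solutions, or None if the goal
    lies outside the reachable workspace.
    """
    cos_theta2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_theta2) > 1.0:
        return None  # goal is unreachable
    theta2 = math.acos(cos_theta2)
    theta1 = math.atan2(y, x) - math.atan2(
        l2 * math.sin(theta2), l1 + l2 * math.cos(theta2)
    )
    return theta1, theta2


joints = inverse_kinematics(1.2, 0.5)
if joints is not None:
    # Feeding the IK result back into FK recovers the goal position.
    print(forward_kinematics(*joints))
```

Even this small arm has two IK solutions for most goals and none for goals out of reach, which hints at why IK for an arm with fewer than six DOF can fail for some goal poses.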

Related Literature

A wide range of approaches has emerged for vision-based robotic fruit harvesting. To get a holistic view of the trend, the state-of-the-art works are categorized by the problem they aim to solve during grasping and harvesting with robots:

  1. Occlusion-related works [Ref.] [Ref.] [Ref.] [Ref.]
  2. Segmentation improvement-related works [Ref.] [Ref.] [Ref.] [Ref.] [Ref.] [Ref.]
  3. Localization improvement-related works [Ref.] [Ref.] [Ref.]
  4. Novel architectures or approaches [Ref.] [Ref.]
Occlusion is a difficult problem to tackle, and the trend in recent works is to estimate the shape of the hidden portion of the fruit by processing point cloud data. The visibility of only a small portion of the fruit remains a challenge in these works. False detections may lead to fruit and gripper damage, so there is both a need and scope for robust detection and damage-free grasping. If a computer vision model were incorporated that provides information on multiple grasping points, in the way that the CenterNet model in [Ref.] provides one point for the whole object, it would increase the chances of safe and successful grasping.
In the segmentation improvement works, the trend is towards utilizing information from the surroundings or determining the fruit axis. These approaches attempt to incorporate tree branch or stem information. However, this does not apply to all cases and requires considerable data annotation time and effort for individual fruits, since labeling branches in hundreds or thousands of images is time-intensive. Transformer-based networks are also utilized for the generation of masks. Instance-segmented masks are used more often than semantic-segmented masks in the related works, as individual fruit masks are detected. Because the quality of the masks is affected by obstruction from leaves, lighting conditions, surrounding fruits, branches, etc., another network for point cloud processing is used to identify the grasping pose.
In the object detection-related works, the models generate an estimated rectangular bounding box for the fruit, and additional features such as the bottom or stem are integrated to obtain grasping points. If all the points inside a bounding box are used for point cloud filtering, as is done with masks in the instance segmentation works, background points and noise are included along with the fruit points. Accordingly, instance segmentation networks are used more frequently than object detection models in recent works.
The analysis of these works suggests that a suitable approach for robotic grasping should focus on distinct and identifiable key features of the fruit that help in making a quick judgment. Since occlusion is inevitable, the approach should focus on the available information and on how it can be combined to map the fruit shape accurately and make the robot grasping task easier. For instance, the fruit center and stem are identifiable features and should be sufficient to judge the size in most situations. Therefore, an approach that incorporates key features of the fruit would be a feasible solution that could be generalized and could simplify robot harvesting.
| Author, year | Methodology | Data type | Key innovations | Pros | Cons | Harvest / grasp, fruits |
|---|---|---|---|---|---|---|
| Li et al. (2022) [Ref.] | Occlusion workaround | RGB-D | Frustum point cloud fitting | Robust against occlusion | Structured farm testing | Yes, Apples |
| Gong et al. (2022) [Ref.] | Occlusion workaround | RGB-D, Infrared | Reconstruction with CNNs | Restoration of shape | Collision | Yes, Tomatoes |
| Menon et al. (2022) [Ref.] | Occlusion workaround | RGB-D | Reconstruction with software | Less manual touch | Complicated | Yes, Sweet peppers |
| Liu et al. (2022) [Ref.] | Occlusion workaround | RGB | Key point estimation | Circular bounding boxes | Not tested on robot | Yes, Tomatoes |
| Yan et al. (2023) [Ref.] | Segmentation improvement | RGB-D | Transformer segmentation | Stem & grasping key points | Not tested on robot | Yes, Pumpkin |
| Kang et al. (2020) [Ref.] | Segmentation improvement | RGB-D | Dasnet | Fruit & branches segmentation | Obstruction from leaves | Yes, Apples |
| Kang et al. (2020) [Ref.] | Segmentation improvement | RGB-D | Mobile-Dasnet and PointNet | Robust fruit point cloud | Obstruction from other fruits | Yes, Apples |
| Kang et al. (2021) [Ref.] | Segmentation improvement | RGB-D | YOLACT (You Only Look at Coefficients) & PointNet | Robust fruit point cloud | Tested in structured farm | Yes, Apples |
| Lin et al. (2019) [Ref.] | Segmentation improvement | RGB-D | Branch normals prediction | Fruit axis estimation | Occlusion affected results | Yes, Guava |
| Lin et al. (2019) [Ref.] | Segmentation improvement | RGB-D | Gaussian Mixture Models | Adaptable for multi fruits | Not tested on robot | Yes, Citrus fruits |
| Yu et al. (2020) [Ref.] | Object detection | RGB-D | Oriented bounding boxes | Stem orientation | False detections | Yes, Strawberries |
| Onishi et al. (2019) [Ref.] | Object detection | RGB-D | Underside grasping | Damage free grasping | Vertical orientation only | Yes, Apples |
| Chen et al. (2022) [Ref.] | Object detection | RGB | Vision based impedance | Damage free grasping | Planar surface grasping | No, Apples, Oranges |
| Lin et al. (2023) [Ref.] | Grasping rectangle proposals | RGB | Shape approximation | Work for unseen objects | Planar surface grasping | No, Banana |
| Chen et al. (2023) [Ref.] | Reinforcement learning | RGB-D | Soft Actor-Critic (SAC) algorithm | Learning in simulation | Planar surface grasping | No, Banana |
Table: Literature review

Developing the Approach

The following considerations and observations are drawn from the state-of-the-art works:

  1. Deep NN models have been used extensively for vision-based solutions.
  2. A two-stage vision approach with a depth camera is typically followed, after which motion planning is performed with an IK solution for the joints.
  3. The first stage performs object detection or instance/semantic segmentation to isolate the fruit from the background in the image.
  4. The second stage involves point cloud processing or geometry manipulation techniques, such as fitting a sphere or frustum within a point cloud, for estimating the grasping pose. The complete scene point cloud is not processed.
The challenges with the two-stage vision-based solution for robot grasping are:
  1. Training of two or more NNs, which requires considerable computation resources both for training the networks and on the robot system.
  2. Time investment in data preparation for multiple networks. Usually, the data annotation formats are different, and objects have to be marked with bounding boxes or polygons on hundreds or thousands of images.
  3. Pose estimation based on simplified geometric estimation such as sphere fitting, which is not applicable to cylindrical or other shapes.
  4. Reliance on a single key feature, such as the stem for pumpkin grasping point estimation in [Ref.], so that the estimation cannot be processed properly in occlusion scenarios.
The methodologies in the related works find the grasping points or pose for a robotic arm in multiple stages. However, the grasping points or pose can also be approximated by detecting multiple key features on the fruit surface, such as the stem, bottom, center, and surface points, and using them for the estimation. A pose or key point estimation model is therefore a suitable model for key feature or point detection. Recently, human pose estimation models have been deployed successfully with NNs [Ref.]. The same idea can be employed and tested for fruit grasping, where it would simplify the robot grasping task. The advantages of the key point-based approach would be:
  1. Deployment of a single NN to get the grasping points, which means fewer computations compared with two stages of computation.
  2. Less time and effort invested in dataset preparation, as annotations in different formats no longer have to be prepared for two different networks.
  3. Generalization of fruit harvesting with robots. Some approaches use axis information for grasping pose estimation; however, every fruit has some unique points, such as the center, stem, and bottom, so the approach can be expanded to multiple domains.
An RGB-D camera is used in almost all the related works. In some works it is attached near the end effector on the robot arm, while others use a camera fixed on the platform. The key point detection model outputs the coordinates of the fruit points, similar to the bounding box center coordinates from object detection models. The methodology is therefore somewhat similar to that of an object detection-based model, although the model preparation approach and the planning steps differ. Because of this similarity with the object detection-related works, the approach for the key point model is built upon [Ref.]. The methodology takes RGB-D data as input from the camera node, which publishes it on ROS topics. The key point detection node subscribes to the image topics and performs key point detection on the image. The depth information is then fused to obtain the 3D coordinates of the key points, which are then published. The goal position and the width information are extracted from the key points. The path to the goal position is planned in MoveIt using its IK solvers. Once the plan is ready, the move group node of MoveIt makes the robot arm traverse to that point; by closing and opening the gripper, the fruit is grabbed and then dropped at a fixed position.
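The depth fusion step can be illustrated with the standard pinhole camera model. The sketch below is an assumption about how a 2D key point and its depth value could be turned into a 3D point in the camera frame; it ignores lens distortion, and the intrinsic values (`fx`, `fy`, `cx`, `cy`) are placeholders rather than the calibration of the camera used in the project.

```python
def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Convert a pixel (u, v) with depth in meters to a 3D point in the camera frame.

    Uses the pinhole model without distortion. Returns None when the
    depth reading is invalid (zero or negative).
    """
    if depth_m <= 0.0:
        return None
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Placeholder intrinsics for a 640x480 image, not real calibration values.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A key point detected 60 pixels to the right of the image center, 0.3 m away.
print(deproject_pixel(380, 240, 0.3, FX, FY, CX, CY))
```

In a ROS setup the intrinsics would normally come from the camera's `sensor_msgs/CameraInfo` message, and the resulting point would still have to be transformed from the camera frame to the robot base frame before planning.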

Research design

Key point detection is a computer vision task that locates certain key features within an image [Ref.]. It has been used in Augmented Reality, face matching, object detection, and so on. Classical detectors such as Scale Invariant Feature Transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) have been used for matching object key points; however, DL-based models offer more flexibility [PP23] and have been used successfully for human and animal pose detection [Ref.]. Some of them are intended for RGB image data and some for 3D point cloud data. Pose detection models, in turn, locate key points on the body, such as joints and eyes, and then connect them to output the pose of the body [Ref.]. The following are some of the state-of-the-art key point and pose detection models [Ref.], [Ref.]:
  1. Mask R-CNN
  2. CenterNet
  3. Detectron2
  4. Keypoint Detection and Feature Extraction for Point Cloud Registration (Kpsnet)
  5. OpenPose
  6. Simple Vision Transformer Baselines for Human Pose Estimation (ViT-Pose)
  7. Key.Net
  8. YOLOv8 (You Only Look Once version 8)
The models presented above are evaluated with a decision matrix to select a suitable one based on the following criteria:
  1. Application feasibility: whether the model could be used or adapted for fruits.
  2. Ease of use: whether the model possesses desirable properties like high detection performance, ease of training, etc.
  3. Custom tuning feasibility: whether the training data for the model is available or whether the dataset has to be large. Some works haven’t been shared publicly, which could impact the model selection.
  4. Novelty: whether the model has already been used in similar works related to fruit grasping.
The Harvey Balls method [Ref.] has been used to evaluate the models. It is a visual representation method that uses balls to score a performance parameter. A score is assigned to each parameter, and the model with the highest total score is selected. Three types of balls are used to score the parameters:
  1. A solid black ball means that the parameter suits the requirement completely and counts as one point.
  2. A half-black ball means that the parameter might fit, because the information is not available or not comparable with the other models, and counts as half a point.
  3. A solid white ball means that the parameter doesn’t fit and counts as zero points.
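The scoring rule above can be sketched as a small decision matrix. The model names `model_a` and `model_b` and their ratings below are hypothetical placeholders chosen only to show the arithmetic; they are not the ratings from the actual evaluation.

```python
# Points per Harvey ball type, as described in the list above.
BALL_SCORES = {"full": 1.0, "half": 0.5, "empty": 0.0}

CRITERIA = (
    "application_feasibility",
    "ease_of_use",
    "custom_tuning_feasibility",
    "novelty",
)


def total_score(ratings):
    """Sum the points of one model over all criteria."""
    return sum(BALL_SCORES[ratings[criterion]] for criterion in CRITERIA)


def select_model(matrix):
    """Return the name of the model with the highest total score."""
    return max(matrix, key=lambda name: total_score(matrix[name]))


# Hypothetical ratings, for illustration only.
example_matrix = {
    "model_a": {
        "application_feasibility": "full",
        "ease_of_use": "half",
        "custom_tuning_feasibility": "full",
        "novelty": "full",
    },
    "model_b": {
        "application_feasibility": "half",
        "ease_of_use": "full",
        "custom_tuning_feasibility": "empty",
        "novelty": "half",
    },
}

for name, ratings in example_matrix.items():
    print(name, total_score(ratings))
print("selected:", select_model(example_matrix))
```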

Evaluation of models based on Harvey balls visualization and scoring

YOLOv8 pose estimation has the highest score and is therefore selected as the suitable model. The next section describes the model in depth.

YOLOv8 Description and Training

YOLOv8 is a multipurpose NN model from Ultralytics [Ref.] that offers object detection, segmentation, pose estimation, tracking, etc. The YOLO series models are known for their good detection speed and user-friendliness [Ref.]. YOLOv8 pose detects object instances, such as persons, and predicts the key points associated with each instance. Pose estimation models fall mainly into two categories [Ref.]:

  1. Top-down methods: These methods first detect the object and then detect the pose for each object using key points or features. They have higher complexity but better accuracy.
  2. Bottom-up methods: These methods detect the key points in a single stage, mostly by estimating heat maps for the key points and relating them to the object via grouping. They are faster than top-down methods but comparatively less accurate.
The YOLO-based pose estimation model falls into the top-down category [Ref.]. The person is detected first, and then the probability, bounding boxes, and key points are processed. The key features and observations related to the YOLOv8 pose model are as follows:
  1. The models are intended for human pose estimation with 17 key points and tiger pose estimation with 12 key points, such as eyes, nose, and joints. The skeletal structure output is enabled only for the 17 key point input [Ref.].
  2. Only one class of object can be detected at a time. The class can be modified by providing the encoding information in the YAML file.
  3. There are six architectures for YOLOv8 pose estimation models [Ref.]: YOLOv8n-pose (nano), YOLOv8s-pose (small), YOLOv8m-pose (medium), YOLOv8l-pose (large), YOLOv8x-pose (extra large) and YOLOv8x-pose-p6 (extra large with an additional layer).
  4. The model outputs a vector with the x and y coordinates and a confidence score or visibility value for each key point.
  5. The models are trained on the Common Objects in Context (COCO) pose dataset [Ref.], and the COCO dataset format and the YOLO pose annotation format are compatible.
Fruit Selection: One of the limiting factors of the model is single-class detection, so a fruit has to be selected to proceed further. The orange has been selected as the target fruit for the use case. A combination of public datasets and images captured with the camera has been selected for training the model. Some infested oranges are also included for better training of the model. As the model is not originally intended for fruits, manual data annotation must be performed in either COCO keypoint or YOLO pose format. The platform in [Ref.] is an open-source data annotation platform that offers multiple export formats; the annotations are exported in COCO Keypoints 1.0 format and converted to YOLO pose format before training [Ref.]. The following data sources are used for preparing the training and validation datasets:
  1. Fruit Images for Object Detection: This dataset is available on Kaggle [Ref.] and is split into train and test sets. It contains images of apples, bananas, and oranges with XML annotation files for object detection. The images vary from oranges on a tree to oranges on a white background.
  2. Roboflow datasets: Two datasets from the Roboflow platform have been taken [Ref.] [Ref.]. They contain around 140 images in total, showing oranges on a tree and on the farm, damaged oranges with scabs, and some oranges on white backgrounds.
  3. Images from camera: Images taken with the camera of the robot setup are used for training, so that the model is tuned to the camera settings and can be deployed effectively for real harvesting tasks.
The goal is to detect an orange in any configuration. Keeping the manual annotation effort in mind, around 180 images have been selected for training, including images from the camera of the robot setup, and around 50 for validation. Five key points on the fruit have been selected for orange detection with the YOLOv8 model: top, center, bottom, left, and right.
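The conversion from a COCO keypoint annotation to a YOLO pose label line can be sketched as follows. This is a minimal illustration, assuming the usual COCO layout (bounding box as `[x_min, y_min, width, height]` in pixels, key points as a flat `[x, y, v, ...]` list) and the YOLO pose layout (class index, normalized box center and size, then normalized `x y v` per key point). The sample annotation values are made up, and the actual conversion used in the project may differ.

```python
def coco_to_yolo_pose(bbox, keypoints, img_w, img_h, class_id=0):
    """Build one YOLO pose label line from a single COCO keypoint annotation."""
    x_min, y_min, box_w, box_h = bbox
    box_fields = [
        (x_min + box_w / 2.0) / img_w,  # normalized box center x
        (y_min + box_h / 2.0) / img_h,  # normalized box center y
        box_w / img_w,                  # normalized box width
        box_h / img_h,                  # normalized box height
    ]
    parts = [str(class_id)] + [f"{value:.6f}" for value in box_fields]
    # COCO stores key points as a flat list: x1, y1, v1, x2, y2, v2, ...
    for i in range(0, len(keypoints), 3):
        kx, ky, visibility = keypoints[i:i + 3]
        parts += [f"{kx / img_w:.6f}", f"{ky / img_h:.6f}", str(int(visibility))]
    return " ".join(parts)


# Made-up annotation of one orange in a 400x400 image, with the five key
# points in the order top, center, bottom, left, right.
sample_bbox = [100, 50, 200, 200]
sample_keypoints = [
    200, 50, 2,    # top
    200, 150, 2,   # center
    200, 250, 2,   # bottom
    100, 150, 2,   # left
    300, 150, 1,   # right (labeled but occluded)
]
print(coco_to_yolo_pose(sample_bbox, sample_keypoints, 400, 400))
```

With five key points and three values each, every label line has 1 + 4 + 15 = 20 fields, which corresponds to a key point shape of `[5, 3]` in the dataset configuration.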
Figure: Data annotation on a sample image
Figure: Data annotation on a sample image
Figure: Key point detection with YOLOv8n-pose (nano) on sample images
Figure: Key point detection with YOLOv8n-pose (nano) on sample images
The losses show a decreasing trend and have converged. The precision is high, meaning that the majority of the detections made by the trained model are correct. Similarly, the recall is high, meaning that the majority of the actual objects in the images are detected by the model. The mean average precision of the model lies between 80% and 90%, and the detections on the sample images are acceptable for the next steps. Larger weights could be considered for avoiding false detections and obtaining more reliable results; however, they would increase the computation time. Since the system on the robot setup doesn’t have high-end computation resources, the nano weights are used.
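As a reminder of how precision and recall are defined, the short sketch below computes both from counts of true positives, false positives, and false negatives. The counts in the example are hypothetical and are not results from the trained model.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from detection counts."""
    detections = true_pos + false_pos      # everything the model reported
    ground_truth = true_pos + false_neg    # everything actually in the images
    precision = true_pos / detections if detections else 0.0
    recall = true_pos / ground_truth if ground_truth else 0.0
    return precision, recall


# Hypothetical counts: 45 correct detections, 5 false alarms, 10 missed fruits.
precision, recall = precision_recall(45, 5, 10)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```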

Performance of YOLOv8 pose nano model: Loss curves on training and validation datasets showing convergence of losses. Precision and recall curves reach close to annotated data.

Vision Models and Multiple Detections

The YOLOv8 object detection model is selected as the baseline for comparing the approach, since the system doesn’t have an Nvidia Graphics Processing Unit (GPU) and all computations are performed on the system itself. The two models are similar: the object detection model outputs the center of the bounding box, while YOLOv8 pose outputs multiple key points along with the bounding box. Two variants of the YOLOv8 object detection model are used: one with pre-trained weights and one trained for 200 epochs on a dataset similar to the one used for YOLOv8 pose. Three cases are compared:

  1. Fruit harvesting with YOLOv8 pose trained on the custom dataset.
  2. Fruit harvesting with the pre-trained YOLOv8 object detection model.
  3. Fruit harvesting with the YOLOv8 object detection model trained on the custom dataset.

Comparison of goal position with baseline models.

During fruit harvesting on a tree, fruits are usually adjoining or close to one another, so a strategy is needed for the case where multiple fruits are detected. A higher detection probability lowers the risk of damage to the robot setup and helps to achieve the task accurately. Priority is therefore given to the fruit with the highest detection probability when choosing the goal position for the robot motion plan. The workflow is shown in the figure below. When multiple detections are made by the YOLOv8 models:
  1. The instance with the highest bounding box probability is selected first as the target. At least one of the center and bottom key points must be visible, and only these two key points are considered for the goal position.
  2. One of the two key points is selected based on the confidence values of the bottom and center points, with a threshold value of 0.5, so that the goal position has a good confidence score. For the YOLOv8 object detection models, the center of the bounding box is selected.
  3. With the YOLOv8 pose model, the shape is estimated from the average of the distances between the key points, which gives a width or diameter estimate. With the YOLOv8 object detection models, the bounding box width governs the opening of the gripper.
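The selection steps above can be sketched as follows for the pose model. The dictionary layout of a detection is hypothetical. The sketch also makes three assumptions that the text does not fix: when both candidate key points pass the 0.5 threshold, the more confident one is taken; an instance with neither key point above the threshold is skipped in favor of the next most probable one; and the width is taken as the mean of the left-right and top-bottom distances.

```python
import math

CONF_THRESHOLD = 0.5  # key point confidence threshold from the text


def select_goal(detections, threshold=CONF_THRESHOLD):
    """Pick the target instance and its goal key point.

    Each detection is a dict with 'box_conf' and 'keypoints', where
    'keypoints' maps a name to (x, y, confidence). Returns
    (detection, keypoint_name, (x, y)) or None if nothing qualifies.
    """
    ranked = sorted(detections, key=lambda d: d["box_conf"], reverse=True)
    for det in ranked:
        candidates = {
            name: det["keypoints"][name]
            for name in ("center", "bottom")
            if det["keypoints"][name][2] >= threshold
        }
        if candidates:
            # Assumption: prefer the more confident of the two key points.
            name = max(candidates, key=lambda n: candidates[n][2])
            x, y, _ = candidates[name]
            return det, name, (x, y)
    return None


def estimate_width(keypoints):
    """Assumed width estimate: mean of left-right and top-bottom distances."""
    def dist(a, b):
        return math.hypot(keypoints[a][0] - keypoints[b][0],
                          keypoints[a][1] - keypoints[b][1])
    return (dist("left", "right") + dist("top", "bottom")) / 2.0


example_detections = [
    {"box_conf": 0.7, "keypoints": {
        "top": (300, 80, 0.9), "center": (300, 100, 0.9),
        "bottom": (300, 120, 0.9), "left": (280, 100, 0.9),
        "right": (320, 100, 0.9)}},
    {"box_conf": 0.9, "keypoints": {
        "top": (100, 88, 0.8), "center": (100, 100, 0.3),
        "bottom": (100, 112, 0.8), "left": (90, 100, 0.7),
        "right": (110, 100, 0.7)}},
]

target, point_name, goal_xy = select_goal(example_detections)
print(point_name, goal_xy, estimate_width(target["keypoints"]))
```

The width here is still in pixels; converting it to a gripper opening would require the depth information as well.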

Multiple instances detection and goal position selection with YOLOv8 pose.


Once the fruit location and width are available, IK and motion planning are performed with the MoveIt library in ROS. To ensure safety and to confirm that the robot is ready and all systems are working, the following action sequence is used:

  1. Successful publishing of stabilized transforms for the orange key point or bounding box center from the key point detection node.
  2. Setting the arm to the harvesting-ready position with fixed joint values (the picking pose), which indicates that the robot is ready for the grasping task. If path planning fails to generate a solution, the position of the target fruit or of the robotic setup is changed.
  3. Turning the gripper on or opening the gripper.
  4. Traversing to the goal position and attempting a grasp by closing the gripper, or by suction with a vacuum gripper. If a grasping attempt fails, two more trials are made.
  5. Moving the robot arm to the fruit drop pose, another fixed joint pose near the collection spot (the placing pose).
  6. Opening the gripper to put the fruit in a basket or collector box.
  7. Moving back to the picking pose and turning the gripper off or closing it once the task is completed.
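The action sequence can be sketched as a simple control loop. The `robot` interface below (`move_to_named_pose`, `move_to_goal`, `open_gripper`, `close_gripper`, `gripper_off`) is hypothetical and stands in for the real MoveIt and gripper calls; `FakeRobot` exists only so that the sketch can be run without hardware. The precondition of step 1 and the handling of planning failures in step 2 are not modelled.

```python
MAX_GRASP_ATTEMPTS = 3  # first attempt plus two more trials


def harvest_once(robot, goal):
    """Run one pick-and-place cycle; return True if the fruit was grasped."""
    robot.move_to_named_pose("picking")
    robot.open_gripper()
    grasped = False
    for _ in range(MAX_GRASP_ATTEMPTS):
        robot.move_to_goal(goal)
        if robot.close_gripper():
            grasped = True
            break
        robot.open_gripper()  # release and retry
    if grasped:
        robot.move_to_named_pose("placing")
        robot.open_gripper()  # drop the fruit into the collector box
    robot.move_to_named_pose("picking")
    robot.gripper_off()
    return grasped


class FakeRobot:
    """Stand-in robot whose grasp succeeds on a chosen attempt number."""

    def __init__(self, succeed_on_attempt):
        self.succeed_on_attempt = succeed_on_attempt
        self.attempts = 0
        self.log = []

    def move_to_named_pose(self, name):
        self.log.append(f"pose:{name}")

    def move_to_goal(self, goal):
        self.log.append("goal")

    def open_gripper(self):
        self.log.append("open")

    def close_gripper(self):
        self.attempts += 1
        self.log.append("close")
        return self.attempts == self.succeed_on_attempt

    def gripper_off(self):
        self.log.append("off")


robot = FakeRobot(succeed_on_attempt=2)
print(harvest_once(robot, goal=(0.3, 0.1, 0.2)), robot.log)
```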

Pipeline for robot grasping planning with YOLOv8 pose and object detection models.

Robot Setup

The following components have been used for testing:

  1. UFactory xArm5 robotic arm
  2. Intel RealSense D405 depth camera
  3. Inspire Robotics right-hand gripper
The xArm5 is a five-DOF arm manufactured by UFactory. Its official catalog [Ref.] states that, due to geometric constraints, the arm behaves like a four-DOF arm during motion planning in some cases. Its reach is limited to 0.7 meters, and it can support a payload of up to three kilograms. The ROS drivers and software development kit of the xArm5 are shared on GitHub [Ref.]. The Intel RealSense D405 depth camera is a short-range stereo camera with a suitable range of seven to fifty centimeters that provides depth, RGB, and point cloud information [Ref.]. Its ROS drivers and model descriptions have been shared on GitHub [Ref.] and are used for integrating it with the xArm5. The right-hand gripper from Inspire Robotics has six DOF and can support a load of ten to fifteen newtons, which is approximately between one and two kilograms [Ref.]. The model description is provided on its official website. The catalog describes twelve joints in the hand, two per finger and four in the thumb, and the joint motion is mimicked within a finger. Thus, the finger movement is controlled about one axis, and the thumb can move about two axes. The technical specifications give a brief glimpse of the capabilities and of the challenges for integration and for performing the manipulation task. In summary, the combined setup has a workspace reach of around 0.7 meters with effectively four DOF for motion planning with IK solvers. The holding capacity for the fruit is limited by the finger load-bearing limit for safe operation. The suitable range of the camera for the object detection task lies within the workspace limits; however, it does not cover the whole workspace, so only the complete integration will show how the system behaves as a whole.
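The load and range figures above can be checked with a few lines of arithmetic. The sketch below converts the stated gripper load from newtons to an equivalent mass, assuming standard gravity of 9.81 m/s², and tests whether a distance falls inside the stated camera range; the function names are illustrative only.

```python
G = 9.81  # standard gravity in m/s^2 (assumed)

CAMERA_RANGE_M = (0.07, 0.50)  # suitable camera range stated above


def force_to_mass_kg(force_n):
    """Mass in kilograms whose weight equals the given force in newtons."""
    return force_n / G


def within_camera_range(distance_m, limits=CAMERA_RANGE_M):
    """True if the distance lies inside the camera's suitable range."""
    return limits[0] <= distance_m <= limits[1]


for force in (10.0, 15.0):
    print(f"{force:.0f} N is about {force_to_mass_kg(force):.2f} kg")

# A fruit at the edge of the 0.7 m arm reach is outside the camera range.
print(within_camera_range(0.3), within_camera_range(0.7))
```

The second check reflects the remark that the camera range does not cover the whole 0.7 meter workspace.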

Use Case Simulation

The GitHub repository [Ref.] provides the URDF and the MoveIt packages, with the specifications and constraints already added. However, the geometric constraints limit the arm's use for autonomous operation. Trials were conducted with MoveIt and the Software Development Kit of the xArm5 in Gazebo and on the real arm; nevertheless, the IK solvers could not find a solution for all goal states. The following approaches were tried and tested:

  1. Adjusting solvers and planners: MoveIt provides a multitude of planners through the Open Motion Planning Library (OMPL), such as Rapidly-exploring Random Trees (RRT) and Bi-directional Transition-based Rapidly-exploring Random Trees (Bi-TRRT). Switching between them did not solve the problem, as they did not generate solutions for the goal poses in most cases.
  2. Analytical solver plugin test: The IKFast plugin for MoveIt offers an option to integrate an analytical solver, which has to be run in either Docker or an older version of ROS. The analytical solver failed in the Docker image and did not generate solutions due to the constraints in the geometry [Ref.].
  3. Approximation in joint space planning: A function was written and tested that excludes the first and last joints of the xArm5, as the other three joints move in one plane. The resulting joint states would then be used to perform the arm movement along with the other two joints. However, this assumption yields multiple IK solutions, and filtering them is not a feasible option.
  4. Addition of joint tolerances for IK solvers: By adding tolerances to the goal configuration and using position-only IK, the solvers generate good solutions for almost all goal poses [Ref.].
  5. Cartesian path planning: The Cartesian planning approach takes points as input and plans the motion to reach the goal configuration. Without tolerances and position-only IK, it could not generate good solutions. By splitting the goal position into separate x, y, and z positions with tolerances, the motion planning worked fine.
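One possible reading of the last approach is that the arm moves to the goal one axis at a time and that a waypoint counts as reached once it is within a position tolerance. The sketch below illustrates that reading in plain Python, without MoveIt. The axis order, the tolerance value, and the function names are assumptions for illustration, not the project's implementation.

```python
def axis_wise_waypoints(start, goal):
    """Split a Cartesian move into three waypoints: along x, then y, then z."""
    x0, y0, z0 = start
    xg, yg, zg = goal
    return [
        (xg, y0, z0),  # move along x only
        (xg, yg, z0),  # then along y
        (xg, yg, zg),  # finally along z, arriving at the goal
    ]


def within_tolerance(position, target, tol):
    """Position-only check: every axis is within `tol` meters of the target."""
    return all(abs(p - t) <= tol for p, t in zip(position, target))


start = (0.20, 0.00, 0.30)
goal = (0.35, 0.10, 0.25)
POSITION_TOLERANCE_M = 0.005  # hypothetical tolerance

for waypoint in axis_wise_waypoints(start, goal):
    print(waypoint)

# A reached position a few millimeters off still counts as success.
print(within_tolerance((0.352, 0.099, 0.251), goal, POSITION_TOLERANCE_M))
```

Checking only the position, and not the orientation, is what relaxes the problem enough for an arm with fewer than six DOF.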
The camera is mounted on the last link of the xArm5 with a bracket and rotates along with the fifth joint of the arm. Extra tolerances are added to the box geometry to ensure safe, collision-free operation during motion planning. A transform frame for the camera is added to the URDF; it is used to transform the fruit position from the camera frame to the robot base frame. The D405 camera model is integrated into the URDF with meshes for the visual tag, and a box geometry is assumed for the collision tag to reduce the computational load.

The gripper geometry is a combined model in STEP format. It was separated part by part in Fusion 360 so that the joints and the movement of the fingers and thumb could be integrated for motion planning and grasping. In the finger joints, the fingertip joint mimics the rotation of the finger base. The thumb can rotate about two axes: for one rotation, three joints mimic the motion of the thumb base, and for the other, the whole thumb rotates.

The driver publisher is written in C++, with two joints per finger and four joints in the thumb. A mathematical function converts the joint state, which ranges from zero to one thousand, into angular values. The angles are then published on the joint states topic so that MoveIt can perform movements and collision checks. The end-effector point for motion planning is placed just above the palm so that the gripper does not collide with the orange while traversing. After integration of the geometry and transforms, the complete setup was tested with the MoveIt planning interface and the RViz plugin.
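
The conversion from the driver state (zero to one thousand) to joint angles is not given in the source. The sketch below assumes a simple linear mapping and is written in Python for brevity, although the actual driver is in C++. The joint limits, the direction of the scale (0 taken as open), the mimic ratio, and all function names are assumptions.

```python
import math

STATE_MIN = 0     # lower end of the driver state range
STATE_MAX = 1000  # upper end of the driver state range


def state_to_angle(state, angle_at_min=0.0, angle_at_max=math.radians(90.0)):
    """Linearly map a driver state in [0, 1000] to a joint angle in radians.

    The 0..90 degree joint range is an assumed placeholder, not a value
    taken from the gripper's datasheet.
    """
    clamped = max(STATE_MIN, min(STATE_MAX, state))
    fraction = (clamped - STATE_MIN) / (STATE_MAX - STATE_MIN)
    return angle_at_min + fraction * (angle_at_max - angle_at_min)


def finger_joint_angles(state, mimic_ratio=1.0):
    """Return base and tip angles for one finger.

    The tip mimics the base rotation; a ratio of 1.0 is an assumption.
    """
    base = state_to_angle(state)
    return {"finger_base": base, "finger_tip": mimic_ratio * base}


if __name__ == "__main__":
    print(finger_joint_angles(500))
```

The resulting dictionary corresponds to the values that would be placed in a joint states message, one entry per joint name.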
Picture 2
Keypoint transform publish
Picture 3
Visual geometry of gripper
Picture 4
Collision geometry of gripper
Picture 5
Added gripper transform frames
Picture 6
Gripper synchronization testing
Picture 7
Visual geometry of robot setup
Picture 8
Collision geometry of robot setup
Picture 9
Motion planning in Gazebo and RViz

Usecase Hardware

  1. Support: The robot setup is planned to be mounted on a mobile platform for outdoor testing. For safety in indoor testing, the robot has been mounted on a fixed horizontal support.
  2. Robot start pose and fruit dropping pose: A set of predefined poses is defined for the fruit harvesting task. The picking pose is the ready state for fruit harvesting, with the palm open and the last joint of the arm horizontal to the ground. The placing pose is the fruit-dropping state, with the palm open and the last joint of the arm rotated by 90 degrees.
  3. Path planning method for goal position: Once the 3D transform frame of the fruit is published, the x, y, and z positions in the robot base coordinate system are split into three parts for the Cartesian path planning method: x-axis displacement, y-axis displacement, and z-axis displacement. For the pose planning method, the Bi-TRRT planner of MoveIt performed better than the other planners during trials and was chosen for motion planning with tolerances to reach the goal position autonomously. The Cartesian planning method has been used for motion planning to avoid the dependency on the pose angles and the limited DOF due to the geometrical constraints of the arm.
  4. Gripper closing control: A mathematical function converts the required gripper opening into joint angles, so that the closing of the gripper is controlled by the key points and the bounding box width. If there are errors in publishing the width transform, the gripper closes to a predefined half-closed configuration.
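
The closing logic in item 4, including the half-closed fallback, could look roughly like the sketch below. The command scale (0 = fully open, 1000 = fully closed), the 10 cm maximum opening, the 2 mm squeeze margin, and the function name are all assumptions; only the fallback behaviour is taken from the description above.

```python
import math
from typing import Optional

HALF_CLOSED = 500  # fallback command on an assumed 0..1000 scale


def closing_command(width_m: Optional[float],
                    max_opening_m: float = 0.10,
                    squeeze_m: float = 0.002) -> int:
    """Convert an estimated fruit width (metres) to a gripper command.

    Assumed scale: 0 = fully open, 1000 = fully closed.
    Falls back to the half-closed command when no valid width is available.
    """
    if width_m is None or math.isnan(width_m) or width_m <= 0.0:
        return HALF_CLOSED
    # Close slightly past the estimated width so the fruit is held firmly.
    target_opening = min(max(width_m - squeeze_m, 0.0), max_opening_m)
    return round(1000 * (1.0 - target_opening / max_opening_m))


if __name__ == "__main__":
    print(closing_command(0.03))   # small fruit: gripper closes further
    print(closing_command(None))   # no width published: half-closed fallback
```

A narrower width estimate yields a larger closing command, which matches the observation that a smaller bounding box produces a tighter grip.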
Picture 2
Picking ready state
Picture 3
Fruit placing state
Picture 4
Gripper grasping oranges of big size
Picture 5
Gripper grasping oranges of small size

Usecase Observations

The speed was kept low during trials to avoid mishaps and damage to the robot arm and other components. An average of 20 trials per model was performed for each case, i.e. 60 trials per case across the three models. In total, 180 trials were conducted to evaluate the models, with the same challenge and environment maintained for all models in each attempt. The performance of the approach has been evaluated in the following cases:

  1. Orange grasping and placing without any occlusion.
  2. Orange grasping and placing with occlusion.
  3. Orange grasping and placing from a small plant.
The following are the outcomes of the trials along with the analysis of the results:
  1. The pre-trained YOLOv8 model performed poorly in both detection accuracy and grasping attempts. The reason is that the Calamondin orange is smaller than other varieties of oranges. The custom-trained YOLOv8 object detection and pose models performed better than the pre-trained model, with YOLOv8 pose achieving the higher detection rates. The overall detection accuracy of YOLOv8 pose averages more than 50% across the test cases, including the occlusion cases, and is five to seven percent higher than that of the custom-trained YOLOv8 object detection model.
  2. The number of successful grasping attempts was highest with YOLOv8 pose in each of the cases, with the custom-trained YOLOv8 object detection model close behind. However, with the custom-trained object detection model the fruit most often fell off prematurely, before reaching the placing pose, thus damaging the fruit. The fruit shape approximation is more accurate with the YOLOv8 pose model.
  3. The custom-trained YOLOv8 object detection model grasped more stably in the occlusion cases than in the cases without occlusion; however, it also caused damage to the surrounding fruit. The higher grasp stability arises because less of the orange is visible under occlusion: the smaller bounding box width governs the opening of the gripper and produces a tighter grip, which in turn increased the damage to fruits.
  4. Leaves and branches obstructed the fruit grasping in some cases; if the grasp is not stable, the fruit either falls while being plucked or is damaged. Because of the low detection rates of the pre-trained YOLOv8 model, the fruit it detected was mostly located on the outer boundary of the plant, and hence it caused no fruit damage.
  5. The depth estimation of the camera is not accurate at the boundaries of the field of view, that is, near the rectangular edges of the image and just outside the specified range of the camera. This caused variations in the goal position and affected the path planning and grasp stability for all models.
  6. For non-occlusion cases with the YOLOv8 pose model near the rectangular boundary of the field of view, the width transforms were not published consistently. In the absence of a width transform, the gripper switched to the predefined closing value, which resulted in unstable grasping. In some of these instances the fruit was dropped into the box slightly before the placing pose was reached, and this has been counted as an unstable grasp.
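
The condition described in items 5 and 6 (a detection near the image edge or outside the camera's depth range) can be written as a simple check. The source does not state that the system filters detections this way, so this is an illustrative sketch only. The border margin and the depth limits are placeholder assumptions and should be replaced with values from the camera datasheet.

```python
def is_reliable_detection(u, v, depth_m, image_w, image_h,
                          border_px=40, min_depth_m=0.07, max_depth_m=0.50):
    """Return True if a key point is away from the image edge and in depth range.

    u, v      : pixel coordinates of the key point
    depth_m   : measured depth at that pixel, in metres
    border_px : assumed width of the unreliable band along the image edges
    min/max   : assumed working range of the depth camera
    """
    inside_border = (border_px <= u < image_w - border_px
                     and border_px <= v < image_h - border_px)
    in_range = min_depth_m <= depth_m <= max_depth_m
    return inside_border and in_range


if __name__ == "__main__":
    print(is_reliable_detection(320, 240, 0.30, 640, 480))  # centre of the image
    print(is_reliable_detection(10, 240, 0.30, 640, 480))   # close to the left edge
```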
The key point-based methodology with YOLOv8 pose has shown:
  1. Custom-trained YOLOv8 pose is five to seven percent more accurate in detection than the custom-trained YOLOv8 object detection model.
  2. The fruit grasping success on the small plant is almost double that of the object detection models.
  3. The fruit shape approximation is better than the bounding box-based approximation and has caused less damage to fruits.
  4. The harvesting success rate lies between 60% and 70% for the indoor test cases, whereas with the object detection models it falls between 30% and 60%.
Picture 2
Autonomous grasping of orange from any position
Picture 3
Autonomous grasping of orange from plant
Picture 4
Motion planning in simulation
Picture 5
Camera view while performing orange grasping


Multiple key points (top, center, bottom, left, and right) have been used to estimate the shape of the fruit. A provision has been made to harvest the fruit from either the center or the bottom key point, depending on which is detected with higher accuracy, which makes it possible to harvest fruit in any view. The YOLOv8 pose model has been trained on public and custom datasets and evaluated against pre-trained and custom-trained YOLOv8 object detection models. YOLOv8 pose performed better than the baseline models in terms of detection accuracy and grasp stability. It also achieved a higher harvesting success percentage than the baseline models when tested on similar targets and situations. Due to limited resources, small weights have been used to reduce the training and inference time for the harvesting task. The approach has been tested and evaluated successfully with the robot setup in the laboratory on the orange plant. The following are the benefits of the proposed approach:

  1. The approach detects the fruits together with their key points in a single stage and requires less effort than two-stage approaches, which first detect the fruit and then use the detections to isolate the point cloud of the fruit.
  2. The bottom and center key points of the fruit are used as candidates for the goal position, so there is no dependency on a fixed side or bottom view.
  3. The methodology has performed better than the object detection model-based approach.
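
The choice between the center and bottom key points as the goal position can be sketched as follows. The source does not specify the selection rule, so using the per-key-point confidence score with a minimum threshold of 0.5 is an assumption, as are the names `Keypoint` and `select_goal_keypoint`.

```python
from typing import Dict, Optional, Tuple

Keypoint = Tuple[float, float, float]  # (u, v, confidence), hypothetical layout


def select_goal_keypoint(keypoints: Dict[str, Keypoint],
                         min_conf: float = 0.5) -> Optional[str]:
    """Pick 'center' or 'bottom', whichever is detected with higher confidence.

    Returns None when neither candidate reaches `min_conf` (assumed threshold).
    """
    candidates = {
        name: keypoints[name]
        for name in ("center", "bottom")
        if name in keypoints and keypoints[name][2] >= min_conf
    }
    if not candidates:
        return None
    return max(candidates, key=lambda name: candidates[name][2])


if __name__ == "__main__":
    detection = {"center": (100.0, 100.0, 0.9), "bottom": (100.0, 130.0, 0.6)}
    print(select_goal_keypoint(detection))
```

Because the rule only compares the two candidates present in the detection, it does not depend on whether the fruit is seen from the side or from below.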
The following are the limitations of the proposed approach:
  1. YOLOv8 pose model detects only one class. It is not possible to train one model for multiple classes of fruits.
  2. The methodology has been tested with this setup only for movements within a small region, due to the working range limits of the camera and the geometric constraints of the arm.
  3. The methodology has been tested on a small plant and requires testing on a tree or on a farm, where the occlusion and obstruction from branches and leaves are higher.


Key point-based methodologies already exist in fruit harvesting work, and most of them revolve around determining a mask and estimating the centroid, a fixed pixel distance, and so on. Multiple key point detection with computer vision models has been tested before for estimating fruit and vegetable shape, for instance with Detectron2 [Ref.]; however, it has not been deployed for fruit harvesting with robots. The latest version of the human pose estimation model from Ultralytics, YOLOv8 pose, has been tuned to detect key points on oranges directly, without any post-processing. The devised approach approximates the fruit shape by averaging the distances between key points, so that the robot arm can approach the fruit and perform the harvesting task successfully whether the fruit is visible in the side view, the bottom view, or any other view.
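
One possible reading of "averaging the distances between key points" is sketched below. The exact averaging scheme is not given in the source: combining the top-bottom span, the left-right span, and twice each center-to-edge distance is an assumption, as is the function name `estimate_diameter`.

```python
import math


def estimate_diameter(kp):
    """Estimate fruit diameter from five key points.

    `kp` maps 'top', 'bottom', 'left', 'right', 'center' to coordinate
    tuples (2D pixels or 3D metres; all points must use the same units).
    """
    vertical = math.dist(kp["top"], kp["bottom"])
    horizontal = math.dist(kp["left"], kp["right"])
    # Each center-to-edge distance is a radius, so double it.
    radial = [2.0 * math.dist(kp["center"], kp[name])
              for name in ("top", "bottom", "left", "right")]
    estimates = [vertical, horizontal] + radial
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    keypoints = {
        "center": (0.0, 0.0),
        "top": (0.0, 2.0),
        "bottom": (0.0, -2.0),
        "left": (-2.0, 0.0),
        "right": (2.0, 0.0),
    }
    print(estimate_diameter(keypoints))
```

Averaging several spans makes the estimate less sensitive to a single misplaced key point than a bounding box width alone.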

Future works

The key point estimation model-based approach has been tested and compared with object detection models. The methodology could further be evaluated against point cloud-based or segmentation model-based approaches. The next step is testing on a farm and making the complete process autonomous: the mobile base navigating in a prescribed area on a map or outdoors with the Global Positioning System (GPS), a prismatic joint setup adjusting the height of the robot arm, and the arm performing the fruit harvesting. A possible improvement could be the use of larger NN weights, which would stabilize the detections but would require more computational resources.

