Harun Mwenda Mbaabu
Data Science East Africa
5 min read · Jan 22, 2021


Asteroids Classification: Classifying Whether an Asteroid is Hazardous.

Asteroids

Asteroids are small, rocky objects that orbit the Sun. Although asteroids orbit the Sun like planets, they are much smaller than planets.

Larger asteroids have also been called planetoids. Millions of asteroids exist, and the vast majority of the known ones either orbit within the central asteroid belt, located between the orbits of Mars and Jupiter, or are co-orbital with Jupiter (the Jupiter Trojans).

The study of asteroids is also crucial because history shows that some of them can be hazardous. Remember the Chicxulub crater? It was formed by the asteroid impact that probably wiped out the dinosaurs roughly 66 million years ago.

Being a data science enthusiast, I thought of using machine learning to predict whether an asteroid could be hazardous or not.
Searching on Kaggle, I found NASA’s dataset about some of the asteroids discovered so far. The dataset contains various information about the asteroids and labels each asteroid as hazardous or non-hazardous.
You can find the dataset here.

ASTEROID DATASET

The data describes near-Earth asteroids and is provided by NeoWs (Near Earth Object Web Service).

The dataset consists of 4687 data instances (rows) and 40 features (columns). There are no null values in the dataset.
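The shape and null-value claims above can be checked with a couple of pandas calls. The sketch below is only illustrative: it reads a tiny in-memory stand-in with made-up placeholder values, and the file name `nasa.csv` mentioned in the comment is an assumption about how the Kaggle download is saved locally.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle CSV. The rows are made-up placeholder values,
# not real records from the dataset.
csv_text = """Neo Reference ID,Name,Absolute Magnitude,Hazardous
1001,1001,21.6,True
1002,1002,21.3,False
1003,1003,20.3,True
"""

# For the real data this would be something like pd.read_csv("nasa.csv"),
# where "nasa.csv" is a hypothetical local file name.
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape                    # the article reports 4687 rows, 40 columns
total_nulls = int(df.isnull().sum().sum())   # 0 means there are no missing values

print(f"{n_rows} rows, {n_cols} columns, {total_nulls} missing values")
```

On the full dataset, the same three lines should report the row count, column count and number of missing values quoted above.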

Descriptions of some of the features are given below:

  1. ‘Neo Reference ID’: This feature denotes the reference ID assigned to an asteroid.
  2. ‘Name’: This feature denotes the name given to an asteroid.
  3. ‘Absolute Magnitude’: This feature denotes the absolute magnitude of an asteroid. An asteroid’s absolute magnitude is the visual magnitude an observer would record if the asteroid were placed 1 Astronomical Unit (AU) from both the observer and the Sun, at zero phase angle.
  4. ‘Est Dia in KM(min)’: This feature denotes the minimum estimated diameter of the asteroid in kilometres (KM).
  5. ‘Est Dia in M(min)’: This feature denotes the minimum estimated diameter of the asteroid in meters (M).
  6. ‘Relative Velocity km per sec’: This feature denotes the relative velocity of the asteroid in kilometres per second.
  7. ‘Relative Velocity km per hr’: This feature denotes the relative velocity of the asteroid in kilometres per hour.
  8. ‘Orbiting Body’: This feature denotes the planet around which the asteroid is revolving.
  9. ‘Jupiter Tisserand Invariant’: This feature denotes Tisserand’s parameter for the asteroid with respect to Jupiter. Tisserand’s parameter (or Tisserand’s invariant) is a value calculated from several orbital elements (semi-major axis, orbital eccentricity, and inclination) of a relatively small object and a more massive ‘perturbing body’. It is used to distinguish different kinds of orbits.
  10. ‘Eccentricity’: This feature denotes the eccentricity of the asteroid’s orbit. Just like those of many other bodies in the solar system, the orbits of asteroids are not perfect circles but ellipses. Eccentricity is a measure of how far from circular an orbit is: the smaller the eccentricity, the more circular the orbit.
  11. ‘Semi Major Axis’: This feature denotes the semi-major axis of the asteroid’s orbit. As discussed above, the orbit of an asteroid is elliptical rather than circular. The semi-major axis is half of the longest diameter of that ellipse.
  12. ‘Orbital Period’: This feature denotes the value of the orbital period of the asteroid. Orbital period refers to the time taken by the asteroid to make one full revolution around its orbiting body.
  13. ‘Perihelion Distance’: This feature denotes the value of the Perihelion distance of the asteroid. For a body orbiting the Sun, the point of least distance is the perihelion.
  14. ‘Aphelion Dist’: This feature denotes the value of Aphelion distance of the asteroid. For a body orbiting the Sun, the point of greatest distance is the aphelion.
  15. ‘Hazardous’: This feature denotes whether the asteroid is hazardous or not.
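Several of the orbital features above are tied together by standard two-body relations, which can be sketched as follows. The function names are hypothetical helpers, and the period formula assumes a small body orbiting the Sun with the semi-major axis in AU (Kepler's third law); the units used by the dataset columns are not confirmed here.

```python
def perihelion_distance(semi_major_axis_au: float, eccentricity: float) -> float:
    """Closest distance to the Sun: q = a * (1 - e)."""
    return semi_major_axis_au * (1.0 - eccentricity)


def aphelion_distance(semi_major_axis_au: float, eccentricity: float) -> float:
    """Farthest distance from the Sun: Q = a * (1 + e)."""
    return semi_major_axis_au * (1.0 + eccentricity)


def orbital_period_years(semi_major_axis_au: float) -> float:
    """Kepler's third law for a body orbiting the Sun: T^2 = a^3 (T in years, a in AU)."""
    return semi_major_axis_au ** 1.5


# Example: an orbit with a = 4 AU and e = 0.5
a, e = 4.0, 0.5
print(perihelion_distance(a, e), aphelion_distance(a, e), orbital_period_years(a))
```

These relations also hint that some of the orbital features carry overlapping information, since perihelion and aphelion distance follow from the semi-major axis and eccentricity.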

To sum up, the features present in the dataset cover not only the size of the asteroid but also its path and speed.

THE APPROACH

It is worth noting that larger asteroids are generally more likely to be hazardous than comparatively smaller ones.
If we take the mean diameter of the asteroids labelled as hazardous in this dataset, it turns out to be 0.70 km. In contrast, the mean diameter of the non-hazardous asteroids turns out to be 0.40 km.
Hence, the dataset is consistent with this general expectation.
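A comparison like the one above can be sketched with boolean masks in pandas. The rows below are made-up placeholder values, and using the midpoint of the min and max diameter estimates is an assumption; the article does not say which diameter column was averaged.

```python
import pandas as pd

# Made-up placeholder rows; the column names follow the dataset's naming.
df = pd.DataFrame({
    "Est Dia in KM(min)": [0.10, 0.20, 0.50, 0.30],
    "Est Dia in KM(max)": [0.30, 0.40, 1.10, 0.70],
    "Hazardous": [False, False, True, True],
})

# Assumption: summarise each asteroid by the midpoint of its min/max estimate.
df["diameter_km"] = (df["Est Dia in KM(min)"] + df["Est Dia in KM(max)"]) / 2

is_hazardous = df["Hazardous"]
hazardous_mean = df.loc[is_hazardous, "diameter_km"].mean()
non_hazardous_mean = df.loc[~is_hazardous, "diameter_km"].mean()

print(f"hazardous: {hazardous_mean:.2f} km, non-hazardous: {non_hazardous_mean:.2f} km")
```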

Let us begin.

Feature Engineering

As one can see, the dataset contains several unnecessary features that hardly contribute to classification.
The features ‘Name’ and ‘Neo Reference ID’ both hold the identification number given to an asteroid. These features are not useful for the machine learning model, since an identifier says nothing about whether the asteroid is hazardous. Also, both features contain the same values.
Thus, we can delete both features.

The feature ‘Close Approach Date’ is also unnecessary, since it only gives the date on which the asteroid makes its close approach to Earth. That date tells us ‘when’ the asteroid is nearby, not ‘whether’ it is hazardous. Thus, we delete this feature as well.
For a similar reason, we will also delete the ‘Orbit Determination Date’ feature.
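The intuition-based removals so far can be sketched as below. The frame is a made-up stand-in that uses the column names mentioned above; the check that ‘Name’ and ‘Neo Reference ID’ agree row by row mirrors the claim in the text rather than adding a new one.

```python
import pandas as pd

# Made-up placeholder rows using the column names discussed above.
df = pd.DataFrame({
    "Neo Reference ID": [1001, 1002, 1003],
    "Name": [1001, 1002, 1003],
    "Close Approach Date": ["1995-01-01", "1995-01-08", "1995-01-15"],
    "Orbit Determination Date": ["2017-04-06", "2017-04-06", "2017-04-06"],
    "Absolute Magnitude": [21.6, 21.3, 20.3],
})

# Confirm that the two identifier columns hold the same values in every row.
ids_match = bool((df["Name"] == df["Neo Reference ID"]).all())

# Identifiers and dates say nothing about whether an asteroid is hazardous.
columns_to_drop = [
    "Neo Reference ID",
    "Name",
    "Close Approach Date",
    "Orbit Determination Date",
]
df = df.drop(columns=columns_to_drop)

print(ids_match, list(df.columns))
```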

Now, let us look at the ‘Orbiting Body’ feature. It contains only one value, “Earth”. A feature with a single value gives a machine learning model nothing to separate the classes with, so we delete it too.
Similarly, the feature ‘Equinox’ contains only one value, ‘J2000’, so we delete it as well.
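Rather than inspecting each column by eye, single-valued features can be found programmatically. This is a minimal sketch on made-up placeholder rows, not necessarily how the removal was done for this article.

```python
import pandas as pd

# Made-up placeholder rows; "Orbiting Body" and "Equinox" are constant, as in the text.
df = pd.DataFrame({
    "Orbiting Body": ["Earth", "Earth", "Earth"],
    "Equinox": ["J2000", "J2000", "J2000"],
    "Absolute Magnitude": [21.6, 21.3, 20.3],
})

# Count distinct values per column and keep the names of columns with at most one.
unique_counts = df.nunique()
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

df = df.drop(columns=constant_columns)

print(constant_columns, list(df.columns))
```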

Consider the features:

‘Est Dia in KM(min)’, ‘Est Dia in KM(max)’,
‘Est Dia in M(min)’, ‘Est Dia in M(max)’,
‘Est Dia in Miles(min)’, ‘Est Dia in Miles(max)’,
‘Est Dia in Feet(min)’, ‘Est Dia in Feet(max)’

All these features represent the estimated diameter of the asteroid in different units (KM = kilometre, M = meter, and so on). This is an excellent example of redundant data, since the same quantity is represented several times. Such redundancy should be removed. The beauty of statistical analysis is that it flags such redundancy in the dataset, even if a data scientist misses it. Let us not remove these features now, but have them identified statistically.
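One way such redundancy shows up statistically is as a correlation of 1 between columns that differ only by a unit conversion. The sketch below builds made-up diameters in four units plus one unrelated placeholder column, then flags every column that is almost perfectly correlated with an earlier one; the 0.999 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up diameters in km, converted to the other units used by the dataset.
km_min = pd.Series([0.10, 0.25, 0.40, 0.80])
df = pd.DataFrame({
    "Est Dia in KM(min)": km_min,
    "Est Dia in M(min)": km_min * 1000.0,
    "Est Dia in Miles(min)": km_min * 0.621371,
    "Est Dia in Feet(min)": km_min * 3280.84,
    "Relative Velocity km per sec": [6.1, 18.1, 7.6, 11.2],  # unrelated placeholder
})

# Absolute Pearson correlations between every pair of columns.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is flagged if it is (almost) perfectly correlated with an earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.999).any()]

print(redundant)
```

Dropping the flagged columns keeps one representative of each perfectly correlated group, here the kilometre column.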

The removal of the above features so far was intuition-based. Now let us look at the statistical analysis and find out which features are statistically relevant.

Statistical Analysis

Before proceeding, let us look at the ‘Hazardous’ feature. Its values are ‘TRUE’ or ‘FALSE’, so we encode these as 1 and 0, respectively.
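The encoding step might look like the sketch below. Whether pandas loads the column as booleans or as the strings ‘TRUE’/‘FALSE’ depends on how the file is read, so the hypothetical helper `encode_hazardous` handles both cases.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def encode_hazardous(column: pd.Series) -> pd.Series:
    """Map TRUE/FALSE labels to 1/0, whether stored as booleans or strings."""
    if is_bool_dtype(column):
        return column.astype(int)
    # Fall back to string labels such as "TRUE", "True" or "false".
    return column.astype(str).str.upper().map({"TRUE": 1, "FALSE": 0})


# Made-up placeholder labels.
df = pd.DataFrame({"Hazardous": [True, False, False, True]})
df["Hazardous"] = encode_hazardous(df["Hazardous"])

print(df["Hazardous"].tolist())
```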

Now, let us form the correlation matrix of the dataset.
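A correlation matrix can be formed with `DataFrame.corr()`, which computes Pearson correlations by default. The sketch below uses made-up placeholder rows with an already encoded ‘Hazardous’ column and then ranks the remaining features by the strength of their correlation with the target; it illustrates the step and does not reproduce results from the real dataset.

```python
import pandas as pd

# Made-up placeholder rows; "Hazardous" is already encoded as 1/0.
df = pd.DataFrame({
    "Absolute Magnitude": [18.5, 24.0, 19.2, 25.1, 20.0],
    "Est Dia in KM(min)": [0.50, 0.04, 0.40, 0.02, 0.30],
    "Hazardous": [1, 0, 1, 0, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df.corr()

# Strength of each feature's correlation with the target, strongest first.
target_corr = (
    corr_matrix["Hazardous"]
    .drop("Hazardous")
    .abs()
    .sort_values(ascending=False)
)

print(corr_matrix.round(2))
print(target_corr.round(2))
```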

Incomplete story. Keep checking for updated sections.

Harun Mwenda Mbaabu
Data Science East Africa

Software Engineer || Data Scientist || Building Data Science East Africa && Lux Tech Academy