Cleaning, transformation, and analysis of large datasets as part of coursework for UCS1629: Data Warehousing and Data Mining.

karthik-d/Data-Mining_Preprocessing-Analysis

Data-Mining_Preprocessing-Analysis

Cleaning, transformation, and analysis of large datasets as part of coursework for UCS1629: Data Warehousing and Data Mining.

Dataset

This dataset is a subset of the NYC yellow taxi cab trips dataset from the Google BigQuery public datasets, containing 10,000,000 randomly sampled rows. It was extracted and uploaded for experimenting with and learning regression models for price prediction. The data contains outliers and leaves considerable room for cleaning, and it is large enough to support more realistic model training, testing, and validation.

The analyzed subset of the data is publicly accessible through Kaggle.

Data Attributes

| column | type | nullable | description |
| --- | --- | --- | --- |
| vendor_id | text | required | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc |
| pickup_datetime | datetime | nullable | The date and time when the meter was engaged. |
| dropoff_datetime | datetime | nullable | The date and time when the meter was disengaged. |
| passenger_count | integer | nullable | The number of passengers in the vehicle. This is a driver-entered value. |
| trip_distance | numeric | nullable | The elapsed trip distance in miles reported by the taximeter. |
| rate_code | string | nullable | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride |
| storeandfwd_flag | string | nullable | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip |
| payment_type | string | nullable | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip |
| fare_amount | numeric | nullable | The time-and-distance fare calculated by the meter. |
| extra | numeric | nullable | Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges. |
| mta_tax | numeric | nullable | \$0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| tip_amount | numeric | nullable | Tip amount. This field is automatically populated for credit card tips; cash tips are not included. |
| tolls_amount | numeric | nullable | Total amount of all tolls paid in the trip. |
| imp_surcharge | numeric | nullable | \$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| total_amount | numeric | nullable | The total amount charged to passengers. Does not include cash tips. |
| pickuplocationid | string | nullable | TLC Taxi Zone in which the taximeter was engaged. |
| dropofflocationid | string | nullable | TLC Taxi Zone in which the taximeter was disengaged. |
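
As a rough illustration of how the coded columns above might be used, the sketch below maps `vendor_id`, `rate_code`, and `payment_type` to the labels from the table and flags rows that look implausible. It is not the coursework's actual cleaning procedure: the function names `decode` and `quality_flags`, the timestamp format, the specific plausibility rules, and the sample row are all assumptions made up for this example.

```python
from datetime import datetime

# Code tables copied from the attribute descriptions above.
VENDORS = {"1": "Creative Mobile Technologies, LLC", "2": "VeriFone Inc"}
RATE_CODES = {
    "1": "Standard rate", "2": "JFK", "3": "Newark",
    "4": "Nassau or Westchester", "5": "Negotiated fare", "6": "Group ride",
}
PAYMENT_TYPES = {
    "1": "Credit card", "2": "Cash", "3": "No charge",
    "4": "Dispute", "5": "Unknown", "6": "Voided trip",
}

# Assumed timestamp layout; adjust it to match the actual export.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def decode(row):
    """Return a copy of the row with coded columns replaced by labels."""
    out = dict(row)
    out["vendor_id"] = VENDORS.get(str(row.get("vendor_id")), "unrecognized")
    out["rate_code"] = RATE_CODES.get(str(row.get("rate_code")), "unrecognized")
    out["payment_type"] = PAYMENT_TYPES.get(str(row.get("payment_type")), "unrecognized")
    return out


def quality_flags(row):
    """List reasons a row looks suspicious (illustrative rules, not the author's)."""
    checked = ("pickup_datetime", "dropoff_datetime", "trip_distance",
               "fare_amount", "passenger_count")
    # Most columns are nullable, so report missing values before parsing.
    missing = [col for col in checked if row.get(col) in (None, "")]
    if missing:
        return ["missing_" + col for col in missing]

    flags = []
    pickup = datetime.strptime(row["pickup_datetime"], TIME_FORMAT)
    dropoff = datetime.strptime(row["dropoff_datetime"], TIME_FORMAT)
    if dropoff <= pickup:
        flags.append("non_positive_duration")
    if float(row["trip_distance"]) <= 0:
        flags.append("non_positive_distance")
    if float(row["fare_amount"]) < 0:
        flags.append("negative_fare")
    if int(row["passenger_count"]) < 1:
        flags.append("no_passengers")
    return flags


# Invented sample row, for demonstration only.
sample = {
    "vendor_id": "2",
    "pickup_datetime": "2018-03-01 08:15:00",
    "dropoff_datetime": "2018-03-01 08:05:00",
    "passenger_count": "1",
    "trip_distance": "0.0",
    "rate_code": "1",
    "payment_type": "2",
    "fare_amount": "7.5",
}

print(decode(sample)["payment_type"])  # Cash
print(quality_flags(sample))  # ['non_positive_duration', 'non_positive_distance']
```

Rows are plain dictionaries here so the sketch runs with the standard library alone; the same lookups and checks carry over to a DataFrame if the full 10,000,000-row file is loaded with a data-analysis library.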

Basic Analysis Steps

Data Cleaning Steps

Data Transformation Steps
