privkit.data package#
- class privkit.data.DataType#
Bases: ABC
DataType is an abstract class for a generic type of data. Defines a series of methods common to all data types. Provides basic functions to load, process, and save data. Requires the definition of the DATA_TYPE_ID, DATA_TYPE_NAME, and DATA_TYPE_INFO.
- property DATA_TYPE_ID: str#
Identifier of the data type
- property DATA_TYPE_INFO: str#
Information of the data type, specifically the format of the files to be read
- property DATA_TYPE_NAME: str#
Name of the data type
- abstract load_data(*args)#
Loads data. This is specific to the data type
- abstract process_data(*args)#
Performs data processing or returns data processing methods. This is specific to the data type
- abstract save_data(*args)#
Saves data to a file. This is specific to the data type
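The contract described above can be illustrated with a stand-in base class. The sketch below does not import privkit: `DataTypeSketch` and `ListData` are hypothetical names that mirror only the documented interface (three identifying properties and three abstract methods), so the real class may differ in detail.

```python
from abc import ABC, abstractmethod


class DataTypeSketch(ABC):
    """Hypothetical stand-in mirroring the documented DataType interface."""

    @property
    @abstractmethod
    def DATA_TYPE_ID(self) -> str: ...

    @property
    @abstractmethod
    def DATA_TYPE_NAME(self) -> str: ...

    @property
    @abstractmethod
    def DATA_TYPE_INFO(self) -> str: ...

    @abstractmethod
    def load_data(self, *args): ...

    @abstractmethod
    def process_data(self, *args): ...

    @abstractmethod
    def save_data(self, *args): ...


class ListData(DataTypeSketch):
    """Hypothetical concrete type that keeps its data in a plain list."""

    # Plain class attributes satisfy the abstract properties.
    DATA_TYPE_ID = "list_data"
    DATA_TYPE_NAME = "List Data"
    DATA_TYPE_INFO = "Data is supplied as any Python iterable."

    def __init__(self):
        self.data = []

    def load_data(self, items):
        self.data = list(items)

    def process_data(self):
        self.data.sort()

    def save_data(self, filepath):
        with open(filepath, "w") as f:
            f.writelines(f"{item}\n" for item in self.data)
```

Instantiating `DataTypeSketch` directly raises `TypeError`, while `ListData()` works because every abstract member has been defined.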
- class privkit.data.FacialData(id_name: str | None = None)#
Bases: DataType
FacialData is a privkit.DataType to handle facial data. Facial data is defined as a collection of points in 3D space, where each point is represented by its coordinates (x, y, z) and, optionally, additional attributes like color or normal vectors. It is stored as an Open3D data structure with the PointCloud class.
- DATA_TYPE_ID = 'facial_data'#
- DATA_TYPE_INFO = 'Facial data can be imported through an Open3D data structure or read by a PLY file. To be supported, data should contain at least one point (x, y, z).'#
- DATA_TYPE_NAME = 'Facial Data'#
- crop_pcd(xmin: float, xmax: float, ymin: float, ymax: float, zmin: float, zmax: float)#
Segment the point cloud with a bounding box
- Parameters:
xmin (float) – minimum x-coordinate of the bounding box
xmax (float) – maximum x-coordinate of the bounding box
ymin (float) – minimum y-coordinate of the bounding box
ymax (float) – maximum y-coordinate of the bounding box
zmin (float) – minimum z-coordinate of the bounding box
zmax (float) – maximum z-coordinate of the bounding box
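As an illustration of bounding-box segmentation, the following sketch filters plain (x, y, z) tuples. It is not privkit's implementation: `crop_points` is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
def crop_points(points, xmin, xmax, ymin, ymax, zmin, zmax):
    """Keep the (x, y, z) points that lie inside the axis-aligned box."""
    return [
        (x, y, z)
        for x, y, z in points
        # Inclusive bounds are an assumption of this sketch.
        if xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
    ]


points = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (2.0, 0.0, 0.0)]
cropped = crop_points(points, -1, 1, -1, 1, -1, 1)
```

Here the third point is discarded because its x-coordinate exceeds `xmax`.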
- data#
Facial data is stored as an Open3D data structure
- fp_downsample(N: int)#
Downsample the point cloud with the Farthest Point Sampling technique
- Parameters:
N (int) – number of points in the sampled point cloud
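Farthest Point Sampling repeatedly adds the point that lies farthest from everything selected so far. The sketch below shows the idea on plain tuples; `farthest_point_sample` and its `start` argument are hypothetical, and the library may pick the first point differently.

```python
import math


def farthest_point_sample(points, n, start=0):
    """Return the indices of n points chosen by farthest point sampling."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < n:
        # Pick the point that is farthest from the current selection.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return selected


points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
sample = farthest_point_sample(points, 3)  # indices into points
```

Starting from index 0, the sketch selects the far end of the line first and then the point midway, which is why the sampled points spread evenly over the cloud.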
- get_color()#
Returns a boolean value indicating whether the point cloud has color
- Returns:
True if the point cloud has color, False otherwise
- get_number_of_points()#
Returns the number of points of the point cloud
- Returns:
number of points of the point cloud
- get_point_mean_std()#
Returns the average coordinate of the point cloud points along with the standard deviation
- Returns:
average point cloud coordinate and standard deviation
- get_point_median()#
Returns the median coordinate of the point cloud points
- Returns:
median point cloud coordinate
- id_name#
Identifier name of this facial data instance
- load_data(pcd_or_filepath: str)#
Loads facial data from a PointCloud, an array of shape (number of points, 3), or a file that can be read using Open3D’s read_point_cloud() method.
- Parameters:
pcd_or_filepath (PointCloud or array or str or Path) – either a PointCloud instance, a file path to a point cloud, or an array with the point coordinates
- print_data_summary()#
Prints data summary, specifying the number of points, and color availability
- process_data()#
Performs data processing or returns data processing methods. This is specific to the data type
- remove_outliers_statistical(nb_neighbors: float, std_ratio: float)#
Removes outliers from the point cloud based on the distance to neighboring points
- Parameters:
nb_neighbors (float) – number of neighbors for outlier detection
std_ratio (float) – standard deviation ratio used to compute the outlier threshold
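One common form of statistical outlier removal computes, for each point, the mean distance to its nearest neighbors and discards points whose value exceeds the global mean by more than `std_ratio` standard deviations. The sketch below assumes that scheme; `remove_statistical_outliers` is a hypothetical helper, uses a brute-force neighbor search, and is not claimed to match the library's internals.

```python
import math
import statistics


def remove_statistical_outliers(points, nb_neighbors, std_ratio):
    """Drop points whose mean nearest-neighbor distance is unusually large."""
    mean_dists = []
    for i, p in enumerate(points):
        # Brute-force neighbor search: fine for a sketch, quadratic in size.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:nb_neighbors]
        mean_dists.append(sum(nearest) / len(nearest))
    # Threshold: global mean plus std_ratio standard deviations.
    threshold = statistics.mean(mean_dists) + std_ratio * statistics.pstdev(mean_dists)
    return [p for p, d in zip(points, mean_dists) if d <= threshold]


points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (50, 50, 50)]
kept = remove_statistical_outliers(points, nb_neighbors=2, std_ratio=1.0)
```

A smaller `std_ratio` makes the filter more aggressive, since the threshold moves closer to the mean.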
- remove_points_outside_sphere(center: numpy.ndarray, radius: float)#
Segment the point cloud with a sphere
- Parameters:
center (ndarray) – coordinates of the center of the sphere
radius (float) – radius of the sphere
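Segmenting with a sphere amounts to a distance test against the center. A minimal sketch on plain tuples follows; `keep_points_inside_sphere` is a hypothetical name, and keeping points that lie exactly on the surface is an assumption.

```python
import math


def keep_points_inside_sphere(points, center, radius):
    """Keep the points whose distance to center is at most radius."""
    return [p for p in points if math.dist(p, center) <= radius]


points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.9), (0.0, 2.0, 0.0)]
inside = keep_points_inside_sphere(points, center=(0.0, 0.0, 0.0), radius=1.0)
```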
- save_data(filepath: str = './input/data/', filename: str | None = None, extension: str = 'ply')#
Saves data to a file.
- Parameters:
filepath (str) – path where data should be saved.
filename (str) – name of the file to be saved.
extension (str) – extension of the format of how the file should be saved. The default value is ‘ply’.
- class privkit.data.LocationData(id_name: str | None = None)#
Bases: DataType
LocationData is a privkit.DataType to handle location data. Location data is defined by <latitude, longitude> coordinates (and optionally a datetime) and is stored as a Pandas DataFrame.
- DATA_TYPE_ID = 'location_data'#
- DATA_TYPE_INFO = 'Location data can be imported through a Pandas dataframe or a read by a delimited file, a file-like object or an object. To be supported, data should contain at least one point (latitude, longitude). '#
- DATA_TYPE_NAME = 'Location Data'#
- average_update_rate()#
Returns the average update rate (i.e. the timedelta between subsequent points)
- Returns:
average update rate
- compute_timedelta(time_unit: str = 's', boxplot: bool = False) → List#
Computes the timedelta between the datetime of the trajectory points
- Parameters:
time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).
boxplot (bool) – if True, a boxplot is generated
- Returns:
list of timedelta values
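The timedelta computation can be pictured as differencing consecutive datetimes and converting to the requested unit. The sketch below works on a plain list rather than a DataFrame; `consecutive_timedeltas` and the unit codes `"m"` and `"h"` are assumptions, since only `'s'` appears in the signature above.

```python
from datetime import datetime

# Seconds per unit; the codes other than "s" are assumed for illustration.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def consecutive_timedeltas(datetimes, time_unit="s"):
    """Return the gaps between consecutive datetimes in the given unit."""
    factor = UNIT_SECONDS[time_unit]
    return [
        (later - earlier).total_seconds() / factor
        for earlier, later in zip(datetimes, datetimes[1:])
    ]


stamps = [
    datetime(2024, 1, 1, 8, 0, 0),
    datetime(2024, 1, 1, 8, 0, 30),
    datetime(2024, 1, 1, 8, 2, 0),
]
deltas = consecutive_timedeltas(stamps)
# Averaging the gaps gives the update rate of this trajectory.
average = sum(deltas) / len(deltas)
```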
- create_grid(min_lat: float, max_lat: float, min_lon: float, max_lon: float, spacing: float, timestamp: int | None = None)#
Discretizes the space defined by the min and max latitude and longitude of the location data
- Parameters:
min_lat (float) – minimum latitude coordinate
max_lat (float) – maximum latitude coordinate
min_lon (float) – minimum longitude coordinate
max_lon (float) – maximum longitude coordinate
spacing (float) – grid cell spacing in meters
timestamp (int) – time interval
- data#
Location data is stored as a pd.DataFrame
- divide_data(test_size: float = 0.2)#
Divides data into train and test
- Parameters:
test_size (float) – size of test data. The default is 0.2.
- filter_by_distance(min_distance: float = 0, max_distance: float = 2000)#
Filters trajectories by distance to avoid either extremely long or short trajectories
- Parameters:
min_distance (float) – minimum distance that the trajectory must have. The default is 0 meters.
max_distance (float) – maximum distance that the trajectory must have. The default is 2000 meters = 2 km.
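A distance filter of this kind needs a length for each trajectory. The sketch below assumes that length is the sum of great-circle (haversine) distances between consecutive points; the helper names are hypothetical and the library may measure distance differently.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trajectory_length_m(points):
    """Sum of the distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


def filter_by_length(trajectories, min_distance=0, max_distance=2000):
    """Keep trajectories whose length lies within [min_distance, max_distance]."""
    return {
        tid: pts
        for tid, pts in trajectories.items()
        if min_distance <= trajectory_length_m(pts) <= max_distance
    }


trajectories = {
    1: [(41.150, -8.610), (41.151, -8.610)],  # roughly 111 m
    2: [(41.150, -8.610), (41.250, -8.610)],  # roughly 11 km
}
kept = filter_by_length(trajectories)
```

With the default bounds, only the first trajectory survives because the second is far longer than 2 km.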
- filter_by_duration(min_duration: float = 60, max_duration: float = 7200, time_unit: str = 's')#
Filters trajectories by duration to avoid either extremely long or short trajectories
- Parameters:
min_duration (float) – minimum duration that the trajectory must have. The default is 60 seconds.
max_duration (float) – maximum duration that the trajectory must have. The default is 2 hours = 2x3600 seconds.
time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).
- filter_by_timedelta(timedelta: float, time_unit: str = 's')#
Filters trajectories by timedelta to avoid discontinuity between points
- Parameters:
timedelta (float) – defines the maximum timedelta between subsequent points
time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).
- filter_outside_points(min_latitude: float, max_latitude: float, min_longitude: float, max_longitude: float)#
Filters all location points that fall outside the given latitude/longitude grid/bounding-box. Note: this can produce time gaps
- Parameters:
min_latitude (float) – minimum latitude coordinate
max_latitude (float) – maximum latitude coordinate
min_longitude (float) – minimum longitude coordinate
max_longitude (float) – maximum longitude coordinate
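The note about time gaps is easiest to see on a small example. The sketch below filters (latitude, longitude, datetime) tuples with a hypothetical `inside_box` helper, assuming inclusive bounds, and then measures the gap left behind.

```python
from datetime import datetime


def inside_box(points, min_lat, max_lat, min_lon, max_lon):
    """Keep (lat, lon, datetime) points that fall inside the bounding box."""
    return [
        p for p in points
        if min_lat <= p[0] <= max_lat and min_lon <= p[1] <= max_lon
    ]


points = [
    (41.15, -8.61, datetime(2024, 1, 1, 8, 0)),
    (41.90, -8.61, datetime(2024, 1, 1, 8, 1)),  # outside the box
    (41.16, -8.60, datetime(2024, 1, 1, 8, 2)),
]
kept = inside_box(points, 41.0, 41.5, -9.0, -8.0)
# Dropping the middle point leaves a 120 s gap between the remaining two.
gap = (kept[1][2] - kept[0][2]).total_seconds()
```

The points were originally one minute apart, so a later timedelta-based filter may treat the two survivors as discontinuous.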
- generate_ground_truth(mechanism: str, G: networkx.MultiDiGraph | None = None, sigma: float = 6.86, error_range: float = 50, lambda_y: float = 0.69, lambda_z: float = 13.35)#
Generates ground truth by applying a mechanism.
For the Map-Matching mechanism, the following references were used:
[1] Goh, C. Y., Dauwels, J., Mitrovic, N., Asif, M. T., Oran, A., & Jaillet, P. (2012, September). Online map-matching based on hidden markov model for real-time traffic sensing applications. In 2012 15th International IEEE Conference on Intelligent Transportation Systems (pp. 776-781). IEEE.
[2] Jagadeesh, G. R., & Srikanthan, T. (2017). Online map-matching of noisy and sparse location data with hidden Markov and route choice models. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2423-2434.
- Parameters:
mechanism (str) – mechanism identifier that should be used to generate ground truth data
G (networkx.MultiDiGraph) – road network represented as a directed graph
sigma (float) – standard deviation of the location measurement noise in meters. The default value is 6.86 [1].
error_range (float) – defines the range to search for states for each observation point. The default value is 50 [1].
lambda_y (float) – multiplier of circuitousness used to compute the probability of transition. The default value is 0.69 [2].
lambda_z (float) – multiplier of temporal implausibility used to compute the probability of transition. The default value is 13.35 [2].
- get_bounding_box_range()#
Returns the minimum and maximum latitude/longitude covered by the data
- Returns:
min_latitude, max_latitude, min_longitude, max_longitude
- get_datetime_range()#
Returns the datetime range of the data
- Returns:
minimum datetime, maximum datetime
- get_number_of_trajectories()#
Returns the number of trajectories
- Returns:
number of trajectories
- get_number_of_users()#
Returns the number of users
- Returns:
number of users
- get_test_data()#
Returns test data
- Returns:
test data
- get_train_data()#
Returns train data
- Returns:
train data
- get_trajectories(user_id: int | None = None, trajectory_id: int | None = None)#
Gets trajectories from the given location data grouped by trajectory id and/or user id.
- Parameters:
user_id (int) – user identifier whose trajectories should be returned.
trajectory_id (int) – trajectory identifier to return
- Returns:
trajectories
- Return type:
pd.DataFrameGroupBy
- get_trajectory_statistics()#
Returns trajectory statistics, specifically the total number of points, the minimum and maximum number of points per trajectory, and the average number of points per trajectory.
- Returns:
total_number_of_points, min_number_of_points, max_number_of_points, average_number_of_points
- id_name#
Identifier name of this location data instance
- load_data(data_to_load: str, latitude: str = 'lat', longitude: str = 'lon', datetime: str = 'datetime', user_id: str = 'uid', trajectory_id: str = 'tid', save: bool = False, **kwargs)#
Loads location data from a pd.DataFrame or a file that can be read by one of the Pandas read methods. The parameter definitions follow those of Pandas.
- Parameters:
data_to_load (DataFrame or str or Path or object) – either a Pandas DataFrame, a path to a file (str or Path), or any object that can be read by the Pandas read methods.
latitude (str or int) – the position or the name of the column containing the latitude. The default is constants.LATITUDE.
longitude (str or int) – the position or the name of the column containing the longitude. The default is constants.LONGITUDE.
datetime (str or int) – the position or the name of the column containing the datetime. The default is constants.DATETIME.
user_id (str or int) – the position or the name of the column containing the user id. The default is constants.UID.
trajectory_id (str or int) – the position or the name of the column containing the trajectory id. The default is constants.TID.
save (bool) – if True, data is saved to a file. The default is False.
kwargs – parameters for the Pandas read methods.
- mm_data_processing(G: networkx.MultiDiGraph, sigma: float = 6.86, error_range: float = 50, lambda_y: float = 0.69, lambda_z: float = 13.35)#
Applies map-matching as a pre-processing method to generate the ground truth. The default values come from the following papers:
[1] Goh, C. Y., Dauwels, J., Mitrovic, N., Asif, M. T., Oran, A., & Jaillet, P. (2012, September). Online map-matching based on hidden markov model for real-time traffic sensing applications. In 2012 15th International IEEE Conference on Intelligent Transportation Systems (pp. 776-781). IEEE.
[2] Jagadeesh, G. R., & Srikanthan, T. (2017). Online map-matching of noisy and sparse location data with hidden Markov and route choice models. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2423-2434.
- Parameters:
G (networkx.MultiDiGraph) – road network represented as a directed graph
sigma (float) – standard deviation of the location measurement noise in meters. The default value is 6.86 [1].
error_range (float) – defines the range to search for states for each observation point. The default value is 50 [1].
lambda_y (float) – multiplier of circuitousness used to compute the probability of transition. The default value is 0.69 [2].
lambda_z (float) – multiplier of temporal implausibility used to compute the probability of transition. The default value is 13.35 [2].
- original_data#
Original location data is stored as a pd.DataFrame
- print_data_summary(dataset_name: str | None = None)#
Prints data summary, specifying the number of users, trajectories, and other statistics
- Parameters:
dataset_name (str) – dataset name (optional). The default value is None.
- print_statistics_by_user(dataset_name: str | None = None)#
Prints statistics of data by user, specifying the number of trajectories, points, and other statistics per user
- Parameters:
dataset_name (str) – dataset name (optional). The default value is None.
- process_data()#
Performs location data processing. The first step consists of sorting location data by datetime (if user id and trajectory id are columns of the dataframe, it first sorts location data by uid and/or tid).
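The sorting step described above can be sketched without Pandas. The example below uses a list of dicts keyed by the default column names from load_data (`uid`, `tid`, `datetime`); `sort_rows` is a hypothetical helper, not the library's code.

```python
from datetime import datetime

rows = [
    {"uid": 2, "tid": 1, "datetime": datetime(2024, 1, 1, 9, 0)},
    {"uid": 1, "tid": 2, "datetime": datetime(2024, 1, 1, 8, 5)},
    {"uid": 1, "tid": 1, "datetime": datetime(2024, 1, 1, 8, 10)},
    {"uid": 1, "tid": 1, "datetime": datetime(2024, 1, 1, 8, 0)},
]


def sort_rows(rows):
    """Sort by uid and tid when those columns are present, then by datetime."""
    keys = [k for k in ("uid", "tid") if all(k in r for r in rows)]
    keys.append("datetime")
    return sorted(rows, key=lambda r: tuple(r[k] for k in keys))


ordered = sort_rows(rows)
```

Sorting by identifier first keeps each trajectory contiguous, so differences between adjacent rows never mix points from different users.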
- resample(user_id: int, start_index: int, end_index: int)#
Given two indexes, calculates the center coordinates as the mean of all reports within the interval given by the indexes.
- Parameters:
user_id (int) – user whose reports are being resampled
start_index (int) – starting index of the time interval
end_index (int) – ending index of the time interval
- Returns:
mean of the reports in the range of the two given indexes
- resample_by_time(R: int)#
Resamples the location reports on the time axis to address their non-uniform distribution over time. Within each resampling interval R, calculates the center coordinates as the mean of all reports within that interval.
- Parameters:
R (int) – resampling interval in minutes
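Time-based resampling of this kind can be sketched by bucketing reports into windows of R minutes and averaging each bucket. In the example below, aligning the windows to the first report is an assumption, and `resample_by_time` is a standalone hypothetical function operating on (latitude, longitude, datetime) tuples.

```python
from collections import defaultdict
from datetime import datetime


def resample_by_time(reports, interval_minutes):
    """Average (lat, lon) over fixed windows measured from the first report."""
    reports = sorted(reports, key=lambda r: r[2])
    start = reports[0][2]
    window_s = interval_minutes * 60
    buckets = defaultdict(list)
    for lat, lon, ts in reports:
        # Index of the window this report falls into.
        buckets[int((ts - start).total_seconds() // window_s)].append((lat, lon))
    return [
        (sum(lat for lat, _ in pts) / len(pts), sum(lon for _, lon in pts) / len(pts))
        for _, pts in sorted(buckets.items())
    ]


reports = [
    (10.0, 20.0, datetime(2024, 1, 1, 8, 0)),
    (12.0, 22.0, datetime(2024, 1, 1, 8, 3)),
    (20.0, 30.0, datetime(2024, 1, 1, 8, 7)),
]
centers = resample_by_time(reports, interval_minutes=5)
```

The first two reports share a window and collapse to their mean, while the third stands alone in the next window.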
- save_data(filepath: str = './input/data/', filename: str | None = None, extension: str = 'pkl')#
Saves data to a file.
- Parameters:
filepath (str) – path where data should be saved.
filename (str) – name of the file to be saved.
extension (str) – extension of the format of how the file should be saved. The default value is ‘pkl’.
- subsample_data(min_timedelta: float, save_subsampled_data: bool = False)#
Subsamples data according to a minimum timedelta between subsequent points
- Parameters:
min_timedelta (float) – minimum timedelta between subsequent points
save_subsampled_data (bool) – if True, subsampled data is saved to a file
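Subsampling by a minimum timedelta can be pictured as a single pass that keeps a point only when enough time has elapsed since the last kept point. The sketch below assumes the threshold is expressed in seconds, which the reference above does not state; `subsample` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def subsample(points, min_timedelta_s):
    """Keep a point only if it is at least min_timedelta_s after the last kept one."""
    kept = []
    for point in points:
        if not kept or (point[0] - kept[-1][0]).total_seconds() >= min_timedelta_s:
            kept.append(point)
    return kept


start = datetime(2024, 1, 1, 8, 0)
# One point every 10 seconds for one minute: offsets 0, 10, ..., 60.
points = [(start + timedelta(seconds=10 * i), i) for i in range(7)]
sampled = subsample(points, min_timedelta_s=30)
```

Comparing against the last kept point, rather than the previous raw point, is what guarantees the minimum spacing in the output.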