
Friday, December 8, 2023

CHAPTER 9 PYTHON LIBRARY SERIES-PANDAS

pandas is a popular open-source Python library for data manipulation and analysis. It provides the user-friendly data structures and functions needed to work with structured data with ease. Some essential features of the pandas library, and why they matter:

DataFrame and Series:

DataFrame: The DataFrame, a two-dimensional table with rows and columns, is the primary data structure in pandas. It lets you store and manipulate labeled, structured data efficiently.

Series: A one-dimensional labeled array that can hold data of any type. Series are the building blocks of DataFrames.

Data Cleaning and Preparation:

Pandas offers powerful tools for data cleaning and preparation, including functions for handling missing data, reshaping data, and converting data types.

Methods such as dropna(), fillna(), and replace() are frequently used to handle missing or inaccurate data.
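As a minimal sketch of these three methods, using a small invented DataFrame (column names and values are illustrative only):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-dataset with gaps, for illustration only
sample = pd.DataFrame({'temp': [21.0, np.nan, 19.5, np.nan],
                       'city': ['Pune', 'Pune', 'Agra', 'Agra']})

dropped = sample.dropna()                  # keep only rows with no missing values
filled = sample.fillna({'temp': 0.0})      # replace missing temps with 0
renamed = sample.replace('Agra', 'Delhi')  # substitute one value for another

print(dropped)  # only the two complete rows remain
```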

Data Exploration and Analysis: Pandas makes it simple to explore data with descriptive statistics and summary functions. The describe() method yields statistical summaries for numerical columns, while value_counts() is helpful for categorical data.

Pandas also makes it easy to group, aggregate, and filter data, enabling effective analysis.
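A quick sketch of these exploration helpers on a tiny, invented dataset:

```python
import pandas as pd

# Invented exam data, for illustration only
exam = pd.DataFrame({'score': [55, 80, 80, 95],
                     'grade': ['C', 'B', 'B', 'A']})

print(exam['score'].describe())      # count, mean, std, quartiles for a numerical column
print(exam['grade'].value_counts())  # frequency of each category ('B' appears twice)
print(exam[exam['score'] > 60])      # filter: rows where score exceeds 60
```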

Time Series Analysis: Pandas has strong support for time series data. It offers resampling, time-based indexing, and moving-window statistics, making it well suited to analyzing time-dependent data.
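A sketch of those three capabilities, using a synthetic hourly series (the values are invented):

```python
import pandas as pd

# Synthetic hourly readings over two days, for illustration only
idx = pd.date_range('2023-01-01', periods=48, freq='h')
ts = pd.Series(range(48), index=idx)

daily_mean = ts.resample('D').mean()    # resampling: down-sample hours to daily means
rolling3 = ts.rolling(window=3).mean()  # moving-window statistic: 3-hour average
first_day = ts['2023-01-01']            # time-based indexing: all readings on 1 Jan

print(daily_mean)
```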

Merging and Joining Data: Combining data from multiple sources is a frequent operation in data analysis. Pandas provides several join and merge functions, including concat() and merge().
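A minimal sketch of both functions, using two hypothetical tables that share an 'id' key:

```python
import pandas as pd

# Two hypothetical tables sharing an 'id' key, for illustration only
students = pd.DataFrame({'id': [1, 2], 'name': ['Asha', 'Ravi']})
marks = pd.DataFrame({'id': [2, 1], 'marks': [92, 88]})

joined = pd.merge(students, marks, on='id')                   # SQL-style join on the key column
stacked = pd.concat([students, students], ignore_index=True)  # stack rows vertically

print(joined)  # each student matched with their marks by id
```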

Data Input/Output: Pandas can read and write data in many formats, including CSV, Excel, SQL databases, JSON, and more, making it simple to import and export data between sources.
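A small round-trip sketch using in-memory buffers instead of real files (a file path works the same way with to_csv/read_csv):

```python
import pandas as pd
from io import StringIO

small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# CSV round trip via an in-memory buffer
buf = StringIO()
small.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# JSON round trip
json_text = small.to_json()
from_json = pd.read_json(StringIO(json_text))
```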

Flexibility and Performance:

Pandas is designed to be flexible and user-friendly. Its high-level interface makes data manipulation accessible to newcomers, while it also offers advanced functionality for experienced users.

Internally, pandas is built on NumPy, a high-performance numerical computing library. This ensures efficient data processing, particularly with large datasets.

Integration with Other Libraries:

Pandas works well with other libraries in the Python data science and machine learning ecosystem, including NumPy, Matplotlib, Seaborn, and scikit-learn. This seamless integration enables a thorough and efficient data analysis workflow.

Learning the pandas library is valuable for manipulating data, particularly in data science and analysis work. Beyond the DataFrame and Series data structures, pandas offers a range of methods for data manipulation, cleaning, and analysis. Here's a step-by-step tutorial to get you started:

To get the data files used in this tutorial, click here:

https://drive.google.com/drive/folders/1O6OFRmHvCEP6KmFDUlEOTuKBAth0K4ao?usp=sharing

1. Install Pandas:

Make sure you have Python installed on your system. You can install Pandas using pip:

pip install pandas

2. Import Pandas:

In your Python script or Jupyter notebook, import the Pandas library:

import pandas as pd

3. Data Structures:

a. Series:

A one-dimensional array-like object. You can create a Series from a list, array, or dictionary.

student_marks = [67, 28, 83, 64]

s = pd.Series(student_marks)

print(s)

Output:

0    67

1    28

2    83

3    64

dtype: int64
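A Series can also be built from a dictionary, in which case the keys become the index (the names here are invented):

```python
import pandas as pd

marks_by_name = {'Asha': 67, 'Ravi': 28, 'Meera': 83}
s = pd.Series(marks_by_name)
print(s['Ravi'])  # look up a value by its label: 28
```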

 

b. DataFrame:

A two-dimensional table of data with rows and columns. You can create a DataFrame from a dictionary, NumPy array, or other data structures.

weather_data = {'Month': ['Jan', 'Feb', 'March'],

        'Temp': [25, 30, 35],

        'Town': ['San Francisco', 'New York', 'Los Angeles']}

df_new= pd.DataFrame(weather_data)

print(df_new)

Output:

   Month  Temp           Town
0    Jan    25  San Francisco
1    Feb    30       New York
2  March    35    Los Angeles
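A DataFrame can likewise be created from a NumPy array by supplying column labels (the array values here are invented):

```python
import numpy as np
import pandas as pd

arr = np.array([[25, 30], [35, 40]])
df_arr = pd.DataFrame(arr, columns=['MinTemp', 'MaxTemp'])
print(df_arr)  # rows get a default integer index 0, 1
```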

 

4. Reading Data:

Read data from various sources like CSV, Excel, SQL databases, etc.

# CSV

df = pd.read_csv('your_data.csv')

# Excel

df = pd.read_excel('your_data.xlsx')

5. Exploring Data:

Useful functions to explore your data:

# Display basic information about the DataFrame

df.info()

# Summary statistics

df.describe()

# Display the first few rows

df.head()

# Display the last few rows

df.tail()

Practice:

import pandas as pd

# To read the CSV file as a dataframe

file_path = r'D:\Education Content-2\Apple\Books\pyhthon folder\Lucknow_1990_2022.csv'  # use a raw string for Windows paths

df = pd.read_csv(file_path)

# To get few rows of the DataFrame

print(df.head())

# To get basic information of the dataframe.

print(df.info())

# To display the summary statistics of all numerical columns

print(df.describe())

Output:

time  tavg  tmin  tmax  prcp
0  01-01-1990   7.2   NaN  18.1   0.0
1  02-01-1990  10.5   NaN  17.2   0.0
2  03-01-1990  10.2   1.8  18.6   NaN
3  04-01-1990   9.1   NaN  19.3   0.0
4  05-01-1990  13.5   NaN  23.8   0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11894 entries, 0 to 11893
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    11894 non-null  object 
 1   tavg    11756 non-null  float64
 2   tmin    8379 non-null   float64
 3   tmax    10341 non-null  float64
 4   prcp    5742 non-null   float64
dtypes: float64(4), object(1)
memory usage: 464.7+ KB
None
               tavg         tmin          tmax         prcp
count  11756.000000  8379.000000  10341.000000  5742.000000
mean      25.221240    18.795859     32.493405     4.535650
std        6.717716     7.197118      6.214145    17.079051
min        5.700000    -0.600000     11.100000     0.000000
25%       19.500000    12.500000     28.100000     0.000000
50%       27.200000    20.500000     33.400000     0.000000
75%       30.400000    25.100000     36.500000     1.000000
max       39.700000    32.700000     47.300000   470.900000

 


6. Data Manipulation:

a. Selection and Filtering:

# Selecting a column

df['Name'] # sample code, it is not executed.

column_read = df['tavg']

print(column_read)

Output:

0         7.2

1        10.5

2        10.2

3         9.1

4        13.5

         ...

11889    27.4

11890    28.1

11891    30.3

11892    30.0

11893    27.1

Name: tavg, Length: 11894, dtype: float64


# Selecting multiple columns

df[['Name', 'Age']] #sample code, it is not executed.

column_read=df[['tavg','tmin']]

print(column_read)

Output:

        tavg  tmin
0       7.2   NaN
1      10.5   NaN
2      10.2   1.8
3       9.1   NaN
4      13.5   NaN
...     ...   ...
11889  27.4  25.1
11890  28.1  26.1
11891  30.3  26.2
11892  30.0  28.1
11893  27.1  24.1
 
[11894 rows x 2 columns]

# Filtering rows

df[df['Age'] > 30] #sample code, it is not executed.

filtered_df = df[df['tavg'] > 20]

print(filtered_df)

Output:

 

             time  tavg  tmin  tmax  prcp

18     19-01-1990  20.5  13.0  29.5   NaN

27     28-01-1990  20.7   NaN  28.8   0.0

28     29-01-1990  21.4   NaN  28.8   0.0

37     07-02-1990  20.7  10.1  25.9   0.0

38     08-02-1990  20.7  12.9  26.1   NaN

...           ...   ...   ...   ...   ...

11889  21-07-2022  27.4  25.1  33.1  27.3

11890  22-07-2022  28.1  26.1  31.1  16.0

11891  23-07-2022  30.3  26.2  34.7  11.9

11892  24-07-2022  30.0  28.1  34.7   2.0

11893  25-07-2022  27.1  24.1  34.3   0.5

 

[8577 rows x 5 columns]

b. Adding and Removing Columns:

# Adding a new column

df['NewColumn'] = df['Age'] * 2

#sample code, it is not executed.

# Removing a column

df.drop('NewColumn', axis=1, inplace=True)

c. Handling Missing Data:

# Check for missing values

df.isnull().sum()

 

# Drop rows with missing values

df.dropna()

 

# Fill missing values

df.fillna(value=0)  # 'value' can be any fill value; 0 is used here

7. Grouping and Aggregation:

 

# Group by a column and calculate mean

df.groupby('City')['Age'].mean()

 

# Fill missing values in the weather DataFrame loaded earlier with 0

df_filled = df.fillna(value=0)

 

# Grouping data by a column and calculating mean

grouped_df = df.groupby('time')['tavg'].mean()

grouped_df

OUTPUT:

time

01-01-1990     7.2

01-01-1991    11.5

01-01-1992     9.9

01-01-1993    14.4

01-01-1994    14.0

              ...

31-12-2017    12.4

31-12-2018    13.3

31-12-2019     8.4

31-12-2020     9.9

31-12-2021    13.9

Name: tavg, Length: 11894, dtype: float64

 

# Aggregating data with multiple functions

agg_df = df.groupby('time').agg({'tavg': ['mean', 'sum']})

agg_df


OUTPUT:

                 tavg      
                 mean   sum
time                       
01-01-1990        7.2   7.2
01-01-1991       11.5  11.5
01-01-1992        9.9   9.9
01-01-1993       14.4  14.4
01-01-1994       14.0  14.0
...               ...   ...
31-12-2017       12.4  12.4
31-12-2018       13.3  13.3
31-12-2019        8.4   8.4
31-12-2020        9.9   9.9
31-12-2021       13.9  13.9

11894 rows × 2 columns

 

grouped_df.size

OUTPUT:

11894

grouped_df.index

OUTPUT:

Index(['01-01-1990', '01-01-1991', '01-01-1992', '01-01-1993', '01-01-1994',

       '01-01-1995', '01-01-1996', '01-01-1997', '01-01-1998', '01-01-1999',

       ...

       '31-12-2012', '31-12-2013', '31-12-2014', '31-12-2015', '31-12-2016',

       '31-12-2017', '31-12-2018', '31-12-2019', '31-12-2020', '31-12-2021'],

      dtype='object', name='time', length=11894)
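Note that every date in this file appears exactly once, so grouping by time simply returns each day's own value (hence Length: 11894 above). For meaningful time-based grouping, the string dates can first be parsed with pd.to_datetime; a minimal sketch using synthetic stand-in data:

```python
import pandas as pd

# Synthetic stand-in for the weather data (dates stored as dd-mm-yyyy strings)
demo = pd.DataFrame({'time': ['01-01-1990', '15-01-1990', '01-02-1990'],
                     'tavg': [7.2, 9.8, 12.0]})

demo['time'] = pd.to_datetime(demo['time'], format='%d-%m-%Y')
monthly = demo.groupby(demo['time'].dt.month)['tavg'].mean()
print(monthly)  # mean tavg per calendar month
```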

8. Data Visualization:

Pandas integrates well with Matplotlib and Seaborn for data visualization:

import matplotlib.pyplot as plt

import seaborn as sns

# Histogram

df['tavg'].hist()

plt.show()

OUTPUT: (a histogram of the tavg column is displayed)

# Scatter plot

sns.scatterplot(x='tmin', y='tmax', data=df)

plt.show()

OUTPUT: (a scatter plot of tmin against tmax is displayed)

9. Practice:

# The manager asks the analytics engineer to go through the following data thoroughly and answer these questions.

a) Which day has the highest average temperature? 10-06-2014

b) Which day has the lowest average temperature? 07-02-2008

import pandas as pd

# To read the CSV file as a dataframe

file_path = r'D:\Education Content-2\Apple\Books\pyhthon folder\Mumbai_1990_2022_Santacruz.csv'  # use a raw string for Windows paths

df = pd.read_csv(file_path)

#To find the maximum average value

tmax1='tavg'

max_value=df[tmax1].max()

print(max_value)

maxvalue_row=df.loc[df[tmax1] ==max_value]

print(maxvalue_row)

#To find the minimum average value

tmin1='tavg' # repeat the same step as done for 'max_value'

min_value=df[tmin1].min()

print(min_value)

minvalue_row=df.loc[df[tmin1]==min_value]

print(minvalue_row)

 

OUTPUT:

33.7

            time  tavg  tmin  tmax  prcp

8926  10-06-2014  33.7   NaN   NaN   NaN

 

17.7

            time  tavg  tmin  tmax  prcp

6611  07-02-2008  17.7   NaN  23.2   NaN

 

How much data is there in this dataset, before and after cleaning?

nan_value=df.isna()

print(nan_value)

#drop row which hold nan value

df_filled=df.dropna()

print(df_filled)

 

OUTPUT:

       time   tavg   tmin   tmax   prcp

0      False  False  False   True  False

1      False  False  False  False  False

2      False  False  False  False  False

3      False  False  False  False  False

4      False  False  False  False  False

...      ...    ...    ...    ...    ...

11889  False  False  False  False  False

11890  False  False  False  False  False

11891  False  False  False  False  False

11892  False  False  False  False  False

11893  False  False  False  False  False

 

[11894 rows x 5 columns]

 

            time  tavg  tmin  tmax  prcp

1      02-01-1990  22.2  16.5  29.9   0.0

2      03-01-1990  21.8  16.3  30.7   0.0

3      04-01-1990  25.4  17.9  31.8   0.0

4      05-01-1990  26.5  19.3  33.7   0.0

5      06-01-1990  25.1  19.8  33.5   0.0

...           ...   ...   ...   ...   ...

11889  21-07-2022  27.6  25.6  30.5  10.9

11890  22-07-2022  28.3  26.0  30.5   3.0

11891  23-07-2022  28.2  25.8  31.3   5.1

11892  24-07-2022  28.1  25.6  30.4   7.1

11893  25-07-2022  28.3  25.1  30.2   7.1

 

[4623 rows x 5 columns]

 

 

 

Which day recorded the lowest temperature? 08-02-2008 

#To find the minimum average value

tmin1='tmin' # repeat the same step as done for the 'tmax' column

min_value=df[tmin1].min()

print(min_value)

minvalue_row=df.loc[df[tmin1]==min_value]

print(minvalue_row)

 

OUTPUT:

8.5

            time  tavg  tmin  tmax  prcp

6612  08-02-2008  17.9   8.5  22.3   NaN

 

Which day recorded the highest temperature? 16-03-2011 

tmax_value='tmax'

tmax_value1=df[tmax_value].max()

print(tmax_value1)

OUTPUT:

41.3

tmax_row=df.loc[df[tmax_value]==tmax_value1]

print(tmax_row)

            time  tavg  tmin  tmax  prcp

7744  16-03-2011  32.8  19.2  41.3   NaN

 

