Getting Additional Column While Reading Txt Files in Python Pandas
Most of the data is available in a tabular format as CSV files, and this format is very popular. You can convert them to a pandas DataFrame using the read_csv function. The pandas.read_csv function is used to load a CSV file as a pandas DataFrame.
In this article, you will learn about the different features of the read_csv function of pandas apart from loading the CSV file, and the parameters that can be customized to get better output from the read_csv function.
pandas.read_csv
- Syntax: pandas.read_csv(filepath_or_buffer, sep, header, names, index_col, usecols, prefix, dtype, converters, skiprows, skipfooter, nrows, na_values, parse_dates)
- Purpose: Read a comma-separated values (CSV) file into a DataFrame. Also supports optionally iterating or breaking the file into chunks.
- Parameters:
- filepath_or_buffer : str, path object or file-like object. Any valid string path is acceptable. The string could be a URL too. Path objects refer to os.PathLike. File-like objects need a read() method, such as a file handle (e.g. via the built-in open function) or StringIO.
- sep : str, (Default ',') Separating boundary which distinguishes between any two subsequent data items.
- header : int, list of int, (Default 'infer') Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file.
- names : array-like List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
- index_col : int, str, sequence of int/str, or False, (Default None) Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used.
- usecols : list-like or callable Return a subset of the columns. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
- prefix : str Prefix to add to column numbers when there is no header, e.g. 'X' for X0, X1, ...
- dtype : Type name or dict of column -> type Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'} Use str or object together with suitable na_values settings to preserve and not interpret dtype.
- converters : dict Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
- skiprows : list-like, int or callable Line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise.
- skipfooter : int Number of lines at bottom of the file to skip
- nrows : int Number of rows of file to read. Useful for reading pieces of large files.
- na_values : scalar, str, list-like, or dict Additional strings to recognize as NA/NaN. If a dict is passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
- parse_dates : bool or list of int or names or list of lists or dict, (default False) If set to True, will try to parse the index, else parse the columns passed
- Returns: DataFrame or TextParser. A comma-separated values (CSV) file is returned as a two-dimensional data structure with labeled axes. For the full list of parameters, refer to the official documentation. A short sketch combining several of these parameters is shown below.
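To see how several of these parameters work together, here is a minimal sketch of a single call. The file name "data.csv", the column names and the values chosen here are assumptions for illustration, not part of the official API.
import pandas as pd

# A hedged sketch combining a few read_csv parameters (file and columns are assumed)
df = pd.read_csv(
    "data.csv",
    sep=",",                         # field separator
    header=0,                        # the first line holds the column names
    usecols=["Rank", "Population"],  # load only these two columns
    nrows=100,                       # read at most the first 100 data rows
    na_values=["missing"],           # also treat the string 'missing' as NaN
)
df.head()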
Reading CSV file
The pandas read_csv function can be used in different ways as per necessity, such as using custom separators, reading only selected columns/rows, and so on. All cases are covered below one after another.
Default Separator
To read a CSV file, call the pandas function read_csv() and pass the file path as input.
Step 1: Import pandas
import pandas as pd
Step 2: Read the CSV
# Read the csv file
df = pd.read_csv("data1.csv")

# First 5 rows
df.head()

Custom Separators
By default, a CSV is separated by commas, but you can use other separators as well. The pandas.read_csv function is not limited to reading a CSV file with the default separator (i.e. comma). It can be used with other separators such as ;, | or :. To load CSV files with such separators, the sep parameter is used to pass the separator used in the CSV file.
Let's load a file with the | separator.
# Read the csv file with sep='|'
df = pd.read_csv("data2.csv", sep='|')
df

Set any row as the column header
Let's see the data frame created using the pandas read_csv function without any header parameter:
# Read the csv file
df = pd.read_csv("data1.csv")
df.head()

The row 0 seems to be a better fit for the header, as it explains the figures in the table better. You can make this row 0 the header while reading the CSV by using the header parameter. The header parameter takes the value as a row number.
Note: Row numbering starts from 0 and includes the column header.
# Read the csv file with header parameter
df = pd.read_csv("data1.csv", header=1)
df.head()

Renaming column headers
While reading the CSV file, you can rename the column headers by using the names parameter. The names parameter takes the list of names for the column headers.
# Read the csv file with names parameter
df = pd.read_csv("data.csv", names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
df.head()

To avoid the old header being inferred as a row of the data frame, you can provide the header parameter, which will override the old header names with the new names.
# Read the csv file with header and names parameters
df = pd.read_csv("data.csv", header=0, names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
df.head()

Loading CSV without column headers in pandas
There is a chance that the CSV file you load doesn't have any column header. By default, pandas will make the first row the column header.
# Read the csv file
df = pd.read_csv("data3.csv")
df.head()

To avoid any row being inferred as the column header, you can specify header as None. It will force pandas to create numbered columns starting from 0.
# Read the csv file with header=None
df = pd.read_csv("data3.csv", header=None)
df.head()

Adding Prefixes to numbered columns
You can also give prefixes to the numbered column headers using the prefix parameter of the pandas read_csv function.
# Read the csv file with header=None and prefix='column_'
df = pd.read_csv("data3.csv", header=None, prefix='column_')
df.head()
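Note: in recent pandas releases the prefix parameter has been deprecated (and removed in pandas 2.0). If it is unavailable in your version, a minimal alternative sketch is to add the prefix after loading, assuming the same headerless file data3.csv:
import pandas as pd

# Read the headerless file, then prefix the default integer column labels
df = pd.read_csv("data3.csv", header=None)
df = df.add_prefix("column_")   # 0, 1, 2, ... become column_0, column_1, column_2, ...
df.head()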

Set any column(s) as the index
By default, pandas adds an initial index to the data frame loaded from the CSV file. You can control this behavior and make any column of your CSV the index by using the index_col parameter.
It takes the name of the desired column which has to be made the index.
Case 1: Making one column the index
# Read the csv file with 'Rank' as index
df = pd.read_csv("data.csv", index_col='Rank')
df.head()

Case 2: Making multiple columns the index
For two or more columns to be made the index, pass them as a list.
# Read the csv file with 'Rank' and 'Date' as index
df = pd.read_csv("data.csv", index_col=['Rank', 'Date'])
df.head()

Selecting columns while reading CSV
In practice, not all the columns of the CSV file are important. You can select only the necessary columns after loading the file, but if you're aware of them beforehand, you can save space and time. The usecols parameter takes the list of columns you want to load into your data frame.
Selecting columns using a list
# Read the csv file with 'Rank', 'Date' and 'Population' columns (list)
df = pd.read_csv("data.csv", usecols=['Rank', 'Date', 'Population'])
df.head()

Selecting columns using callable functions
The usecols parameter can also take callable functions. The callable functions are evaluated on the column names to select the specific columns where the function evaluates to True.
# Read the csv file with columns where length of column name > 10
df = pd.read_csv("data.csv", usecols=lambda x: len(x) > 10)
df.head()

Selecting/skipping rows while reading CSV
You can skip or select a specific number of rows from the dataset using the pandas.read_csv function. There are three parameters that can do this task: nrows, skiprows and skipfooter.
All of them have different functions. Let's discuss each of them separately.
A. nrows: This parameter allows you to control how many rows you want to load from the CSV file. It takes an integer specifying the row count.
# Read the csv file with 5 rows
df = pd.read_csv("data.csv", nrows=5)
df

B. skiprows: This parameter allows you to skip rows from the beginning of the file.
Skiprows by specifying row indices
# Read the csv file with first row skipped
df = pd.read_csv("data.csv", skiprows=1)
df.head()

Skiprows by using a callback function
The skiprows parameter can also take a callable function as input, which is evaluated on the row indices. This means the callable function will check every row index to decide if that row should be skipped or not.
# Read the csv file with odd rows skipped
df = pd.read_csv("data.csv", skiprows=lambda x: x % 2 != 0)
df.head()

C. skipfooter: This parameter allows you to skip rows from the end of the file.
# Read the csv file with 1 row skipped from the end
df = pd.read_csv("data.csv", skipfooter=1)
df.tail()

Changing the data type of columns
You can specify the data types of columns while reading the CSV file. The dtype parameter takes a dictionary of columns with their data types defined. To assign the data types, you can import them from the numpy package and map them against the suitable columns.
Data type of Rank before the change
# Read the csv file
df = pd.read_csv("data.csv")

# Display datatype of Rank
df.Rank.dtypes
dtype('int64')
Data type of Rank after the change
# Import numpy
import numpy as np

# Read the csv file with data type specified for Rank
df = pd.read_csv("data.csv", dtype={'Rank': np.int8})

# Display datatype of Rank
df.Rank.dtypes
dtype('int8')
Parse Dates while reading CSV
Datetime values are very crucial for data analysis. You can convert a column to a datetime-type column while reading the CSV in two ways:
Method 1. Make the desired column the index and pass parse_dates=True
# Read the csv file with 'Date' as index and parse_dates=True
df = pd.read_csv("data.csv", index_col='Date', parse_dates=True, nrows=5)

# Display index
df.index
DatetimeIndex(['2021-02-25', '2021-04-14', '2021-02-19', '2021-02-24', '2021-02-13'], dtype='datetime64[ns]', name='Date', freq=None)
Method 2. Pass the desired column to parse_dates as a list
# Read the csv file with parse_dates=['Date']
df = pd.read_csv("data.csv", parse_dates=['Date'], nrows=5)

# Display datatypes of columns
df.dtypes
Rank                         int64
State                       object
Population                  object
National Share (%)          object
Date                datetime64[ns]
dtype: object
Adding more NaN values
The pandas library can handle a lot of missing values. But there are many cases where the data contains missing values in forms that are not present in the pandas NA values list. For example, it doesn't understand 'missing', 'not found', or 'not available' as missing values.
So, you need to assign them as missing. To do this, use the na_values parameter, which takes a list of such values.
Loading CSV without specifying na_values
# Read the csv file
df = pd.read_csv("data.csv", nrows=5)
df

Loading CSV with na_values specified
# Read the csv file with 'missing' as na_values
df = pd.read_csv("data.csv", na_values=['missing'], nrows=5)
df

Convert values of the column while reading CSV
You can transform, alter, or convert the values of the columns of the CSV file while loading the CSV itself. This can be done by using the converters parameter. converters takes a dictionary whose keys are the column names and whose values are the functions to be applied to them.
Let's convert the comma-separated values (i.e. 19,98,12,341) of the Population column in the dataset to an integer value (199812341) while reading the CSV.
# Function which converts a comma-separated value to an integer
toInt = lambda x: int(x.replace(',', '')) if x != 'missing' else -1

# Read the csv file
df = pd.read_csv("data.csv", converters={'Population': toInt})
df.head()

Practical Tips
- Before loading the CSV file into a pandas data frame, always take a quick look at the file. It will help you judge which columns you should import and decide what data types your columns should have.
- You should also watch the total row count of the dataset. A system with 4 GB of RAM may not be able to load 7-8 million rows at once. For very large files, you can read the data in chunks, as sketched below.
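Here is a minimal sketch of chunked reading. The file name "data.csv" and the chunk size are assumptions for illustration; the chunksize parameter makes read_csv return an iterator of DataFrames instead of a single one.
import pandas as pd

# Process the file in chunks of 100,000 rows to keep memory usage bounded
total_rows = 0
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    # Work on each chunk here instead of holding the whole file in memory
    total_rows += len(chunk)

print(total_rows)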
Test your knowledge
Q1: You cannot load files with the $ separator using the pandas read_csv function. True or False?
Answer: False. Because you can use the sep parameter in the read_csv function.
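For example, a minimal sketch (the file name "example.csv" and its '$' delimiter are assumptions for illustration):
import pandas as pd

# Read a file that uses '$' as the field separator
df = pd.read_csv("example.csv", sep='$')
df.head()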
Q2: What is the use of the converters parameter in the read_csv function?
Answer: The converters parameter is used to modify the values of the columns while loading the CSV.
Q3: How will you make pandas recognize that a particular column is of datetime type?
Answer: By using the parse_dates parameter.
Q4: A dataset contains the missing values no, not available, and '-100'. How will you specify them as missing values for pandas to correctly interpret them? (Assume CSV file name: example1.csv)
Answer: By using the na_values parameter.
import pandas as pd

df = pd.read_csv("example1.csv", na_values=['no', 'not available', '-100'])
Q5: How would you read a CSV file where,
- The heading of the columns is in the third row (numbered from 1).
- The last five lines of the file have garbage text and should be avoided.
- Only the columns whose names start with a vowel should be included. Assume they are one word only.
(CSV file name: example2.csv)
Answer:
import pandas as pd

colnameWithVowels = lambda x: x.lower()[0] in ['a', 'e', 'i', 'o', 'u']
df = pd.read_csv("example2.csv", usecols=colnameWithVowels, header=2, skipfooter=5)
The article was contributed by Kaustubh G and Shrivarsheni.
Source: https://www.machinelearningplus.com/pandas/pandas-read_csv-completed/