Nicolaescu, L.: the Index of Families of Boundary Value Problems for Dirac Operators

Working with missing data¶

In this department, nosotros volition discuss missing (as well referred to as NA) values in pandas.

Notation

The selection of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based arroyo. See hither for more.

Run into the cookbook for some avant-garde strategies.

Values considered "missing"¶

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing information. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating betoken, integer, boolean, and full general object. In many cases, however, the Python None will arise and we wish to likewise consider that "missing" or "not available" or "NA".

Notation

If you want to consider inf and -inf to be "NA" in computations, you lot tin gear up pandas.options.mode.use_inf_as_na = True .

                            In [1]:                            df              =              pd              .              DataFrame              (                              ...:                            np              .              random              .              randn              (              5              ,              three              ),                              ...:                            alphabetize              =              [              "a"              ,              "c"              ,              "e"              ,              "f"              ,              "h"              ],                              ...:                            columns              =              [              "one"              ,              "two"              ,              "3"              ],                              ...:                            )                              ...:                            In [ii]:                            df              [              "iv"              ]              =              "bar"              In [3]:                            df              [              "five"              ]              =              df              [              "i"              ]              >              0              In [4]:                            df              Out[4]:                                                          ane       two     iii 4   v              a  0.469112 -0.282863 -i.509059  bar   True              c -1.135632  ane.212112 -0.173215  bar  Imitation              e  0.119209 -1.044236 -0.861849  bar   Truthful              f -two.104569 -0.494929  one.071804  bar  Imitation              h  0.721555 -0.706771 -1.039575  bar   Truthful              In [v]:                            df2              =              df              .              reindex              ([              "a"              ,              "b"              ,              "c"              ,              "d"              ,              "e"              ,              "f"              ,              "thousand"              ,              "h"              ])              In [6]:                            df2              Out[half dozen]:                                                          i       two     iii 4   five              a  0.469112 -0.282863 -1.509059  bar   Truthful              b       NaN       NaN       NaN  NaN    NaN              c -1.135632  1.212112 -0.173215  bar  Fake              d       NaN       NaN       NaN  NaN    NaN              east  0.119209 -1.044236 -0.861849  bar   True              f -2.104569 -0.494929  1.071804  bar  False              g       NaN       NaN       NaN  NaN    NaN              h  0.721555 -0.706771 -1.039575  bar   True            

To make detecting missing values easier (and across unlike array dtypes), pandas provides the isna() and notna() functions, which are also methods on Serial and DataFrame objects:

                            In [vii]:                            df2              [              "one"              ]              Out[7]:                                          a    0.469112              b         NaN              c   -1.135632              d         NaN              e    0.119209              f   -2.104569              g         NaN              h    0.721555              Name: one, dtype: float64              In [8]:                            pd              .              isna              (              df2              [              "one"              ])              Out[8]:                                          a    Simulated              b     True              c    False              d     True              e    False              f    False              g     True              h    Fake              Name: one, dtype: bool              In [nine]:                            df2              [              "four"              ]              .              notna              ()              Out[ix]:                                          a     True              b    False              c     True              d    Fake              e     True              f     Truthful              g    False              h     True              Name: four, dtype: bool              In [10]:                            df2              .              isna              ()              Out[ten]:                                                          i    2  iii   four   five              a  Fake  False  False  Imitation  False              b   True   Truthful   Truthful   True   True              c  Imitation  Simulated  False  Fake  False              d   True   True   True   Truthful   True              e  False  Imitation  False  False  False              f  False  False  False  False  False              g   True   True   Truthful   True   True              h  Simulated  Faux  False  False  Imitation            

Alert

One has to be mindful that in Python (and NumPy), the nan's don't compare equal, only None's practise. Notation that pandas/NumPy uses the fact that np.nan != np.nan , and treats None like np.nan .

                                In [xi]:                                None                ==                None                # noqa: E711                Out[eleven]:                                Truthful                In [12]:                                np                .                nan                ==                np                .                nan                Out[12]:                                False              

So as compared to above, a scalar equality comparison versus a None/np.nan doesn't provide useful data.

                                In [13]:                                df2                [                "one"                ]                ==                np                .                nan                Out[xiii]:                                                a    False                b    False                c    Fake                d    Fake                e    Simulated                f    False                g    False                h    False                Name: i, dtype: bool              

Integer dtypes and missing data¶

Considering NaN is a float, a cavalcade of integers with fifty-fifty one missing values is cast to floating-point dtype (see Support for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

                                In [fourteen]:                                pd                .                Series                ([                1                ,                2                ,                np                .                nan                ,                4                ],                dtype                =                pd                .                Int64Dtype                ())                Out[14]:                                                0       i                1       2                2    <NA>                iii       4                dtype: Int64              

Alternatively, the string alias dtype='Int64' (notation the capital "I" ) can be used.

See Nullable integer information type for more than.

Datetimes¶

For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented by NumPy in a singular dtype (datetime64[ns]). pandas objects provide compatibility between NaT and NaN .

                                In [xv]:                                df2                =                df                .                copy                ()                In [16]:                                df2                [                "timestamp"                ]                =                pd                .                Timestamp                (                "20120101"                )                In [17]:                                df2                Out[17]:                                                                  1       two     iii four   five  timestamp                a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01                c -one.135632  1.212112 -0.173215  bar  False 2012-01-01                e  0.119209 -1.044236 -0.861849  bar   Truthful 2012-01-01                f -2.104569 -0.494929  1.071804  bar  Faux 2012-01-01                h  0.721555 -0.706771 -1.039575  bar   Truthful 2012-01-01                In [18]:                                df2                .                loc                [[                "a"                ,                "c"                ,                "h"                ],                [                "one"                ,                "timestamp"                ]]                =                np                .                nan                In [19]:                                df2                Out[19]:                                                                  one       two     three four   five  timestamp                a       NaN -0.282863 -one.509059  bar   True        NaT                c       NaN  i.212112 -0.173215  bar  False        NaT                east  0.119209 -1.044236 -0.861849  bar   True 2012-01-01                f -2.104569 -0.494929  i.071804  bar  Simulated 2012-01-01                h       NaN -0.706771 -i.039575  bar   Truthful        NaT                In [20]:                                df2                .                dtypes                .                value_counts                ()                Out[20]:                                                float64           iii                object            1                bool              1                datetime64[ns]    i                dtype: int64              

Inserting missing information¶

You can insert missing values past simply assigning to containers. The actual missing value used will exist chosen based on the dtype.

For example, numeric containers will always use NaN regardless of the missing value type chosen:

                            In [21]:                            s              =              pd              .              Series              ([              i              ,              2              ,              3              ])              In [22]:                            southward              .              loc              [              0              ]              =              None              In [23]:                            s              Out[23]:                                          0    NaN              1    two.0              two    3.0              dtype: float64            

As well, datetime containers will always utilize NaT .

For object containers, pandas volition utilise the value given:

                            In [24]:                            due south              =              pd              .              Serial              ([              "a"              ,              "b"              ,              "c"              ])              In [25]:                            due south              .              loc              [              0              ]              =              None              In [26]:                            s              .              loc              [              1              ]              =              np              .              nan              In [27]:                            s              Out[27]:                                          0    None              1     NaN              2       c              dtype: object            

Calculations with missing data¶

Missing values propagate naturally through arithmetics operations between pandas objects.

                            In [28]:                            a              Out[28]:                                                          one       two              a       NaN -0.282863              c       NaN  1.212112              e  0.119209 -1.044236              f -2.104569 -0.494929              h -2.104569 -0.706771              In [29]:                            b              Out[29]:                                                          one       ii     three              a       NaN -0.282863 -1.509059              c       NaN  i.212112 -0.173215              due east  0.119209 -1.044236 -0.861849              f -two.104569 -0.494929  ane.071804              h       NaN -0.706771 -1.039575              In [30]:                            a              +              b              Out[30]:                                                          one  three       ii              a       NaN    NaN -0.565727              c       NaN    NaN  2.424224              e  0.238417    NaN -2.088472              f -four.209138    NaN -0.989859              h       NaN    NaN -i.413542            

The descriptive statistics and computational methods discussed in the data structure overview (and listed hither and here) are all written to account for missing data. For example:

  • When summing data, NA (missing) values will be treated equally zero.

  • If the information are all NA, the outcome will be 0.

  • Cumulative methods like cumsum() and cumprod() ignore NA values by default, simply preserve them in the resulting arrays. To override this behaviour and include NA values, employ skipna=Faux .

                            In [31]:                            df              Out[31]:                                                          1       2     iii              a       NaN -0.282863 -1.509059              c       NaN  i.212112 -0.173215              e  0.119209 -ane.044236 -0.861849              f -2.104569 -0.494929  i.071804              h       NaN -0.706771 -1.039575              In [32]:                            df              [              "one"              ]              .              sum              ()              Out[32]:                            -1.9853605075978744              In [33]:                            df              .              hateful              (              i              )              Out[33]:                                          a   -0.895961              c    0.519449              due east   -0.595625              f   -0.509232              h   -0.873173              dtype: float64              In [34]:                            df              .              cumsum              ()              Out[34]:                                                          one       2     three              a       NaN -0.282863 -1.509059              c       NaN  0.929249 -1.682273              eastward  0.119209 -0.114987 -2.544122              f -i.985361 -0.609917 -1.472318              h       NaN -1.316688 -2.511893              In [35]:                            df              .              cumsum              (              skipna              =              Faux              )              Out[35]:                                                          one       two     three              a  NaN -0.282863 -ane.509059              c  NaN  0.929249 -i.682273              e  NaN -0.114987 -ii.544122              f  NaN -0.609917 -i.472318              h  NaN -1.316688 -2.511893            

Sum/prod of empties/nans¶

Warning

This behavior is now standard as of v0.22.0 and is consistent with the default in numpy ; previously sum/prod of all-NA or empty Serial/DataFrames would return NaN. See v0.22.0 whatsnew for more.

The sum of an empty or all-NA Series or column of a DataFrame is 0.

                            In [36]:                            pd              .              Series              ([              np              .              nan              ])              .              sum              ()              Out[36]:                            0.0              In [37]:                            pd              .              Series              ([],              dtype              =              "float64"              )              .              sum              ()              Out[37]:                            0.0            

The product of an empty or all-NA Series or cavalcade of a DataFrame is ane.

                            In [38]:                            pd              .              Series              ([              np              .              nan              ])              .              prod              ()              Out[38]:                            1.0              In [39]:                            pd              .              Serial              ([],              dtype              =              "float64"              )              .              prod              ()              Out[39]:                            i.0            

NA values in GroupBy¶

NA groups in GroupBy are automatically excluded. This behavior is consequent with R, for example:

                            In [twoscore]:                            df              Out[40]:                                                          1       ii     three              a       NaN -0.282863 -1.509059              c       NaN  one.212112 -0.173215              e  0.119209 -1.044236 -0.861849              f -two.104569 -0.494929  ane.071804              h       NaN -0.706771 -ane.039575              In [41]:                            df              .              groupby              (              "ane"              )              .              mean              ()              Out[41]:                                                          two     three              1                            -2.104569 -0.494929  ane.071804                              0.119209 -1.044236 -0.861849            

See the groupby section here for more information.

Cleaning / filling missing data¶

pandas objects are equipped with various data manipulation methods for dealing with missing data.

Filling missing values: fillna¶

fillna() tin can "fill up in" NA values with non-NA data in a couple of ways, which we illustrate:

Replace NA with a scalar value

                            In [42]:                            df2              Out[42]:                                                          i       2     three iv   five  timestamp              a       NaN -0.282863 -1.509059  bar   True        NaT              c       NaN  1.212112 -0.173215  bar  Simulated        NaT              eastward  0.119209 -1.044236 -0.861849  bar   True 2012-01-01              f -2.104569 -0.494929  1.071804  bar  Fake 2012-01-01              h       NaN -0.706771 -one.039575  bar   Truthful        NaT              In [43]:                            df2              .              fillna              (              0              )              Out[43]:                                                          i       two     three iv   v            timestamp              a  0.000000 -0.282863 -i.509059  bar   True                    0              c  0.000000  one.212112 -0.173215  bar  False                    0              e  0.119209 -ane.044236 -0.861849  bar   True  2012-01-01 00:00:00              f -two.104569 -0.494929  1.071804  bar  Simulated  2012-01-01 00:00:00              h  0.000000 -0.706771 -1.039575  bar   Truthful                    0              In [44]:                            df2              [              "one"              ]              .              fillna              (              "missing"              )              Out[44]:                                          a     missing              c     missing              e    0.119209              f   -2.104569              h     missing              Proper name: one, dtype: object            

Make full gaps forward or astern

Using the same filling arguments every bit reindexing, we can propagate not-NA values forrard or backward:

                            In [45]:                            df              Out[45]:                                                          one       two     three              a       NaN -0.282863 -1.509059              c       NaN  ane.212112 -0.173215              e  0.119209 -1.044236 -0.861849              f -2.104569 -0.494929  1.071804              h       NaN -0.706771 -1.039575              In [46]:                            df              .              fillna              (              method              =              "pad"              )              Out[46]:                                                          ane       2     three              a       NaN -0.282863 -one.509059              c       NaN  ane.212112 -0.173215              e  0.119209 -1.044236 -0.861849              f -2.104569 -0.494929  ane.071804              h -two.104569 -0.706771 -1.039575            

Limit the amount of filling

If nosotros only desire consecutive gaps filled up to a certain number of data points, we can apply the limit keyword:

                            In [47]:                            df              Out[47]:                                                          i       2     three              a  NaN -0.282863 -1.509059              c  NaN  1.212112 -0.173215              e  NaN       NaN       NaN              f  NaN       NaN       NaN              h  NaN -0.706771 -1.039575              In [48]:                            df              .              fillna              (              method              =              "pad"              ,              limit              =              i              )              Out[48]:                                                          i       two     iii              a  NaN -0.282863 -1.509059              c  NaN  ane.212112 -0.173215              e  NaN  one.212112 -0.173215              f  NaN       NaN       NaN              h  NaN -0.706771 -1.039575            

To remind you, these are the available filling methods:

Method

Activity

pad / ffill

Fill values forward

bfill / backfill

Fill values backward

With time series data, using pad/ffill is extremely common then that the "terminal known value" is available at every fourth dimension point.

ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill')

Filling with a PandasObject¶

You lot can also fillna using a dict or Series that is alignable. The labels of the dict or alphabetize of the Serial must match the columns of the frame you wish to make full. The use case of this is to fill a DataFrame with the mean of that column.

                            In [49]:                            dff              =              pd              .              DataFrame              (              np              .              random              .              randn              (              x              ,              three              ),              columns              =              list              (              "ABC"              ))              In [50]:                            dff              .              iloc              [              3              :              5              ,              0              ]              =              np              .              nan              In [51]:                            dff              .              iloc              [              4              :              6              ,              1              ]              =              np              .              nan              In [52]:                            dff              .              iloc              [              5              :              8              ,              2              ]              =              np              .              nan              In [53]:                            dff              Out[53]:                                                          A         B         C              0  0.271860 -0.424972  0.567020              1  0.276232 -i.087401 -0.673690              2  0.113648 -i.478427  0.524988              three       NaN  0.577046 -ane.715002              4       NaN       NaN -i.157892              5 -i.344312       NaN       NaN              half dozen -0.109050  1.643563       NaN              seven  0.357021 -0.674600       NaN              eight -0.968914 -1.294524  0.413738              9  0.276662 -0.472035 -0.013960              In [54]:                            dff              .              fillna              (              dff              .              mean              ())              Out[54]:                                                          A         B         C              0  0.271860 -0.424972  0.567020              one  0.276232 -1.087401 -0.673690              two  0.113648 -1.478427  0.524988              3 -0.140857  0.577046 -1.715002              4 -0.140857 -0.401419 -1.157892              5 -one.344312 -0.401419 -0.293543              half dozen -0.109050  1.643563 -0.293543              7  0.357021 -0.674600 -0.293543              8 -0.968914 -ane.294524  0.413738              9  0.276662 -0.472035 -0.013960              In [55]:                            dff              .              fillna              (              dff              .              hateful              ()[              "B"              :              "C"              ])              Out[55]:                                                          A         B         C              0  0.271860 -0.424972  0.567020              i  0.276232 -1.087401 -0.673690              2  0.113648 -1.478427  0.524988              3       NaN  0.577046 -i.715002              iv       NaN -0.401419 -1.157892              5 -1.344312 -0.401419 -0.293543              half dozen -0.109050  1.643563 -0.293543              7  0.357021 -0.674600 -0.293543              eight -0.968914 -1.294524  0.413738              9  0.276662 -0.472035 -0.013960            

Same upshot as above, but is aligning the 'fill' value which is a Serial in this case.

                            In [56]:                            dff              .              where              (              pd              .              notna              (              dff              ),              dff              .              hateful              (),              axis              =              "columns"              )              Out[56]:                                                          A         B         C              0  0.271860 -0.424972  0.567020              one  0.276232 -1.087401 -0.673690              two  0.113648 -one.478427  0.524988              3 -0.140857  0.577046 -ane.715002              four -0.140857 -0.401419 -1.157892              5 -1.344312 -0.401419 -0.293543              6 -0.109050  1.643563 -0.293543              7  0.357021 -0.674600 -0.293543              8 -0.968914 -1.294524  0.413738              9  0.276662 -0.472035 -0.013960            

Dropping axis labels with missing information: dropna¶

You may wish to only exclude labels from a information prepare which refer to missing data. To do this, use dropna() :

                            In [57]:                            df              Out[57]:                                                          one       two     3              a  NaN -0.282863 -1.509059              c  NaN  ane.212112 -0.173215              e  NaN  0.000000  0.000000              f  NaN  0.000000  0.000000              h  NaN -0.706771 -1.039575              In [58]:                            df              .              dropna              (              centrality              =              0              )              Out[58]:                                          Empty DataFrame              Columns: [one, two, three]              Index: []              In [59]:                            df              .              dropna              (              axis              =              1              )              Out[59]:                                                          two     three              a -0.282863 -ane.509059              c  ane.212112 -0.173215              e  0.000000  0.000000              f  0.000000  0.000000              h -0.706771 -one.039575              In [60]:                            df              [              "ane"              ]              .              dropna              ()              Out[60]:                            Series([], Proper name: one, dtype: float64)            

An equivalent dropna() is available for Series. DataFrame.dropna has considerably more options than Series.dropna, which can be examined in the API.

Interpolation¶

Both Serial and DataFrame objects have interpolate() that, by default, performs linear interpolation at missing data points.

                            In [61]:                            ts              Out[61]:                                          2000-01-31    0.469112              2000-02-29         NaN              2000-03-31         NaN              2000-04-28         NaN              2000-05-31         NaN                              ...                            2007-12-31   -6.950267              2008-01-31   -7.904475              2008-02-29   -6.441779              2008-03-31   -eight.184940              2008-04-thirty   -9.011531              Freq: BM, Length: 100, dtype: float64              In [62]:                            ts              .              count              ()              Out[62]:                            66              In [63]:                            ts              .              plot              ()              Out[63]:                            <AxesSubplot:>            
../_images/series_before_interpolate.png
                            In [64]:                            ts              .              interpolate              ()              Out[64]:                                          2000-01-31    0.469112              2000-02-29    0.434469              2000-03-31    0.399826              2000-04-28    0.365184              2000-05-31    0.330541                              ...                            2007-12-31   -6.950267              2008-01-31   -7.904475              2008-02-29   -6.441779              2008-03-31   -8.184940              2008-04-30   -9.011531              Freq: BM, Length: 100, dtype: float64              In [65]:                            ts              .              interpolate              ()              .              count              ()              Out[65]:                            100              In [66]:                            ts              .              interpolate              ()              .              plot              ()              Out[66]:                            <AxesSubplot:>            
../_images/series_interpolate.png

Alphabetize aware interpolation is available via the method keyword:

                            In [67]:                            ts2              Out[67]:                                          2000-01-31    0.469112              2000-02-29         NaN              2002-07-31   -5.785037              2005-01-31         NaN              2008-04-30   -nine.011531              dtype: float64              In [68]:                            ts2              .              interpolate              ()              Out[68]:                                          2000-01-31    0.469112              2000-02-29   -ii.657962              2002-07-31   -5.785037              2005-01-31   -7.398284              2008-04-30   -ix.011531              dtype: float64              In [69]:                            ts2              .              interpolate              (              method              =              "fourth dimension"              )              Out[69]:                                          2000-01-31    0.469112              2000-02-29    0.270241              2002-07-31   -5.785037              2005-01-31   -vii.190866              2008-04-30   -9.011531              dtype: float64            

For a floating-point index, use method='values' :

                            In [70]:                            ser              Out[70]:                                          0.0      0.0              1.0      NaN              10.0    x.0              dtype: float64              In [71]:                            ser              .              interpolate              ()              Out[71]:                                          0.0      0.0              1.0      v.0              10.0    x.0              dtype: float64              In [72]:                            ser              .              interpolate              (              method              =              "values"              )              Out[72]:                                          0.0      0.0              1.0      1.0              10.0    10.0              dtype: float64            

You can also interpolate with a DataFrame:

                            In [73]:                            df              =              pd              .              DataFrame              (                              ....:                            {                              ....:                            "A"              :              [              1              ,              ii.1              ,              np              .              nan              ,              iv.7              ,              5.6              ,              6.eight              ],                              ....:                            "B"              :              [              0.25              ,              np              .              nan              ,              np              .              nan              ,              4              ,              12.2              ,              fourteen.4              ],                              ....:                            }                              ....:                            )                              ....:                            In [74]:                            df              Out[74]:                                                          A      B              0  ane.0   0.25              1  ii.1    NaN              ii  NaN    NaN              3  4.seven   4.00              4  5.6  12.20              5  half dozen.8  14.twoscore              In [75]:                            df              .              interpolate              ()              Out[75]:                                                          A      B              0  i.0   0.25              i  2.1   1.l              2  3.4   2.75              3  4.vii   4.00              iv  v.6  12.20              v  6.viii  xiv.40            

The method statement gives access to fancier interpolation methods. If you take scipy installed, yous can pass the name of a 1-d interpolation routine to method . Y'all'll desire to consult the total scipy interpolation documentation and reference guide for details. The appropriate interpolation method volition depend on the type of data yous are working with.

  • If you are dealing with a time series that is growing at an increasing rate, method='quadratic' may be appropriate.

  • If yous have values approximating a cumulative distribution function, then method='pchip' should work well.

  • To fill missing values with goal of smoothen plotting, consider method='akima' .

Warning

These methods require scipy .

                            In [76]:                            df              .              interpolate              (              method              =              "barycentric"              )              Out[76]:                                                          A       B              0  1.00   0.250              1  2.ten  -7.660              2  3.53  -4.515              3  four.70   4.000              4  5.60  12.200              5  6.fourscore  14.400              In [77]:                            df              .              interpolate              (              method              =              "pchip"              )              Out[77]:                                                          A          B              0  ane.00000   0.250000              1  2.10000   0.672808              two  iii.43454   1.928950              3  4.70000   four.000000              4  5.60000  12.200000              5  vi.80000  14.400000              In [78]:                            df              .              interpolate              (              method              =              "akima"              )              Out[78]:                                                          A          B              0  one.000000   0.250000              1  2.100000  -0.873316              ii  iii.406667   0.320034              3  four.700000   iv.000000              4  5.600000  12.200000              5  6.800000  fourteen.400000            

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

                            In [79]:                            df              .              interpolate              (              method              =              "spline"              ,              order              =              ii              )              Out[79]:                                                          A          B              0  1.000000   0.250000              i  two.100000  -0.428598              two  3.404545   1.206900              3  4.700000   4.000000              4  five.600000  12.200000              5  6.800000  14.400000              In [80]:                            df              .              interpolate              (              method              =              "polynomial"              ,              order              =              2              )              Out[80]:                                                          A          B              0  1.000000   0.250000              1  2.100000  -2.703846              2  3.451351  -1.453846              three  4.700000   iv.000000              four  v.600000  12.200000              5  half dozen.800000  xiv.400000            

Compare several methods:

                            In [81]:                            np              .              random              .              seed              (              ii              )              In [82]:                            ser              =              pd              .              Series              (              np              .              arange              (              one              ,              10.1              ,              0.25              )              **              2              +              np              .              random              .              randn              (              37              ))              In [83]:                            missing              =              np              .              array              ([              four              ,              thirteen              ,              xiv              ,              fifteen              ,              16              ,              17              ,              eighteen              ,              xx              ,              29              ])              In [84]:                            ser              [              missing              ]              =              np              .              nan              In [85]:                            methods              =              [              "linear"              ,              "quadratic"              ,              "cubic"              ]              In [86]:                            df              =              pd              .              DataFrame              ({              chiliad              :              ser              .              interpolate              (              method              =              chiliad              )              for              m              in              methods              })              In [87]:                            df              .              plot              ()              Out[87]:                            <AxesSubplot:>            
../_images/compare_interpolations.png

Another utilize case is interpolation at new values. Suppose you have 100 observations from some distribution. And permit's suppose that you lot're specially interested in what's happening around the middle. Y'all can mix pandas' reindex and interpolate methods to interpolate at the new values.

                            In [88]:                            ser              =              pd              .              Serial              (              np              .              sort              (              np              .              random              .              uniform              (              size              =              100              )))              # interpolate at new_index              In [89]:                            new_index              =              ser              .              index              .              union              (              pd              .              Alphabetize              ([              49.25              ,              49.5              ,              49.75              ,              fifty.25              ,              50.5              ,              50.75              ]))              In [ninety]:                            interp_s              =              ser              .              reindex              (              new_index              )              .              interpolate              (              method              =              "pchip"              )              In [91]:                            interp_s              [              49              :              51              ]              Out[91]:                                          49.00    0.471410              49.25    0.476841              49.50    0.481780              49.75    0.485998              50.00    0.489266              50.25    0.491814              l.50    0.493995              50.75    0.495763              51.00    0.497074              dtype: float64            

Interpolation limits¶

Similar other pandas make full methods, interpolate() accepts a limit keyword argument. Employ this argument to limit the number of sequent NaN values filled since the terminal valid observation:

                                In [92]:                                ser                =                pd                .                Serial                ([                np                .                nan                ,                np                .                nan                ,                five                ,                np                .                nan                ,                np                .                nan                ,                np                .                nan                ,                thirteen                ,                np                .                nan                ,                np                .                nan                ])                In [93]:                                ser                Out[93]:                                                0     NaN                1     NaN                2     5.0                3     NaN                4     NaN                5     NaN                6    thirteen.0                seven     NaN                8     NaN                dtype: float64                # make full all consecutive values in a frontward direction                In [94]:                                ser                .                interpolate                ()                Out[94]:                                                0     NaN                ane     NaN                2     five.0                3     seven.0                iv     nine.0                5    eleven.0                6    thirteen.0                seven    13.0                8    13.0                dtype: float64                # fill up one sequent value in a frontward direction                In [95]:                                ser                .                interpolate                (                limit                =                ane                )                Out[95]:                                                0     NaN                one     NaN                2     5.0                3     vii.0                4     NaN                5     NaN                6    thirteen.0                7    13.0                8     NaN                dtype: float64              

Past default, NaN values are filled in a forward direction. Use limit_direction parameter to fill backward or from both directions.

                                # fill one consecutive value backwards                In [96]:                                ser                .                interpolate                (                limit                =                1                ,                limit_direction                =                "astern"                )                Out[96]:                                                0     NaN                1     five.0                2     5.0                iii     NaN                4     NaN                five    eleven.0                6    13.0                7     NaN                8     NaN                dtype: float64                # fill up ane consecutive value in both directions                In [97]:                                ser                .                interpolate                (                limit                =                one                ,                limit_direction                =                "both"                )                Out[97]:                                                0     NaN                ane     5.0                two     five.0                iii     seven.0                4     NaN                5    11.0                6    xiii.0                7    13.0                8     NaN                dtype: float64                # fill all consecutive values in both directions                In [98]:                                ser                .                interpolate                (                limit_direction                =                "both"                )                Out[98]:                                                0     v.0                1     5.0                2     v.0                3     7.0                iv     nine.0                5    xi.0                vi    13.0                seven    thirteen.0                8    thirteen.0                dtype: float64              

By default, NaN values are filled whether they are inside (surrounded by) existing valid values, or exterior existing valid values. The limit_area parameter restricts filling to either inside or outside values.

                                # make full one sequent within value in both directions                In [99]:                                ser                .                interpolate                (                limit_direction                =                "both"                ,                limit_area                =                "inside"                ,                limit                =                1                )                Out[99]:                                                0     NaN                1     NaN                ii     5.0                3     7.0                4     NaN                5    11.0                6    13.0                seven     NaN                8     NaN                dtype: float64                # fill all consecutive outside values backward                In [100]:                                ser                .                interpolate                (                limit_direction                =                "backward"                ,                limit_area                =                "exterior"                )                Out[100]:                                                0     5.0                1     5.0                2     5.0                3     NaN                4     NaN                5     NaN                half-dozen    13.0                seven     NaN                eight     NaN                dtype: float64                # fill up all sequent outside values in both directions                In [101]:                                ser                .                interpolate                (                limit_direction                =                "both"                ,                limit_area                =                "outside"                )                Out[101]:                                                0     5.0                1     v.0                2     5.0                3     NaN                4     NaN                5     NaN                6    13.0                7    13.0                viii    13.0                dtype: float64              

Replacing generic values¶

Often times we want to replace arbitrary values with other values.

supersede() in Serial and supercede() in DataFrame provides an efficient yet flexible way to perform such replacements.

For a Series, you tin can supercede a single value or a list of values by another value:

                            In [102]:                            ser              =              pd              .              Series              ([              0.0              ,              1.0              ,              two.0              ,              3.0              ,              4.0              ])              In [103]:                            ser              .              supersede              (              0              ,              five              )              Out[103]:                                          0    5.0              ane    one.0              2    2.0              iii    3.0              four    4.0              dtype: float64            

Y'all can replace a listing of values by a listing of other values:

                            In [104]:                            ser              .              supervene upon              ([              0              ,              1              ,              2              ,              three              ,              4              ],              [              4              ,              three              ,              2              ,              i              ,              0              ])              Out[104]:                                          0    iv.0              ane    3.0              2    ii.0              3    1.0              iv    0.0              dtype: float64            

You can likewise specify a mapping dict:

                            In [105]:                            ser              .              replace              ({              0              :              10              ,              1              :              100              })              Out[105]:                                          0     x.0              1    100.0              ii      2.0              3      3.0              4      4.0              dtype: float64            

For a DataFrame, you can specify individual values by cavalcade:

                            In [106]:                            df              =              pd              .              DataFrame              ({              "a"              :              [              0              ,              ane              ,              2              ,              3              ,              4              ],              "b"              :              [              5              ,              6              ,              seven              ,              8              ,              9              ]})              In [107]:                            df              .              replace              ({              "a"              :              0              ,              "b"              :              v              },              100              )              Out[107]:                                                          a    b              0  100  100              one    ane    6              2    2    seven              3    three    viii              4    iv    9            

Instead of replacing with specified values, you can care for all given values equally missing and interpolate over them:

                            In [108]:                            ser              .              replace              ([              1              ,              2              ,              3              ],              method              =              "pad"              )              Out[108]:                                          0    0.0              1    0.0              2    0.0              three    0.0              4    iv.0              dtype: float64            

String/regular expression replacement¶

Notation

Python strings prefixed with the r character such every bit r'hullo world' are and then-called "raw" strings. They have different semantics regarding backslashes than strings without this prefix. Backslashes in raw strings volition be interpreted every bit an escaped backslash, e.chiliad., r'\' == '\\' . Y'all should read near them if this is unclear.

Replace the '.' with NaN (str -> str):

                            In [109]:                            d              =              {              "a"              :              list              (              range              (              iv              )),              "b"              :              list              (              "ab.."              ),              "c"              :              [              "a"              ,              "b"              ,              np              .              nan              ,              "d"              ]}              In [110]:                            df              =              pd              .              DataFrame              (              d              )              In [111]:                            df              .              replace              (              "."              ,              np              .              nan              )              Out[111]:                                                          a    b    c              0  0    a    a              1  one    b    b              2  ii  NaN  NaN              iii  iii  NaN    d            

At present do it with a regular expression that removes surrounding whitespace (regex -> regex):

                            In [112]:                            df              .              replace              (              r              "\s*\.\due south*"              ,              np              .              nan              ,              regex              =              True              )              Out[112]:                                                          a    b    c              0  0    a    a              1  1    b    b              ii  two  NaN  NaN              3  3  NaN    d            

Replace a few dissimilar values (list -> listing):

                            In [113]:                            df              .              replace              ([              "a"              ,              "."              ],              [              "b"              ,              np              .              nan              ])              Out[113]:                                                          a    b    c              0  0    b    b              1  1    b    b              ii  2  NaN  NaN              3  three  NaN    d            

list of regex -> list of regex:

                            In [114]:                            df              .              supercede              ([              r              "\."              ,              r              "(a)"              ],              [              "dot"              ,              r              "\1stuff"              ],              regex              =              True              )              Out[114]:                                                          a       b       c              0  0  astuff  astuff              i  1       b       b              2  2     dot     NaN              3  iii     dot       d            

Only search in column 'b' (dict -> dict):

                            In [115]:                            df              .              supplant              ({              "b"              :              "."              },              {              "b"              :              np              .              nan              })              Out[115]:                                                          a    b    c              0  0    a    a              one  1    b    b              2  2  NaN  NaN              three  three  NaN    d            

Same every bit the previous instance, but use a regular expression for searching instead (dict of regex -> dict):

                            In [116]:                            df              .              replace              ({              "b"              :              r              "\s*\.\s*"              },              {              "b"              :              np              .              nan              },              regex              =              True              )              Out[116]:                                                          a    b    c              0  0    a    a              1  ane    b    b              2  ii  NaN  NaN              3  3  NaN    d            

You can pass nested dictionaries of regular expressions that use regex=Truthful :

                            In [117]:                            df              .              supervene upon              ({              "b"              :              {              "b"              :              r              ""              }},              regex              =              True              )              Out[117]:                                                          a  b    c              0  0  a    a              1  one       b              2  2  .  NaN              3  3  .    d            

Alternatively, you can pass the nested dictionary like and so:

                            In [118]:                            df              .              replace              (              regex              =              {              "b"              :              {              r              "\south*\.\s*"              :              np              .              nan              }})              Out[118]:                                                          a    b    c              0  0    a    a              1  one    b    b              2  2  NaN  NaN              iii  3  NaN    d            

You lot tin can also use the group of a regular expression match when replacing (dict of regex -> dict of regex), this works for lists as well.

                            In [119]:                            df              .              replace              ({              "b"              :              r              "\s*(\.)\due south*"              },              {              "b"              :              r              "\1ty"              },              regex              =              True              )              Out[119]:                                                          a    b    c              0  0    a    a              one  1    b    b              2  2  .ty  NaN              iii  3  .ty    d            

Yous can pass a list of regular expressions, of which those that match will be replaced with a scalar (list of regex -> regex).

                            In [120]:                            df              .              supercede              ([              r              "\s*\.\s*"              ,              r              "a|b"              ],              np              .              nan              ,              regex              =              Truthful              )              Out[120]:                                                          a   b    c              0  0 NaN  NaN              i  1 NaN  NaN              2  two NaN  NaN              three  3 NaN    d            

All of the regular expression examples can also exist passed with the to_replace argument equally the regex argument. In this case the value argument must be passed explicitly by proper name or regex must be a nested dictionary. The previous example, in this example, would and then be:

                            In [121]:                            df              .              replace              (              regex              =              [              r              "\s*\.\s*"              ,              r              "a|b"              ],              value              =              np              .              nan              )              Out[121]:                                                          a   b    c              0  0 NaN  NaN              1  1 NaN  NaN              two  ii NaN  NaN              3  3 NaN    d            

This can be convenient if you do not want to pass regex=True every time you lot want to apply a regular expression.

Annotation

Anywhere in the above supplant examples that you run into a regular expression a compiled regular expression is valid also.

Numeric replacement¶

supervene upon() is similar to fillna() .

                            In [122]:                            df              =              pd              .              DataFrame              (              np              .              random              .              randn              (              10              ,              2              ))              In [123]:                            df              [              np              .              random              .              rand              (              df              .              shape              [              0              ])              >              0.v              ]              =              1.five              In [124]:                            df              .              replace              (              ane.5              ,              np              .              nan              )              Out[124]:                                                          0         1              0 -0.844214 -one.021415              one  0.432396 -0.323580              two  0.423825  0.799180              iii  1.262614  0.751965              four       NaN       NaN              5       NaN       NaN              half dozen -0.498174 -1.060799              7  0.591667 -0.183257              8  i.019855 -1.482465              nine       NaN       NaN            

Replacing more than one value is possible past passing a list.

                            In [125]:                            df00              =              df              .              iloc              [              0              ,              0              ]              In [126]:                            df              .              replace              ([              1.5              ,              df00              ],              [              np              .              nan              ,              "a"              ])              Out[126]:                                                          0         ane              0         a -1.021415              1  0.432396 -0.323580              2  0.423825  0.799180              iii  1.262614  0.751965              4       NaN       NaN              v       NaN       NaN              vi -0.498174 -1.060799              7  0.591667 -0.183257              8  1.019855 -ane.482465              9       NaN       NaN              In [127]:                            df              [              1              ]              .              dtype              Out[127]:                            dtype('float64')            

You tin also operate on the DataFrame in place:

                            In [128]:                            df              .              replace              (              i.5              ,              np              .              nan              ,              inplace              =              Truthful              )            

Missing information casting rules and indexing¶

While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data. Until nosotros can switch to using a native NA type in NumPy, nosotros've established some "casting rules". When a reindexing operation introduces missing information, the Series will exist bandage according to the rules introduced in the table below.

data type

Bandage to

integer

bladder

boolean

object

bladder

no cast

object

no bandage

For case:

                                In [129]:                                s                =                pd                .                Series                (                np                .                random                .                randn                (                5                ),                index                =                [                0                ,                ii                ,                4                ,                half-dozen                ,                seven                ])                In [130]:                                southward                >                0                Out[130]:                                                0    True                2    True                four    Truthful                half-dozen    Truthful                7    Truthful                dtype: bool                In [131]:                                (                s                >                0                )                .                dtype                Out[131]:                                dtype('bool')                In [132]:                                crit                =                (                south                >                0                )                .                reindex                (                listing                (                range                (                8                )))                In [133]:                                crit                Out[133]:                                                0    True                1     NaN                ii    True                3     NaN                four    True                five     NaN                6    True                7    True                dtype: object                In [134]:                                crit                .                dtype                Out[134]:                                dtype('O')              

Unremarkably NumPy will complain if yous effort to use an object array (fifty-fifty if it contains boolean values) instead of a boolean array to get or set values from an ndarray (e.1000. selecting values based on some criteria). If a boolean vector contains NAs, an exception will be generated:

                                In [135]:                                reindexed                =                south                .                reindex                (                list                (                range                (                eight                )))                .                fillna                (                0                )                In [136]:                                reindexed                [                crit                ]                ---------------------------------------------------------------------------                ValueError                                Traceback (near recent phone call terminal)                Input                In                [                136                ],                in                <                module                >                ---->                                i                reindexed                [                crit                ]                File /pandas/pandas/core/serial.py:979,                in                Serial.__getitem__                (cocky, key)                                                976                if                is_iterator                (                key                ):                                                977                key                =                list                (                key                )                -->                                979                if                com                .                is_bool_indexer                (                primal                ):                                                980                fundamental                =                check_bool_indexer                (                self                .                alphabetize                ,                key                )                                                981                key                =                np                .                asarray                (                fundamental                ,                dtype                =                bool                )                File /pandas/pandas/cadre/mutual.py:144,                in                is_bool_indexer                (key)                                                140                na_msg                =                "Cannot mask with not-boolean array containing NA / NaN values"                                                141                if                lib                .                infer_dtype                (                central                )                ==                "boolean"                and                isna                (                key                )                .                whatever                ():                                                142                # Don't heighten on eastward.g. ["A", "B", np.nan], see                                                143                #  test_loc_getitem_list_of_labels_categoricalindex_with_na                -->                                144                raise                ValueError                (                na_msg                )                                                145                return                False                                                146                return                Truthful                ValueError: Cannot mask with non-boolean array containing NA / NaN values              

Withal, these can be filled in using fillna() and it will work fine:

                                In [137]:                                reindexed                [                crit                .                fillna                (                False                )]                Out[137]:                                                0    0.126504                ii    0.696198                four    0.697416                6    0.601516                vii    0.003659                dtype: float64                In [138]:                                reindexed                [                crit                .                fillna                (                True                )]                Out[138]:                                                0    0.126504                1    0.000000                2    0.696198                3    0.000000                4    0.697416                5    0.000000                6    0.601516                7    0.003659                dtype: float64              

pandas provides a nullable integer dtype, but you lot must explicitly request information technology when creating the series or column. Notice that nosotros use a capital "I" in the dtype="Int64" .

                                In [139]:                                s                =                pd                .                Series                ([                0                ,                1                ,                np                .                nan                ,                3                ,                four                ],                dtype                =                "Int64"                )                In [140]:                                s                Out[140]:                                                0       0                ane       1                two    <NA>                three       3                4       4                dtype: Int64              

See Nullable integer data blazon for more.

Experimental NA scalar to denote missing values¶

Alarm

Experimental: the behaviour of pd.NA can still change without alarm.

New in version ane.0.0.

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to correspond scalar missing values. At this moment, it is used in the nullable integer, boolean and defended string data types every bit the missing value indicator.

The goal of pd.NA is provide a "missing" indicator that can exist used consistently beyond data types (instead of np.nan , None or pd.NaT depending on the information type).

For example, when having missing values in a Serial with the nullable integer dtype, it volition use pd.NA :

                            In [141]:                            s              =              pd              .              Series              ([              i              ,              2              ,              None              ],              dtype              =              "Int64"              )              In [142]:                            southward              Out[142]:                                          0       one              1       2              two    <NA>              dtype: Int64              In [143]:                            south              [              2              ]              Out[143]:                            <NA>              In [144]:                            s              [              2              ]              is              pd              .              NA              Out[144]:                            Truthful            

Currently, pandas does not yet utilize those data types by default (when creating a DataFrame or Serial, or when reading in data), so you need to specify the dtype explicitly. An easy style to convert to those dtypes is explained here.

Propagation in arithmetic and comparison operations¶

In general, missing values propagate in operations involving pd.NA . When one of the operands is unknown, the issue of the operation is also unknown.

For example, pd.NA propagates in arithmetic operations, similarly to np.nan :

                                In [145]:                                pd                .                NA                +                1                Out[145]:                                <NA>                In [146]:                                "a"                *                pd                .                NA                Out[146]:                                <NA>              

There are a few special cases when the event is known, even when i of the operands is NA .

                                In [147]:                                pd                .                NA                **                0                Out[147]:                                1                In [148]:                                one                **                pd                .                NA                Out[148]:                                i              

In equality and comparison operations, pd.NA also propagates. This deviates from the behaviour of np.nan , where comparisons with np.nan always render Imitation .

                                In [149]:                                pd                .                NA                ==                1                Out[149]:                                <NA>                In [150]:                                pd                .                NA                ==                pd                .                NA                Out[150]:                                <NA>                In [151]:                                pd                .                NA                <                ii.5                Out[151]:                                <NA>              

To bank check if a value is equal to pd.NA , the isna() role can be used:

                                In [152]:                                pd                .                isna                (                pd                .                NA                )                Out[152]:                                True              

An exception on this basic propagation rule are reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. See above for more.

Logical operations¶

For logical operations, pd.NA follows the rules of the 3-valued logic (or Kleene logic, similarly to R, SQL and Julia). This logic means to only propagate missing values when it is logically required.

For example, for the logical "or" operation ( | ), if i of the operands is Truthful , we already know the event will exist True , regardless of the other value (then regardless the missing value would be True or False ). In this case, pd.NA does not propagate:

                                In [153]:                                True                |                Imitation                Out[153]:                                True                In [154]:                                True                |                pd                .                NA                Out[154]:                                True                In [155]:                                pd                .                NA                |                Truthful                Out[155]:                                Truthful              

On the other mitt, if one of the operands is Simulated , the result depends on the value of the other operand. Therefore, in this case pd.NA propagates:

                                In [156]:                                False                |                True                Out[156]:                                Truthful                In [157]:                                False                |                False                Out[157]:                                Fake                In [158]:                                Simulated                |                pd                .                NA                Out[158]:                                <NA>              

The behaviour of the logical "and" operation ( & ) can exist derived using similar logic (where now pd.NA will not propagate if 1 of the operands is already Fake ):

                                In [159]:                                False                &                True                Out[159]:                                False                In [160]:                                False                &                False                Out[160]:                                Fake                In [161]:                                Faux                &                pd                .                NA                Out[161]:                                Imitation              
                                In [162]:                                True                &                True                Out[162]:                                True                In [163]:                                True                &                False                Out[163]:                                False                In [164]:                                Truthful                &                pd                .                NA                Out[164]:                                <NA>              

NA in a boolean context¶

Since the bodily value of an NA is unknown, it is cryptic to convert NA to a boolean value. The following raises an fault:

                                In [165]:                                bool                (                pd                .                NA                )                ---------------------------------------------------------------------------                TypeError                                Traceback (most recent call concluding)                Input                In                [                165                ],                in                <                module                >                ---->                                1                bool                (                pd                .                NA                )                File /pandas/pandas/_libs/missing.pyx:382,                in                pandas._libs.missing.NAType.__bool__                ()                TypeError: boolean value of NA is cryptic              

This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be pd.NA . In such cases, isna() can exist used to check for pd.NA or condition existence pd.NA can be avoided, for example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in if statements, meet Using if/truth statements with pandas.

NumPy ufuncs¶

pandas.NA implements NumPy'southward __array_ufunc__ protocol. Well-nigh ufuncs work with NA , and generally return NA :

                                In [166]:                                np                .                log                (                pd                .                NA                )                Out[166]:                                <NA>                In [167]:                                np                .                add together                (                pd                .                NA                ,                1                )                Out[167]:                                <NA>              

Warning

Currently, ufuncs involving an ndarray and NA will render an object-dtype filled with NA values.

                                    In [168]:                                    a                  =                  np                  .                  array                  ([                  ane                  ,                  2                  ,                  three                  ])                  In [169]:                                    np                  .                  greater                  (                  a                  ,                  pd                  .                  NA                  )                  Out[169]:                                    array([<NA>, <NA>, <NA>], dtype=object)                

The return type here may modify to render a different array type in the time to come.

See DataFrame interoperability with NumPy functions for more on ufuncs.

Conversion¶

If you have a DataFrame or Serial using traditional types that accept missing data represented using np.nan , at that place are convenience methods convert_dtypes() in Series and convert_dtypes() in DataFrame that tin can catechumen data to use the newer dtypes for integers, strings and booleans listed here. This is especially helpful after reading in data sets when letting the readers such equally read_csv() and read_excel() infer default dtypes.

In this example, while the dtypes of all columns are changed, we evidence the results for the beginning x columns.

                                In [170]:                                bb                =                pd                .                read_csv                (                "data/baseball game.csv"                ,                index_col                =                "id"                )                In [171]:                                bb                [                bb                .                columns                [:                x                ]]                .                dtypes                Out[171]:                                                player    object                twelvemonth       int64                stint      int64                team      object                lg        object                thou          int64                ab         int64                r          int64                h          int64                X2b        int64                dtype: object              
                                In [172]:                                bbn                =                bb                .                convert_dtypes                ()                In [173]:                                bbn                [                bbn                .                columns                [:                ten                ]]                .                dtypes                Out[173]:                                                player    string                year       Int64                stint      Int64                team      string                lg        string                g          Int64                ab         Int64                r          Int64                h          Int64                X2b        Int64                dtype: object              

hoffmanhipbres.blogspot.com

Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

0 Response to "Nicolaescu, L.: the Index of Families of Boundary Value Problems for Dirac Operators"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel