{"id":158100,"date":"2019-10-21T18:38:58","date_gmt":"2019-10-21T22:38:58","guid":{"rendered":"https:\/\/www.countingpips.com\/?p=158100"},"modified":"2019-10-21T18:41:22","modified_gmt":"2019-10-21T22:41:22","slug":"the-pandas-library-for-python","status":"publish","type":"post","link":"https:\/\/www.investmacro.com\/forex\/2019\/10\/the-pandas-library-for-python\/","title":{"rendered":"The Pandas Library for Python"},"content":{"rendered":"<div id=\"inves-1647017119\" class=\"inves-below-title-posts inves-entity-placement\"><div id =\"posts_date_custom\"><div align=\"left\">October 21, 2019<\/div><hr style=\"border: none; border-bottom: 3px solid black;\">\r\n<\/div><\/div><p><strong>By Zachary Wilson for <a href=\"https:\/\/kite.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Kite.com<\/a><\/strong><\/p>\n<div class=\"content-block\">\n<h3>Table of Contents<\/h3>\n<ul>\n<li>Introduction to Pandas<\/li>\n<li>About the Data<\/li>\n<li>Setup<\/li>\n<li>Loading Data<\/li>\n<li>Basic Operations<\/li>\n<li>The Dtype<\/li>\n<li>Cleansing and Transforming Data<\/li>\n<li>Performing Basic Analysis\n<ul>\n<li>Calculations<\/li>\n<li>Booleans<\/li>\n<li>Grouping<\/li>\n<li>Plotting<\/li>\n<\/ul>\n<\/li>\n<li>Exporting Transformed Data<\/li>\n<li>Final Notes<\/li>\n<\/ul>\n<h2><span id=\"introduction\" class=\"blog__contents__anchor\"><\/span>Introduction to Pandas<\/h2>\n<p>So, what is Pandas \u2013 practically speaking? In short, it\u2019s the major data analysis library for Python. For scientists, students, and professional developers alike, Pandas is a central reason to learn and use Python, as opposed to a statistics-specific language like R, or a proprietary academic package like SPSS or Matlab. (Fun fact \u2013 Pandas is named after the term Panel Data, and was originally created for the analysis of financial data tables). 
I like to think that the final \u201cs\u201d stands for Series or Statistics.<\/p>\n<p>Although there are plenty of ways to explore numerical data with Python out of the box, these approaches tend to be slow and require a lot of boilerplate. It may sound hard to believe, but Pandas is often recommended as the next stop for Excel users who are ready to take their data analysis to the next level. Nearly any problem that can be solved with a spreadsheet program can be solved in Pandas \u2013 without all the graphical cruft.<\/p>\n<p>More importantly, because problems are solved in Pandas via Python, the solutions are already scripted and can be automated or run as a service in the cloud. Further, Pandas makes heavy use of Numpy, relying on its low-level calls to compute numerical results orders of magnitude more quickly than they would be handled by Python alone. These are just a few of the reasons Pandas is recommended as one of the first libraries to learn for all Pythonistas, and remains absolutely critical to Data Scientists.<\/p>\n<h2><span id=\"about\" class=\"blog__contents__anchor\"><\/span>About the Data<\/h2>\n<p>In this post, we\u2019re going to be using a fascinating data set to demonstrate a useful slice of the Pandas library. This data set is particularly interesting as it\u2019s part of a real-world example, and we can all imagine people lined up at an airport (a place where things do occasionally go wrong). When looking at the data, I imagine people sitting in those uncomfortable airport seats having just found out that their luggage is missing \u2013 not just temporarily, but it\u2019s nowhere to be found in the system! Or, better yet, imagine that a hardworking TSA employee accidentally broke a precious family heirloom.<\/p>\n<p>So it\u2019s time to fill out another form, of course. 
Now, getting data from forms is an interesting process as far as data gathering is concerned, because each entry is recorded at a specific time. This means we can interpret the entries as a Time Series. Also, because people are submitting the information, we can learn things about a group of people, too.<\/p>\n<p>Back to our example: let\u2019s say we work for the TSA and we\u2019ve been tasked with getting some insights about when these accidents are most likely to happen, and making some recommendations for improving the service.<\/p>\n<p>Pandas, luckily, is a one-stop shop for exploring and analyzing this data set. Feel free to download the Excel file into your project folder to get started, or run the curl command below. Yes, pandas can read .xls or .xlsx files with a single call to\u00a0<code>pd.read_excel()<\/code>! In fact, it\u2019s often helpful for beginners experienced with .csv or Excel files to think about how they would solve a problem in Excel, and then experience how much easier it can be in Pandas.<\/p>\n<p>So, without further ado, open your terminal, a text editor, or your favorite IDE, and take a look for yourself with the guidance below.<\/p>\n<h3>Example data:<\/h3>\n<p>Take, for example, some claims made against the TSA for an injury, loss, or damage during the screening of persons or a passenger\u2019s property. The claims data includes claim number, incident date, claim type, claim amount, status, and disposition.<\/p>\n<p>Directory:\u00a0<a href=\"https:\/\/www.dhs.gov\/tsa-claims-data\" target=\"_blank\" rel=\"noopener noreferrer\">TSA Claims Data<\/a><br \/>\nOur Data Download:\u00a0<a href=\"https:\/\/www.dhs.gov\/sites\/default\/files\/publications\/claims-2014.xls\" target=\"_blank\" rel=\"noopener noreferrer\">claims-2014.xls<\/a><\/p>\n<h2><span id=\"setup\" class=\"blog__contents__anchor\"><\/span>Setup<\/h2>\n<p>To start off, let\u2019s create a clean directory. 
You can put this wherever you\u2019d like, or create a project folder in an IDE. Use your install method of choice to get Pandas: pip is probably the easiest.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\">$ mkdir -p ~\/Desktop\/pandas-tutorial\/data &amp;&amp; cd ~\/Desktop\/pandas-tutorial<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Install pandas along with xlrd for loading Excel formatted files, matplotlib for plotting graphs, and Numpy for high-level mathematical functions.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\">$ pip3 install matplotlib numpy pandas xlrd<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p><strong><i>Optional:<\/i><\/strong>\u00a0download the example data with curl:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\">$ curl -O https:\/\/www.dhs.gov\/sites\/default\/files\/publications\/claims<span class=\"hljs-number\">-2014.<\/span>xls<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Launch Python:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\">$ python3\r\nPython <span class=\"hljs-number\">3.7<\/span><span class=\"hljs-number\">.1<\/span> (default, Nov  <span class=\"hljs-number\">6<\/span> <span class=\"hljs-number\">2018<\/span>, <span class=\"hljs-number\">18<\/span>:<span class=\"hljs-number\">46<\/span>:<span class=\"hljs-number\">03<\/span>)\r\n[Clang <span class=\"hljs-number\">10.0<\/span><span class=\"hljs-number\">.0<\/span> (clang<span class=\"hljs-number\">-1000.11<\/span><span class=\"hljs-number\">.45<\/span><span class=\"hljs-number\">.5<\/span>)] on darwin\r\nType <span class=\"hljs-string\">\"help\"<\/span>, <span class=\"hljs-string\">\"copyright\"<\/span>, <span class=\"hljs-string\">\"credits\"<\/span> <span class=\"hljs-keyword\">or<\/span> <span 
class=\"hljs-string\">\"license\"<\/span> <span class=\"hljs-keyword\">for<\/span> more information.\r\n&gt;&gt;&gt;<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Import packages:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span><span class=\"hljs-keyword\">import<\/span> matplotlib.pyplot <span class=\"hljs-keyword\">as<\/span> plt\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span><span class=\"hljs-keyword\">import<\/span> numpy <span class=\"hljs-keyword\">as<\/span> np\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span><span class=\"hljs-keyword\">import<\/span> pandas <span class=\"hljs-keyword\">as<\/span> pd<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h2><span id=\"loading\" class=\"blog__contents__anchor\"><\/span>Loading Data<\/h2>\n<p>Loading data with Pandas is easy. Pandas can accurately read data from almost any common format including JSON, CSV, and SQL. Data is loaded into Pandas\u2019 \u201cflagship\u201d data structure, the DataFrame.<\/p>\n<p>That\u2019s a term you\u2019ll want to remember. You\u2019ll be hearing a lot about DataFrames. If that term seems confusing \u2013 think about a table in a database, or a sheet in Excel. 
The main point is that there is more than one column: each row or entry has multiple fields which are consistent from one row to the next.<\/p>\n<p>You can load the\u00a0<a href=\"https:\/\/catalog.data.gov\/dataset\/tsa-claims-data-2014\" target=\"_blank\" rel=\"noopener noreferrer\">example data<\/a>\u00a0straight from the web:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df = pd.read_excel(io=<span class=\"hljs-string\">'https:\/\/www.dhs.gov\/sites\/default\/files\/publications\/claims-2014.xls'<\/span>, index_col=<span class=\"hljs-string\">'Claim Number'<\/span>)<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Less coolly, data can be loaded from a file:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\">$ curl -O https:\/\/www.dhs.gov\/sites\/default\/files\/publications\/claims<span class=\"hljs-number\">-2014.<\/span>xls\r\n\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df = pd.read_excel(io=<span class=\"hljs-string\">'claims-2014.xls'<\/span>, index_col=<span class=\"hljs-string\">'Claim Number'<\/span>)<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h2><span id=\"operations\" class=\"blog__contents__anchor\"><\/span>Basic Operations<\/h2>\n<p>Print information about a DataFrame including the index dtype and column dtypes, non-null values, and memory usage. 
<code>DataFrame.info()<\/code> is one of the more useful and versatile methods attached to DataFrames (there are nearly 150!).<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.info()\r\n&lt;class <span class=\"hljs-string\">'pandas.core.frame.DataFrame'<\/span>&gt;\r\nInt64Index: <span class=\"hljs-number\">8855<\/span> entries, <span class=\"hljs-number\">2013081805991<\/span> to <span class=\"hljs-number\">2015012220083<\/span>\r\nData columns (total <span class=\"hljs-number\">10<\/span> columns):\r\nDate Received    <span class=\"hljs-number\">8855<\/span> non-null datetime64[ns]\r\nIncident Date    <span class=\"hljs-number\">8855<\/span> non-null datetime64[ns]\r\nAirport Code     <span class=\"hljs-number\">8855<\/span> non-null object\r\nAirport Name     <span class=\"hljs-number\">8855<\/span> non-null object\r\nAirline Name     <span class=\"hljs-number\">8855<\/span> non-null object\r\nClaim Type       <span class=\"hljs-number\">8855<\/span> non-null object\r\nClaim Site       <span class=\"hljs-number\">8855<\/span> non-null object\r\nItem Category    <span class=\"hljs-number\">8855<\/span> non-null object\r\nClose Amount     <span class=\"hljs-number\">8855<\/span> non-null object\r\nDisposition      <span class=\"hljs-number\">8855<\/span> non-null object\r\ndtypes: datetime64[ns](<span class=\"hljs-number\">2<\/span>), object(<span class=\"hljs-number\">8<\/span>)\r\nmemory usage: <span class=\"hljs-number\">761.0<\/span>+ KB<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>View the first n rows:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.head(n=<span class=\"hljs-number\">3<\/span>)  <span class=\"hljs-comment\"># see also df.tail()<\/span><span class=\"hljs-string\">\r\n    Claim Number Date Received       
Incident Date Airport Code       ...              Claim Site                   Item Category Close Amount      Disposition<\/span><span class=\"hljs-string\">\r\n0  2013081805991    2014-01-13 2012-12-21 00:00:00          HPN       ...         Checked Baggage  Audio\/Video; Jewelry &amp; Watches            0             Deny<\/span><span class=\"hljs-string\">\r\n1  2014080215586    2014-07-17 2014-06-30 18:38:00          MCO       ...         Checked Baggage                               -            0             Deny<\/span><span class=\"hljs-string\">\r\n2  2014010710583    2014-01-07 2013-12-27 22:00:00          SJU       ...         Checked Baggage                    Food &amp; Drink           50  Approve in Full<\/span>\r\n<span class=\"hljs-string\">\r\n[3 rows x 11 columns]<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>List all the columns in the DataFrame:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.columns\r\nIndex([<span class=\"hljs-string\">'Claim Number'<\/span>, <span class=\"hljs-string\">'Date Received'<\/span>, <span class=\"hljs-string\">'Incident Date'<\/span>, <span class=\"hljs-string\">'Airport Code'<\/span>,\r\n       <span class=\"hljs-string\">'Airport Name'<\/span>, <span class=\"hljs-string\">'Airline Name'<\/span>, <span class=\"hljs-string\">'Claim Type'<\/span>, <span class=\"hljs-string\">'Claim Site'<\/span>,\r\n       <span class=\"hljs-string\">'Item Category'<\/span>, <span class=\"hljs-string\">'Close Amount'<\/span>, <span class=\"hljs-string\">'Disposition'<\/span>],\r\n      dtype=<span class=\"hljs-string\">'object'<\/span>)<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Return a single column (important \u2013 also referred to as a\u00a0<strong><i>Series<\/i><\/strong>):<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span 
class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Claim Type'<\/span>].head()\r\n<span class=\"hljs-number\">0<\/span>    Personal Injury\r\n<span class=\"hljs-number\">1<\/span>    Property Damage\r\n<span class=\"hljs-number\">2<\/span>    Property Damage\r\n<span class=\"hljs-number\">3<\/span>    Property Damage\r\n<span class=\"hljs-number\">4<\/span>    Property Damage\r\nName: Claim Type, dtype: object<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Hopefully, you\u2019re starting to get an idea of what claims-2014.xls\u2019s data is all about.<\/p>\n<\/div>\n<div class=\"content-block\">\n<h2><span id=\"dtype\" class=\"blog__contents__anchor\"><\/span>The <code>Dtype<\/code><\/h2>\n<p>Data types are a fundamental concept that you\u2019ll want to have a solid grasp of in order to avoid frustration later. Pandas adopts the nomenclature of Numpy, referring to a column\u2019s data type as its <code>dtype<\/code>. Pandas also attempts to infer <code>dtypes<\/code> upon DataFrame construction (i.e. 
initialization).<\/p>\n<p>To take advantage of the performance boosts intrinsic to Numpy, we need to become familiar with these types, and learn about how they roughly translate to native Python types.<\/p>\n<p>Look again at\u00a0<code>df.info()<\/code>\u00a0and note the <code>dtype<\/code> assigned to each column of our DataFrame:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.info()\r\n&lt;class <span class=\"hljs-string\">'pandas.core.frame.DataFrame'<\/span>&gt;\r\nRangeIndex: <span class=\"hljs-number\">8855<\/span> entries, <span class=\"hljs-number\">0<\/span> to <span class=\"hljs-number\">8854<\/span>\r\nData columns (total <span class=\"hljs-number\">11<\/span> columns):\r\nDate Received    <span class=\"hljs-number\">8855<\/span> non-null datetime64[ns]\r\nIncident Date    <span class=\"hljs-number\">8855<\/span> non-null datetime64[ns]\r\nAirport Code     <span class=\"hljs-number\">8855<\/span> non-null object\r\nAirport Name     <span class=\"hljs-number\">8855<\/span> non-null object\r\nAirline Name     <span class=\"hljs-number\">8855<\/span> non-null object\r\nClaim Type       <span class=\"hljs-number\">8855<\/span> non-null object\r\nClaim Site       <span class=\"hljs-number\">8855<\/span> non-null object\r\nItem Category    <span class=\"hljs-number\">8855<\/span> non-null object\r\nClose Amount     <span class=\"hljs-number\">8855<\/span> non-null object\r\nDisposition      <span class=\"hljs-number\">8855<\/span> non-null object\r\ndtypes: datetime64[ns](<span class=\"hljs-number\">2<\/span>), object(<span class=\"hljs-number\">8<\/span>)\r\nmemory usage: <span class=\"hljs-number\">761.1<\/span>+ KB<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p><code>dtypes<\/code> are analogous to text\/number format settings typical of most spreadsheet applications, and Pandas uses <code>dtypes<\/code> to 
determine which kind(s) of operations may be performed on the data in a specific column. For example, mathematical operations generally require numeric data types such as int64 or float64. Columns containing\u00a0<i>valid<\/i>\u00a0date and\/or time values are assigned the datetime <code>dtype<\/code>, and text and\/or binary data is assigned the catch-all object <code>dtype<\/code>.<\/p>\n<p>In short, Pandas attempts to infer <code>dtypes<\/code> upon DataFrame construction. However, like many data analysis applications, the process isn\u2019t always perfect.<\/p>\n<p>It\u2019s important to note that Pandas <code>dtype<\/code> inference errs on the side of caution: if a Series appears to contain more than one type of data, it\u2019s assigned a catch-all <code>dtype<\/code> of <code>'object'<\/code>. This behavior is less flexible than a typical spreadsheet application and is intended to ensure <code>dtypes<\/code> are not inferred incorrectly, but it also requires the analyst to ensure the data is \u201cclean\u201d after it\u2019s loaded.<\/p>\n<h2><span id=\"cleansing\" class=\"blog__contents__anchor\"><\/span>Cleansing and Transforming Data<\/h2>\n<p>Data is almost always dirty: it almost always contains some datum with atypical formatting, some artifact unique to its medium of origin. Therefore, cleansing data is crucial to ensuring analysis derived therefrom is sound. 
The work of cleansing with Pandas primarily involves identifying and re-casting incorrectly inferred <code>dtypes<\/code>.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.dtypes\r\nDate Received    datetime64[ns]\r\nIncident Date    datetime64[ns]\r\nAirport Code             object\r\nAirport Name             object\r\nAirline Name             object\r\nClaim Type               object\r\nClaim Site               object\r\nItem Category            object\r\nClose Amount             object\r\nDisposition              object\r\ndtype: object<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Looking again at our DataFrame\u2019s <code>dtypes<\/code> we can see that Pandas correctly inferred the <code>dtypes<\/code> of Date Received and Incident Date as datetime64 <code>dtypes<\/code>. Thus, datetime attributes of the column\u2019s data are accessible during operations. For example, to summarize our data by the hour of the day when each incident occurred we can group and summarize our data by the hour element of a datetime64 column to determine which hours of the day certain types of incidents occur.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp = df.groupby(by=df[<span class=\"hljs-string\">'Incident Date'<\/span>].dt.hour)\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp[<span class=\"hljs-string\">'Item Category'<\/span>].describe()\r\n              count unique                   top freq\r\nIncident Date\r\n<span class=\"hljs-number\">0<\/span>              <span class=\"hljs-number\">3421<\/span>    <span class=\"hljs-number\">146<\/span>  Baggage\/Cases\/Purses  <span class=\"hljs-number\">489<\/span>\r\n<span class=\"hljs-number\">1<\/span>                 <span class=\"hljs-number\">6<\/span>      <span class=\"hljs-number\">5<\/span>                
 Other    <span class=\"hljs-number\">2<\/span>\r\n<span class=\"hljs-number\">2<\/span>                <span class=\"hljs-number\">11<\/span>      <span class=\"hljs-number\">9<\/span>                     -    <span class=\"hljs-number\">2<\/span>\r\n<span class=\"hljs-number\">3<\/span>                 <span class=\"hljs-number\">5<\/span>      <span class=\"hljs-number\">5<\/span>     Jewelry &amp; Watches    <span class=\"hljs-number\">1<\/span>\r\n<span class=\"hljs-number\">4<\/span>                <span class=\"hljs-number\">49<\/span>     <span class=\"hljs-number\">18<\/span>  Baggage\/Cases\/Purses    <span class=\"hljs-number\">6<\/span>\r\n<span class=\"hljs-number\">5<\/span>               <span class=\"hljs-number\">257<\/span>     <span class=\"hljs-number\">39<\/span>                     -   <span class=\"hljs-number\">33<\/span>\r\n<span class=\"hljs-number\">6<\/span>               <span class=\"hljs-number\">357<\/span>     <span class=\"hljs-number\">54<\/span>                     -   <span class=\"hljs-number\">43<\/span>\r\n<span class=\"hljs-number\">7<\/span>               <span class=\"hljs-number\">343<\/span>     <span class=\"hljs-number\">43<\/span>              Clothing   <span class=\"hljs-number\">41<\/span>\r\n<span class=\"hljs-number\">8<\/span>               <span class=\"hljs-number\">299<\/span>     <span class=\"hljs-number\">47<\/span>                     -   <span class=\"hljs-number\">35<\/span>\r\n<span class=\"hljs-number\">9<\/span>               <span class=\"hljs-number\">305<\/span>     <span class=\"hljs-number\">41<\/span>                     -   <span class=\"hljs-number\">31<\/span>\r\n<span class=\"hljs-number\">10<\/span>              <span class=\"hljs-number\">349<\/span>     <span class=\"hljs-number\">45<\/span>                 Other   <span class=\"hljs-number\">43<\/span>\r\n<span class=\"hljs-number\">11<\/span>              <span class=\"hljs-number\">343<\/span>     <span 
class=\"hljs-number\">41<\/span>                     -   <span class=\"hljs-number\">45<\/span>\r\n<span class=\"hljs-number\">12<\/span>              <span class=\"hljs-number\">363<\/span>     <span class=\"hljs-number\">51<\/span>                 Other   <span class=\"hljs-number\">41<\/span>\r\n<span class=\"hljs-number\">13<\/span>              <span class=\"hljs-number\">359<\/span>     <span class=\"hljs-number\">55<\/span>                     -   <span class=\"hljs-number\">45<\/span>\r\n<span class=\"hljs-number\">14<\/span>              <span class=\"hljs-number\">386<\/span>     <span class=\"hljs-number\">60<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">49<\/span>\r\n<span class=\"hljs-number\">15<\/span>              <span class=\"hljs-number\">376<\/span>     <span class=\"hljs-number\">51<\/span>                 Other   <span class=\"hljs-number\">41<\/span>\r\n<span class=\"hljs-number\">16<\/span>              <span class=\"hljs-number\">351<\/span>     <span class=\"hljs-number\">43<\/span>  Personal Electronics   <span class=\"hljs-number\">35<\/span>\r\n<span class=\"hljs-number\">17<\/span>              <span class=\"hljs-number\">307<\/span>     <span class=\"hljs-number\">52<\/span>                 Other   <span class=\"hljs-number\">34<\/span>\r\n<span class=\"hljs-number\">18<\/span>              <span class=\"hljs-number\">289<\/span>     <span class=\"hljs-number\">43<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">37<\/span>\r\n<span class=\"hljs-number\">19<\/span>              <span class=\"hljs-number\">241<\/span>     <span class=\"hljs-number\">46<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">26<\/span>\r\n<span class=\"hljs-number\">20<\/span>              <span class=\"hljs-number\">163<\/span>     <span class=\"hljs-number\">31<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">23<\/span>\r\n<span class=\"hljs-number\">21<\/span>              <span 
class=\"hljs-number\">104<\/span>     <span class=\"hljs-number\">32<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">20<\/span>\r\n<span class=\"hljs-number\">22<\/span>              <span class=\"hljs-number\">106<\/span>     <span class=\"hljs-number\">33<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">19<\/span>\r\n<span class=\"hljs-number\">23<\/span>               <span class=\"hljs-number\">65<\/span>     <span class=\"hljs-number\">25<\/span>  Baggage\/Cases\/Purses   <span class=\"hljs-number\">14<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>This works out quite perfectly \u2013 however, note that Close Amount was loaded as an <code>object<\/code>. Words like \u201cAmount\u201d are a good indicator that a column contains numeric values.<\/p>\n<p>Let\u2019s take a look at the values in Close Amount.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Close Amount'<\/span>].head()\r\n<span class=\"hljs-number\">0<\/span>     <span class=\"hljs-number\">0<\/span>\r\n<span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">0<\/span>\r\n<span class=\"hljs-number\">2<\/span>    <span class=\"hljs-number\">50<\/span>\r\n<span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">0<\/span>\r\n<span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">0<\/span>\r\nName: Close Amount, dtype: object<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Those look like numeric values to me. 
So let\u2019s take a look at the other end<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Close Amount'<\/span>].tail()\r\n<span class=\"hljs-number\">8850<\/span>      <span class=\"hljs-number\">0<\/span>\r\n<span class=\"hljs-number\">8851<\/span>    <span class=\"hljs-number\">800<\/span>\r\n<span class=\"hljs-number\">8852<\/span>      <span class=\"hljs-number\">0<\/span>\r\n<span class=\"hljs-number\">8853<\/span>    <span class=\"hljs-number\">256<\/span>\r\n<span class=\"hljs-number\">8854<\/span>      -\r\nName: Close Amount, dtype: object<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>There\u2019s the culprit: index # 8854 is a string value.<\/p>\n<p>If Pandas can\u2019t\u00a0<i>objectively<\/i>\u00a0determine that all of the values contained in a DataFrame column are the same numeric or date\/time <code>dtype<\/code>, it defaults to an object.<\/p>\n<p>Luckily, I know from experience that Excel\u2019s \u201cAccounting\u201d number format typically formats 0.00 as a dash, -.<\/p>\n<p>So how do we fix this? 
Pandas provides a general method, <code>Series.apply<\/code>, which applies any single-argument function to each value of a column (its counterpart <code>DataFrame.apply<\/code> works on whole columns or rows at a time).<\/p>\n<p>In this case, we\u2019ll use it to simultaneously convert the dash (-) to the value it represents in Excel, 0.0, and re-cast the entire column from its initial object <code>dtype<\/code> to its correct <code>dtype<\/code>, float64.<\/p>\n<p>First, we\u2019ll define a new function to perform the conversion:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span><span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">dash_to_zero<\/span><span class=\"hljs-params\">(x)<\/span>:<\/span>\r\n<span class=\"hljs-meta\">... <\/span>    <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-string\">'-'<\/span> <span class=\"hljs-keyword\">in<\/span> str(x):\r\n<span class=\"hljs-meta\">... <\/span>        <span class=\"hljs-keyword\">return<\/span> float()  <span class=\"hljs-comment\"># 0.0<\/span>\r\n<span class=\"hljs-meta\">... <\/span>    <span class=\"hljs-keyword\">else<\/span>:\r\n<span class=\"hljs-meta\">... <\/span>        <span class=\"hljs-keyword\">return<\/span> x  <span class=\"hljs-comment\"># just return the input value as-is<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Then, we\u2019ll apply the function to each value of Close Amount:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Close Amount'<\/span>] = df[<span class=\"hljs-string\">'Close Amount'<\/span>].apply(dash_to_zero)\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Close Amount'<\/span>].dtype\r\ndtype(<span class=\"hljs-string\">'float64'<\/span>)<\/code><\/pre>\n<\/div>\n<div 
class=\"content-block\">\n<p>These two steps can also be combined into a single-line operation using Python\u2019s lambda:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[<span class=\"hljs-string\">'Close Amount'<\/span>].apply(<span class=\"hljs-keyword\">lambda<\/span> x: <span class=\"hljs-number\">0.<\/span> <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-string\">'-'<\/span> <span class=\"hljs-keyword\">in<\/span> str(x) <span class=\"hljs-keyword\">else<\/span> x)<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h2><span id=\"performing\" class=\"blog__contents__anchor\"><\/span>Performing Basic Analysis<\/h2>\n<p>Once you\u2019re confident that your dataset is \u201cclean,\u201d you\u2019re ready for some data analysis! Aggregation is the process of getting summary data that may be more useful than the finely grained values we are given to start with.<\/p>\n<h3><span id=\"calculations\" class=\"blog__contents__anchor\"><\/span>Calculations<\/h3>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.sum()\r\nClose Amount    <span class=\"hljs-number\">538739.51<\/span>\r\ndtype: float64\r\n\r\n\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.min()\r\nDate Received              <span class=\"hljs-number\">2014<\/span><span class=\"hljs-number\">-01<\/span><span class=\"hljs-number\">-01<\/span> <span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>\r\nIncident Date              <span class=\"hljs-number\">2011<\/span><span class=\"hljs-number\">-08<\/span><span class=\"hljs-number\">-24<\/span> <span class=\"hljs-number\">08<\/span>:<span class=\"hljs-number\">30<\/span>:<span class=\"hljs-number\">00<\/span>\r\nAirport Code                                 -\r\nAirport Name      Albert J 
Ellis, Jacksonville\r\nAirline Name                                 -\r\nClaim Type                                   -\r\nClaim Site                                   -\r\nItem Category                                -\r\nClose Amount                                 <span class=\"hljs-number\">0<\/span>\r\nDisposition                                  -\r\n\r\n\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.max()\r\nDate Received                       <span class=\"hljs-number\">2014<\/span><span class=\"hljs-number\">-12<\/span><span class=\"hljs-number\">-31<\/span> <span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>\r\nIncident Date                       <span class=\"hljs-number\">2014<\/span><span class=\"hljs-number\">-12<\/span><span class=\"hljs-number\">-31<\/span> <span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>:<span class=\"hljs-number\">00<\/span>\r\nAirport Code                                        ZZZ\r\nAirport Name                 Yuma International Airport\r\nAirline Name                                 XL Airways\r\nClaim Type                              Property Damage\r\nClaim Site                                        Other\r\nItem Category    Travel Accessories; Travel Accessories\r\nClose Amount                                    <span class=\"hljs-number\">25483.4<\/span>\r\nDisposition                                      Settle\r\ndtype: object<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h3><span id=\"booleans\" class=\"blog__contents__anchor\"><\/span>Booleans<\/h3>\n<p>Find all of the rows where <code>Close Amount<\/code> is greater than zero. 
This is helpful because we\u2019d like to see some patterns where the amount is actually positive, and it shows how conditional operators work.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df[df[<span class=\"hljs-string\">'Close Amount'<\/span>] &gt; <span class=\"hljs-number\">0<\/span>].describe()\r\n       Close Amount\r\ncount   <span class=\"hljs-number\">2360.000000<\/span>\r\nmean     <span class=\"hljs-number\">228.279453<\/span>\r\nstd      <span class=\"hljs-number\">743.720179<\/span>\r\nmin        <span class=\"hljs-number\">1.250000<\/span>\r\n<span class=\"hljs-number\">25<\/span>%       <span class=\"hljs-number\">44.470000<\/span>\r\n<span class=\"hljs-number\">50<\/span>%      <span class=\"hljs-number\">100.000000<\/span>\r\n<span class=\"hljs-number\">75<\/span>%      <span class=\"hljs-number\">240.942500<\/span>\r\nmax    <span class=\"hljs-number\">25483.440000<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h3><span id=\"grouping\" class=\"blog__contents__anchor\"><\/span>Grouping<\/h3>\n<p>In this example, we\u2019ll walk through how to group by a single column\u2019s values.<\/p>\n<p>The GroupBy object is an intermediate step that allows us to aggregate over several rows that share something in common \u2013 in this case, the disposition value. This is useful because we get a bird\u2019s-eye view of different categories of data. 
Ultimately, we use <code>describe()<\/code> to see several aggregates at once.<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp = df.groupby(by=<span class=\"hljs-string\">'Disposition'<\/span>)\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp.describe()\r\n                Close Amount\r\n                       count        mean          std   min       <span class=\"hljs-number\">25<\/span>%      <span class=\"hljs-number\">50<\/span>%       <span class=\"hljs-number\">75<\/span>%       max\r\nDisposition\r\n-                     <span class=\"hljs-number\">3737.0<\/span>    <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>  <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>    <span class=\"hljs-number\">0.000<\/span>    <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\nApprove <span class=\"hljs-keyword\">in<\/span> Full       <span class=\"hljs-number\">1668.0<\/span>  <span class=\"hljs-number\">158.812116<\/span>   <span class=\"hljs-number\">314.532028<\/span>  <span class=\"hljs-number\">1.25<\/span>   <span class=\"hljs-number\">32.9625<\/span>   <span class=\"hljs-number\">79.675<\/span>  <span class=\"hljs-number\">159.3375<\/span>   <span class=\"hljs-number\">6183.36<\/span>\r\nDeny                  <span class=\"hljs-number\">2758.0<\/span>    <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>  <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>    <span class=\"hljs-number\">0.000<\/span>    <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\nSettle                 <span class=\"hljs-number\">692.0<\/span>  <span class=\"hljs-number\">395.723844<\/span>  <span class=\"hljs-number\">1268.818458<\/span>  <span 
class=\"hljs-number\">6.00<\/span>  <span class=\"hljs-number\">100.0000<\/span>  <span class=\"hljs-number\">225.000<\/span>  <span class=\"hljs-number\">425.6100<\/span>  <span class=\"hljs-number\">25483.44<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<p>Group by multiple columns:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp = df.groupby(by=[<span class=\"hljs-string\">'Disposition'<\/span>, <span class=\"hljs-string\">'Claim Site'<\/span>])\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>grp.describe()\r\n                                Close Amount\r\n                                       count         mean          std     min       <span class=\"hljs-number\">25<\/span>%       <span class=\"hljs-number\">50<\/span>%        <span class=\"hljs-number\">75<\/span>%       max\r\nDisposition     Claim Site\r\n-               -                       <span class=\"hljs-number\">34.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Bus Station              <span class=\"hljs-number\">2.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Checked Baggage       <span class=\"hljs-number\">2759.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    
<span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Checkpoint             <span class=\"hljs-number\">903.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Motor Vehicle           <span class=\"hljs-number\">28.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Other                   <span class=\"hljs-number\">11.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\nApprove <span class=\"hljs-keyword\">in<\/span> Full Checked Baggage       <span class=\"hljs-number\">1162.0<\/span>   <span class=\"hljs-number\">113.868072<\/span>   <span class=\"hljs-number\">192.166683<\/span>    <span class=\"hljs-number\">1.25<\/span>   <span class=\"hljs-number\">25.6600<\/span>    <span class=\"hljs-number\">60.075<\/span>   <span class=\"hljs-number\">125.9825<\/span>   <span class=\"hljs-number\">2200.00<\/span>\r\n                Checkpoint             <span class=\"hljs-number\">493.0<\/span>   <span class=\"hljs-number\">236.643367<\/span>   
<span class=\"hljs-number\">404.707047<\/span>    <span class=\"hljs-number\">8.95<\/span>   <span class=\"hljs-number\">60.0000<\/span>   <span class=\"hljs-number\">124.000<\/span>   <span class=\"hljs-number\">250.1400<\/span>   <span class=\"hljs-number\">6183.36<\/span>\r\n                Motor Vehicle            <span class=\"hljs-number\">9.0<\/span>  <span class=\"hljs-number\">1591.428889<\/span>  <span class=\"hljs-number\">1459.368190<\/span>  <span class=\"hljs-number\">493.80<\/span>  <span class=\"hljs-number\">630.0000<\/span>   <span class=\"hljs-number\">930.180<\/span>  <span class=\"hljs-number\">1755.9800<\/span>   <span class=\"hljs-number\">5158.05<\/span>\r\n                Other                    <span class=\"hljs-number\">4.0<\/span>   <span class=\"hljs-number\">398.967500<\/span>   <span class=\"hljs-number\">358.710134<\/span>   <span class=\"hljs-number\">61.11<\/span>  <span class=\"hljs-number\">207.2775<\/span>   <span class=\"hljs-number\">317.385<\/span>   <span class=\"hljs-number\">509.0750<\/span>    <span class=\"hljs-number\">899.99<\/span>\r\nDeny            -                        <span class=\"hljs-number\">4.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Checked Baggage       <span class=\"hljs-number\">2333.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Checkpoint             <span 
class=\"hljs-number\">407.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Motor Vehicle            <span class=\"hljs-number\">1.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>          NaN    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\n                Other                   <span class=\"hljs-number\">13.0<\/span>     <span class=\"hljs-number\">0.000000<\/span>     <span class=\"hljs-number\">0.000000<\/span>    <span class=\"hljs-number\">0.00<\/span>    <span class=\"hljs-number\">0.0000<\/span>     <span class=\"hljs-number\">0.000<\/span>     <span class=\"hljs-number\">0.0000<\/span>      <span class=\"hljs-number\">0.00<\/span>\r\nSettle          Checked Baggage        <span class=\"hljs-number\">432.0<\/span>   <span class=\"hljs-number\">286.271968<\/span>   <span class=\"hljs-number\">339.487254<\/span>    <span class=\"hljs-number\">7.25<\/span>   <span class=\"hljs-number\">77.0700<\/span>   <span class=\"hljs-number\">179.995<\/span>   <span class=\"hljs-number\">361.5700<\/span>   <span class=\"hljs-number\">2500.00<\/span>\r\n                Checkpoint             <span class=\"hljs-number\">254.0<\/span>   <span class=\"hljs-number\">487.173031<\/span>  <span class=\"hljs-number\">1620.156849<\/span>    <span class=\"hljs-number\">6.00<\/span>  <span class=\"hljs-number\">166.9250<\/span>   <span class=\"hljs-number\">281.000<\/span>   <span class=\"hljs-number\">496.3925<\/span>  <span class=\"hljs-number\">25483.44<\/span>\r\n                Motor 
Vehicle            <span class=\"hljs-number\">6.0<\/span>  <span class=\"hljs-number\">4404.910000<\/span>  <span class=\"hljs-number\">7680.169379<\/span>  <span class=\"hljs-number\">244.00<\/span>  <span class=\"hljs-number\">841.8125<\/span>  <span class=\"hljs-number\">1581.780<\/span>  <span class=\"hljs-number\">2215.5025<\/span>  <span class=\"hljs-number\">20000.00<\/span><\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h3><span id=\"plotting\" class=\"blog__contents__anchor\"><\/span>Plotting<\/h3>\n<p>While aggregating groups of data is one of the best ways to get insights, visualizing data lets patterns jump off the page, and is more approachable for those who aren\u2019t as familiar with aggregate values. Properly formatted visualizations are critical to communicating meaning in the data, and it\u2019s nice to see that Pandas has some of these functions out of the box:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.plot(x=<span class=\"hljs-string\">'Incident Date'<\/span>, y=<span class=\"hljs-string\">'Close Amount'<\/span>)\r\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>plt.show()<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<div class=\"blog__image__center\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-695 size-large ls-is-cached lazyloaded\" src=\"https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-1024x768.png\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" srcset=\"https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-1024x768.png 1024w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-300x225.png 300w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-768x576.png 768w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-800x600.png 800w, 
https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-352x264.png 352w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651.png 1280w\" alt=\"\" width=\"1024\" height=\"768\" data-src=\"\/wp-content\/uploads\/2019\/03\/image1.c9bae651-1024x768.png\" data-srcset=\"https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-1024x768.png 1024w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-300x225.png 300w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-768x576.png 768w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-800x600.png 800w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651-352x264.png 352w, https:\/\/kite.com\/wp-content\/uploads\/2019\/03\/image1.c9bae651.png 1280w\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/div>\n<p>Incident Date by Close Amount<\/p>\n<\/div>\n<div class=\"content-block\">\n<h2><span id=\"exporting\" class=\"blog_contents_anchors\"><\/span>Exporting Transformed Data<\/h2>\n<p>Finally, we may need to write either our original data or the aggregates, as a DataFrame, to a file format different from the one we started with. Pandas does not limit you to writing back out to the same file format.<\/p>\n<p>The most common flat file to write to from Pandas will be the .csv. From the visualization, it looks like the cost of TSA claims, while occasionally very high due to some outliers, is improving in 2015. We should probably recommend comparing staffing and procedural changes to continue in that direction, and explore in more detail why we have more incidents at certain times of day.<\/p>\n<p>Like loading data, Pandas offers a number of methods for writing your data to file in various formats. Writing back to an Excel file is slightly more involved than the others, so let\u2019s write to an even more portable format: CSV. 
To write your transformed dataset to a new CSV file:<\/p>\n<\/div>\n<div class=\"code-block python\">\n<pre><code class=\"Python hljs livecodeserver\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>df.to_csv(path_or_buf=<span class=\"hljs-string\">'claims-2014.v1.csv'<\/span>)<\/code><\/pre>\n<\/div>\n<div class=\"content-block\">\n<h2>Final Notes<\/h2>\n<p>Here we\u2019ve seen a workflow that is both interesting and powerful. We\u2019ve taken a round-trip all the way from a\u00a0<i>government Excel file<\/i>, into Python, through some fairly powerful data visualization, and back to a .csv file which could be more universally accessed\u2013all through the power of Pandas. Further, we\u2019ve covered the three central objects in Pandas \u2013 DataFrames, Series, and <code>dtypes<\/code>. Best of all, we have a\u00a0<i>deeper understanding<\/i>\u00a0of an interesting, real-world data set.<\/p>\n<p>These are the core concepts to understand when working with Pandas, and now you can ask intelligent questions (of yourself, or of Google) about these different objects. This TSA data use case has shown us exactly what Pandas is good for: the exploration, analysis, and aggregation of data to draw conclusions.<\/p>\n<p>The analysis and exploration of data is important in practically any field, but it is especially useful to Data Scientists and AI professionals who may need to crunch and clean data in very specific, finely grained ways, like getting moving averages on stock ticks. Additionally, certain tasks may need to be automated, and this could prove difficult or expensive in sprawling applications like Excel or Google Sheets, which may not offer all the functionality of Pandas with the full power of Python.<\/p>\n<p>Just imagine telling a business administrator that they may never have to run that broken spreadsheet macro ever again! Once analysis is automated, it can be deployed as a service or applied to hundreds of thousands of records streaming from a database. 
Alternatively, Pandas could be used to make critical decisions after establishing statistical associations between patterns, as indeed it is every day.<\/p>\n<p>Next, be sure to check out Python\u2019s extensive database libraries (e.g. SQLAlchemy), or API clients (like the Google Sheets\/Slides Python Client or Airtable API) to put your results in front of domain experts. The possibilities are endless, and are only enhanced by Python\u2019s mature libraries and active community.<\/p>\n<p class=\"blog__content--footer\">This post is a part of Kite\u2019s new series on Python. You can check out the code from this and other posts on our\u00a0<a href=\"https:\/\/github.com\/kiteco\/kite-python-blog-post-code\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<\/a>.<\/p>\n<p><a href=\"https:\/\/kite.com\/blog\/python\/pandas-tutorial\/\" target=\"_blank\" rel=\"noopener noreferrer\">This article<\/a> originally appeared on <a href=\"https:\/\/kite.com\" target=\"_blank\" rel=\"noopener noreferrer\">Kite.com<\/a> (Reprinted with permission)<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Zachary Wilson for Kite.com Tables of Contents Introduction to Pandas About the Data Setup Loading Data Basic Operations The Dtype Cleansing and Transforming Data Performing Basic Operations Calculations Booleans Grouping Plotting Exporting Transformed Data Final Notes Introduction to Pandas So, what is Pandas \u2013 practically speaking? 
In short, it\u2019s the major data analysis library [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":157382,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-158100","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/posts\/158100","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/comments?post=158100"}],"version-history":[{"count":1,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/posts\/158100\/revisions"}],"predecessor-version":[{"id":158101,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/posts\/158100\/revisions\/158101"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/media\/157382"}],"wp:attachment":[{"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/media?parent=158100"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/categories?post=158100"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.investmacro.com\/forex\/wp-json\/wp\/v2\/tags?post=158100"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}