Posts in Category: Excel

Data Scraping for a Beach Vacation

With the proliferation of different ways to rent a beach house, finding the ideal property for a week at the beach has become an exhausting exercise: researching multiple local beach rental agencies, then checking the various owner-managed listings on sites like VRBO and Airbnb. The filter conditions on each site vary, so searching for properties in a particular area (a map-based search) sometimes isn't available. The same is true for filtering on desirable options, be it a pool, beach proximity, bedding, a grill, etc.

The objective is to scrape the data from the various websites using free tools, consolidate the results into a spreadsheet or CSV file, then use Tableau to filter geographically as well as by amenities, price, location, etc.

Criteria

  • Emerald Isle, NC
  • Week of June 13 – June 20 (Saturday check-in/check-out)
  • 4 bedrooms, with sleeping for three children under age 6 and three couples

Websites to consider:

  • Airbnb, Booking.com, VRBO (all these seem to already be consolidated by www.vacationrenter.com)
  • www.hometogo.com (which seems to consolidate Airbnb, Booking.com, VRBO, HomeAway, VacationRentals.com, TripAdvisor, e-domizil)
  • EmeraldIsleRealty.com
  • SunSurfRealty.com
  • www.bluewaternc.com/crystal-coast-vacation-rentals/

Web Scrapers

The free web scrapers seem straightforward to configure (and have similar interfaces), so learning one scraper carries over to using the others. A different scraper was chosen whenever the previous one didn't work for the next site. Hoping that one scraper would work on every site seems unattainable, but I also wasn't going to spend considerable time figuring out the peculiarities of why a chosen scraper failed (although I did spend some time going through tutorials to get comfortable that the approaches I was taking should have worked). I wanted to avoid writing code, so the scrapers I considered had to be suitable for non-programmers.

Summary of scraping efforts (once a site had useful results, other approaches were not attempted, so the comparison has lots of blanks). The scrapers considered:

  • www.Import.io (requires Chrome, Firefox, or Safari)
  • www.Octoparse.com
  • www.parsehub.com
  • www.webscraper.io (Chrome extension)
  • Google Sheets (using the IMPORTHTML or IMPORTXML functions)
  • Colab (Python in the cloud; see the sketch below)
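For the Colab option, here is a minimal sketch of what a hand-rolled Python scrape might look like. The URL and CSS selectors are hypothetical placeholders, not the actual markup of any of the rental sites, and AJAX-heavy sites would need a browser-automation approach instead.

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; the real rental sites use different URLs and markup.
URL = "https://example-rentals.com/search?location=emerald-isle-nc"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
# The CSS classes below are placeholders; inspect the real page with the F12 tools to find the actual ones.
for card in soup.select("div.property-card"):
    name = card.select_one("h2.property-name")
    price = card.select_one("span.weekly-rate")
    link = card.select_one("a")
    rows.append({
        "name": name.get_text(strip=True) if name else "",
        "price": price.get_text(strip=True) if price else "",
        "url": link["href"] if link else "",
    })

with open("properties.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(rows)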

Consolidating data

The data from the various scrapers, while similar, had column variations: some included extra columns of URL links, some placed all the amenities in one cell, and sometimes the element name and its value landed in the same cell. Getting the data into a consistent representation was done either in Excel or with Tableau Prep.
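As one illustration of that conforming step, here is a rough pandas sketch; the file names, column names, and label text are made up for the example and would differ for each scraper's export.

import pandas as pd

# Hypothetical exports from two of the scrapers; the column names differ per tool.
sunsurf = pd.read_csv("sunsurf_webscraperio.csv")
bluewater = pd.read_csv("bluewater_octoparse.csv")

# Rename each file's columns to one common set before stacking them.
sunsurf = sunsurf.rename(columns={"Property": "name", "Rate": "weekly_price"})
bluewater = bluewater.rename(columns={"listing-title": "name", "price-text": "weekly_price"})

combined = pd.concat([sunsurf, bluewater], ignore_index=True, sort=False)

# Example cleanup: strip an element label that was scraped into the same cell as its value.
combined["weekly_price"] = (
    combined["weekly_price"]
    .astype(str)
    .str.replace("Weekly Rate:", "", regex=False)
    .str.strip()
)

combined.to_csv("all_properties.csv", index=False)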

Adding missing Geocode data

The data from the websites had street addresses, and while the websites had map representations of the search results, those maps didn't expose the latitude and longitude used for each property.
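The addresses were actually sent to the Texas A&M geocoding service described below. Purely as an illustration of the same step, here is a sketch using the geopy library with OpenStreetMap's Nominatim geocoder; that choice, along with the file and column names, is an assumption for the example and not what was used.

import time
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="beach-rental-demo")  # identify the app per the service's usage policy
properties = pd.read_csv("all_properties.csv")          # hypothetical file with an 'address' column

lats, lons = [], []
for address in properties["address"]:
    location = geolocator.geocode(f"{address}, Emerald Isle, NC")
    lats.append(location.latitude if location else None)
    lons.append(location.longitude if location else None)
    time.sleep(1)  # stay within the free geocoder's rate limit

properties["latitude"] = lats
properties["longitude"] = lons
properties.to_csv("all_properties_geocoded.csv", index=False)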

Data Fields (as received)

The data model I've settled on requires splitting the data so that one table has a single row for each property, and fields that can hold multiple items (like amenities) become a second table with two columns: the property and the item. This setup showcases the new Tableau data model and forgoes the need for many LOD calculations.
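A sketch of that split in pandas, assuming hypothetical column names where the amenities arrive as one delimited cell per property:

import pandas as pd

scraped = pd.read_csv("all_properties_geocoded.csv")  # hypothetical consolidated file

# Table 1: a single row per property, without the multi-valued amenities cell.
rental_properties = scraped.drop(columns=["amenities"])

# Table 2: two columns, the property and one amenity per row.
amenities = (
    scraped[["property_id", "amenities"]]
    .assign(amenity=lambda df: df["amenities"].str.split(","))
    .explode("amenity")
    .drop(columns=["amenities"])
)
amenities["amenity"] = amenities["amenity"].str.strip()

rental_properties.to_csv("RentalProperties.csv", index=False)
amenities.to_csv("PropertyAmenities.csv", index=False)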

Experience

SunSurf – Parsehub parsed the data well, and Texas A&M's service was used to geocode. But after reviewing the results, I had to re-parse the list with Webscraper.io to fill in missing data. Webscraper.io turned out to be the better option for this site.

BlueWater – Since Webscraper.io worked well for SunSurf, I started there, but it refused to return more than 7 rows of data (it would get stuck on the opening page, and when AJAX loaded the list of properties, either my selection process or the tool itself got lost). BlueWater exposes a list of property names with URLs to those properties, so if I could get that list exposed via a URL, I could use another tool to extract the data from it. I created my own version of the exposed list by using the F12 developer tools to copy the HTML that held the property names, then loaded that HTML to a website so Parsehub could grab the list of property names and their associated URLs into an Excel sheet. When clicking on a listed name, the URL in the browser changed to one that included the property name, so I used that pattern to build the list of URLs for all the properties.

That URL list was then fed into Octoparse, which permits loading a series of up to 10,000 web URLs to scrape. However, Octoparse's free tier is throttled, so the 555 lines took 5 hours and 17 minutes to parse. The result did provide street addresses, though again not quite clean: there was a carriage return (no space) before the city name, so getting the street address, city, and state onto one line required the SUBSTITUTE function in Excel before the further splitting needed for geocoding. Once parsed, many rows had duplicate addresses, as would be expected for condo complexes. The duplicates were removed to get a unique list of addresses to submit to Texas A&M Geocoding, which geocoded the 385 records in a few minutes.
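A Python equivalent of that SUBSTITUTE cleanup and de-duplication, sketched with hypothetical file and column names (the actual work was done in Excel):

import pandas as pd

bluewater = pd.read_csv("bluewater_octoparse.csv")  # hypothetical Octoparse export with an 'address' column

# Replace the stray carriage return before the city name with ", " so the whole address sits on one line.
bluewater["address"] = (
    bluewater["address"]
    .astype(str)
    .str.replace("\r\n", ", ", regex=False)
    .str.replace("\r", ", ", regex=False)
    .str.replace("\n", ", ", regex=False)
)

# Condo complexes share a street address, so geocode each unique address only once.
unique_addresses = bluewater[["address"]].drop_duplicates()
unique_addresses.to_csv("bluewater_addresses_for_geocoding.csv", index=False)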

Emerald Isle Realty – Since Octoparse worked well for BlueWater, I figured I'd start with that one. Caution: I didn't spend a lot of time trying to figure out whether a tool would work; if it didn't work with limited instruction, I just moved on to the next tool. On this site, the street address was not a unique element on the page but was sometimes buried in the description, so I used Excel to split out the street address. Some descriptions lacked an address but did expose a mapped location on the website. While the site had a 'mapped' location for a property, the actual latitude/longitude pair used to place the marker on the map couldn't be found for scraping. So for these properties I manually opened the property URL alongside a second browser window opened to Google Maps. By visually matching Google Maps to the property website's map, I used Google Maps to obtain the street address and lat/long coordinates. This process got quite tiresome, but fortunately there weren't that many properties with a missing address.

BlueWater Footnote – After scraping the Emerald Isle Realty data, I noticed I didn't get the image URL in my first attempt. Import.io worked OK for the Emerald Isle properties and was the easiest to set up (it figures out what you are likely wanting and builds the scraping model for you, which you can then adjust to add or remove columns). When I used it the second time, I figured out that to 'train' the Import.io process, choosing similar URLs is absolutely necessary for it to scrape properly. Also, its auto process picked up a bunch of URLs very quickly and then presented a checkbox list of which groups of URLs were to be included in the scraping loop. There was a choice to pick up only the rental-property-related URLs, and when I selected only that subset, the training went exceptionally well, as did the results. The downside of Import.io: only 1,000 web pages can be scraped per login per month, so after running it once incorrectly against 600+ pages (I had two groups selected, so Import.io built a model based on one web page that didn't work for almost all of the others), that was all I could do this month using that login identity.

Final Cleanup

Whew! I got the data downloaded, and even though I was hoping the data would be similarly structured, it wasn't consistent between sites. So the next step was to decide on a data table structure (data model) and load the scraped data into that model. This meant leaving out some data that was unique to one site, since the other sites didn't have the same info (like customer or agency ratings and customer reviews).

Some sites listed a few amenities in a scrapable area, which was nice, except that they listed more amenities in their long description of the property. How amenities were identified meant that the data had to be conformed; that is, 'ping-pong table' and 'ping pong table' would need to be adjusted to read one way or the other. Amenities by property are listed in their own tab. The idea is that some amenities may be considered a requirement (like pets allowed), while others are a nice-to-have. Those amenities commonly considered a requirement have a separate flag counter field in the RentalProperties tab: pets, pool, grill, game room, elevator, pier/dock, community pool, hot tub, and boats allowed.
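As a sketch of that conforming step (the synonym spellings, column names, and file names are examples, not the actual lists used):

import pandas as pd

amenities = pd.read_csv("PropertyAmenities.csv")  # hypothetical two-column amenity table

# Map variant spellings onto one agreed-upon label.
synonyms = {
    "ping-pong table": "ping pong table",
    "bbq grill": "grill",
    "swimming pool": "pool",
}
amenities["amenity"] = amenities["amenity"].str.lower().str.strip().replace(synonyms)

# Build 0/1 flags for the 'requirement' amenities kept on the RentalProperties tab.
required = ["pets", "pool", "grill", "game room", "elevator",
            "pier/dock", "community pool", "hot tub", "boats allowed"]
flags = (
    amenities[amenities["amenity"].isin(required)]
    .assign(flag=1)
    .pivot_table(index="property_id", columns="amenity", values="flag", aggfunc="max", fill_value=0)
)
flags.to_csv("RequiredAmenityFlags.csv")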

Prices – Some sites had prices for various rental options (weekly, two-week, or monthly rates, all different for different rental date ranges). Others had just weekly rates for different ranges. I was only interested in a particular week, so that week's rate is included as a column in the main RentalProperties tab. Even then, scraping didn't pick up a price for some properties, and a manual verification of those properties revealed three conditions:

  • The property is no longer listed by the rental agency (it was listed when the property scrape occurred, but no longer existed a few weeks later when the price verification was being done).
  • The property is listed, but prices aren't; instead, 'Advance Hold' appears where the price was expected. I figure these are properties that were listed in the past but haven't signed the contract for this year yet.
  • The property is listed, but a note says no weekly rentals are available. This likely occurs when the property has been rented by the month, or because the contract hasn't been signed for the period.

All of these non-priced properties are included, but no pricing is available.

After sending the data for geocoding, there were many properties with the same street address (condominiums, for instance). I wanted Tableau to map those properties as separate marks so that when hovering, a person could get individual property information in a tooltip rather than a * indicating multiple values. So, in the data, I modified the lat/long by 5/1,000,000 of a degree to nudge each mark enough to make it separately identifiable.
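The nudge was applied in the spreadsheet; as an illustration only, the same adjustment could be scripted in Python along these lines (file and column names assumed):

import pandas as pd

props = pd.read_csv("RentalProperties.csv")  # hypothetical geocoded property table

NUDGE = 5 / 1_000_000  # 0.000005 degrees, enough to separate marks without visibly moving them

# Number the properties sharing a lat/long pair 0, 1, 2, ... and shift each successive one a little further.
dup_rank = props.groupby(["latitude", "longitude"]).cumcount()
props["latitude"] = props["latitude"] + dup_rank * NUDGE

props.to_csv("RentalProperties_nudged.csv", index=False)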

Recap - Recommendations - Results shared

I started with what I thought would be easiest: the Webscraper.io extension for Google Chrome. While this was useful, it also revealed that some sites resist this type of scraper, because it needs to build a list of URLs on the fly from the scraping activity itself. If that can't be accomplished, another tool or approach is needed. Parsehub and Octoparse are similar, but not the same. The interface on Parsehub took some getting used to, and quite frankly, I found myself backing out what I had clicked to try again. Once you figure out its very distinct clicking pattern for identifying the page elements to scrape, the tool works well. It too, however, wants to build the list of URLs to scrape itself, so when you already have a long list of URLs, Octoparse was the preferred tool. Octoparse allows a long list of URLs to be copy/pasted into the tool, then uses one of those URLs for you to set up the elements to scrape. I also tried Google Sheets, but it seemed not to fit my particular need (again, I didn't spend much time there, only enough to figure out it didn't get what I wanted with what I knew about the website).

You'll likely notice that Airbnb, Vrbo, and similar sites aren't part of the results. It isn't that we forgot about them; rather, our limited knowledge of how these tools work, combined with those sites' extensive use of AJAX, foiled our initial attempts, and I ran out of patience and time. I even sent the request to another person who applied Python to the HometoGo website, only to come away equally frustrated. I'll likely look into getting this data into the dataset eventually, but not in this first exposure of the results.

I shared the results with the Piedmont Triad Tableau User Group on February 6th, 2020, showing them this PowerPoint presentation (click here to download). The workbook has been loaded to Tableau Public for consumption and looks like this (click on the image to go to the active workbook):

Next Steps:

My son wants to know whether renting early gets either a better price or a better selection. I'll check again in a few months to compare the results (which properties are available and what the prices are at that time). Also, I'd like to get the missing VRBO, Airbnb, etc. properties into the data.


Tableau - Total one dimension in separate column

Well, many of us come to Tableau after being frustrated with Excel.  Then, to our surprise, Excel does some things easily that Tableau doesn't seem to handle well.  There are lots of posts about how to create subtotals and grand totals customized to a particular need.  However, the problem I had was that I wanted the totals for a particular dimension to show as a single column; that is, using something like Superstore Sales, I want a crosstab that has Region as the first discrete column dimension, then YEAR(Ship Date) as the sub-column.  Relatively simple, until I wanted the order of the columns to look like this:
 

| Total                    | Central                  | East                     |
| 2011 2012 2013 2014 2015 | 2011 2012 2013 2014 2015 | 2011 2012 2013 2014 2015 |

I only show two of the Regions, but the idea remains -- how can I get the Years' Total to show up in this fashion within Tableau?

Well, the answer is in Custom SQL: I union the source table with itself, substituting the word 'Total' for the dimension that is listed at the top, which in the above example is Region.  Since Superstore Sales is an Excel file, that means using the legacy connection so that Custom SQL can be created.

Basic overview of SQL:

Select dimension1, dimension2, dimension3, measure1 from Source
Union
Select 'Total' as dimension1, dimension2, dimension3, measure1 from Source

What this does is add a 'Total' value across all the rows for that one selected dimension, so when the dimension is placed on a crosstab, not only are all the original values present, but there is also a new value called 'Total'.  Using SUM on the measure gives correct values for all the pieces and the Total, which can then be broken down by any of the other dimensions as a second, third, etc. column.

With the above in mind, we can get a bit fancier.  Tableau permits sending a parameter value to the connection.  Great: now instead of creating a connection for each dimension that I might want totaled in the above manner, I can use a comparison against the parameter to determine whether that field value will be 'Total' or what is on the row.  In most SQL, this comparison would be done with a CASE statement, nothing too difficult about that.  But Superstore Sales is an Excel sheet, and Excel doesn't permit CASE statements in Custom SQL.  Instead of a CASE statement, a Switch statement is used in the SQL.  First things first, though: create the parameter and fill its list with the names of the dimensions you may want to show as a total.  (BTW, if you pick a field that would normally be shown as a date, it will instead be treated as a text field; you'll have to create a calculated field to convert the text field back to a date field should you want Tableau's built-in date capabilities.)

So, after creating the parameter (in my case named 'Total Field Chooser'), the above SQL looks like this:

Select dimension1, dimension2, dimension3, measure1 from Source
Union
Select
Switch(<Parameters.Total Field Chooser> = 'dimension1', 'Total', true, [Source].[dimension1]) AS [dimension1],
Switch(<Parameters.Total Field Chooser> = 'dimension2', 'Total', true, [Source].[dimension2]) AS [dimension2],
Switch(<Parameters.Total Field Chooser> = 'dimension3', 'Total', true, [Source].[dimension3]) AS [dimension3],
measure1 from Source

So, with the above in place as Custom SQL, I can change the parameter and get the selected dimension filled with the word 'Total'.  One more task: create a calculated field that interprets the parameter to select the dimension; for this, a simple CASE statement in Tableau works fine.  I named the calculated field 'Selected Total Field':

CASE [Total Field Chooser]
WHEN 'dimension1' then [dimension1]
WHEN 'dimension2' then [dimension2]
WHEN 'dimension3' then [dimension3]
END

Download the workbook (version 9.3) to see how this is built.  It has three tabs: one connecting to Superstore Sales to show Tableau's out-of-the-box layout options, another with a 'fixed' column (Segment) that then has Total added, and finally a Custom SQL, parameter-driven version that lets a person choose which field gets the Totals.  Enjoy.

Tableau Version 9.3 download


TEDx Greensboro

TEDx Greensboro is a local event of TED Talks (TED = Technology, Entertainment, Design).  I was fortunate to attend this inaugural event in Greensboro (attendance was chosen through a lottery).  The theme was 'Dreamsboro', with the idea that presentations would focus on actions that make Greensboro, NC a place where folks want to live, work, play, retire, and do other good things.

In the first few minutes of the TEDx event, the moderator, Justin Catanoso, asked that cell phones, etc. be turned off.  So, at first, there wasn't much activity.  Then, slowly, folks couldn't contain their enthusiasm any longer, and the tweeting began in earnest.

The viz dissects the words used by the tweeters, as well as data about who tweeted, which speakers were mentioned, and what tags were used.  A user can view the tweeter's profile, see the tweet itself, or open links mentioned within a tweet.

Now, let's talk about how this viz was built --

First, it was inspired by Andy Cotgreave's efforts that culled tweets for the Tableau Customer Conference 2012, as well as other conferences (directions found here).  To obtain the raw data, I installed Python and ran the Python script to create a CSV file of the tweets.  That was the easy part.

I wanted to extend Andy's approach so that I could use Tableau 8's new word cloud viz.  To do that, I needed to split the tweets into separate words.  This likely could have been done in Python, as could the rest of the tasks, but I used what I knew well: Excel.  I created a macro that splits each tweet into words, with each word placed in a separate column on the same line as the tweet.  What did I have now?  The basic structure needed for the Tableau Reshaper Tool Add-In (found here).  That gave me a row for each word, along with the full tweet info.
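As noted, the split could likely have been done in Python instead of an Excel macro plus the Reshaper add-in.  A minimal sketch, with the file name, column name, and noise-word list assumed for the example:

import pandas as pd

tweets = pd.read_csv("tedx_tweets.csv")  # hypothetical CSV produced by the tweet-collection script

# One row per word, with the full tweet information carried along on each row.
words = (
    tweets.assign(word=tweets["tweet_text"].str.lower().str.split())
    .explode("word")
)

# Drop noise words kept in a separate list, mirroring the join exclusion described below.
noise = {"the", "a", "an", "and", "rt", "of", "to"}
words = words[~words["word"].isin(noise)]

words.to_csv("tedx_tweet_words.csv", index=False)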

Next, I found that many words were noise: adjectives, pronouns, misspellings, etc.  A little manual effort identified these noise words and set them aside in a separate worksheet tab.  I then connected Tableau Public to the Excel file and created a join condition that excluded rows with the noise words.  From there it was just a matter of creating the word clouds, the tweet timeline, and some parameter-driven choices for how to present the data.

Did I mention that all the tools used for data acquisition, reshaping, and visualization are free?  Python, the Tableau Data Reshaper, and Tableau Public (which will handle up to ONE MILLION rows of data)!  You can use the above guidance to recreate a viz of your own against your own media event.  I entered this into a 2013 Tableau User Conference contest; please tweet about the viz using the hashtag #TableauTEDxGSO.

Hope you enjoy!
