Posts in Category: Tableau

Data Scraping for a Beach Vacation

With the proliferation of ways to rent a beach house, finding the ideal property for a week at the beach has become an exhausting exercise: researching multiple local beach rental agencies, then checking various owner-managed listing sites like VRBO and Airbnb. The filter conditions for each site vary, so searching for properties in a particular area (a map-based search) sometimes isn't available. The same is true for using a site to check on desirable options, be it a pool, beach proximity, bedding, a grill, etc.

The objective is to scrape the data from the various websites using free tools, consolidate those results into a spreadsheet or CSV file, then use Tableau to filter geographically as well as by amenities, price, and so on.

Criteria

  • Emerald Isle, NC
  • Week of June 13 – June 20 (Saturday check-in/check-out)
  • 4 bedrooms, with sleeping for 3 children under age 6 and 3 pairs of adults

Websites to consider:

  • Airbnb, Booking.com, VRBO (all these seem to already be consolidated by www.vacationrenter.com)
  • www.hometogo.com (which seems to consolidate Airbnb, Booking.com, VRBO, HomeAway, VacationRentals.com, TripAdvisor, e-domizil)
  • EmeraldIsleRealty.com
  • SunSurfRealty.com
  • www.bluewaternc.com/crystal-coast-vacation-rentals/

Web Scrapers

The free web scrapers seem straightforward to configure (and have similar interfaces), so learning one scraper carries some experience over to the others. When the first scraper chosen didn't work for the next site, a different scraper was chosen. Hoping that one scraper would work on every site proved unattainable, but then again, I also wasn't going to spend considerable time figuring out the peculiarities of why a chosen scraper didn't work (although I did spend some time going through tutorials to get comfortable that the approaches I was taking should have worked). I wanted to avoid writing code, so the scrapers I considered had to be suitable for non-programmers.

Summary of scraping efforts (once a site had useful results, other approaches were not attempted, so there are lots of blanks in the table):

www.Import.io (have to use Chrome, Firefox or Safari)
www.Octoparse.com
www.parsehub.com
www.webscraper.io (chrome extension)
Google Sheets (using IMPORTHTML or IMPORTXML functions)
Colab (Python in the cloud) – a minimal sketch of this route follows below
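For reference, here is a minimal sketch of what the Colab/Python route can look like. The URL and the CSS selectors are hypothetical placeholders, since every rental site structures its listing pages differently; heavily AJAX-driven sites (Airbnb, VRBO) generally won't yield their listings to a simple request like this, which matches the frustrations described later on.

# Minimal scraping sketch (hypothetical URL and selectors; adjust per site)
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.example-rentals.com/listings?page=1"   # placeholder listing page
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select("div.property-card"):             # hypothetical selector
    name = card.select_one("h2").get_text(strip=True)
    link = card.select_one("a")["href"]
    price_tag = card.select_one(".price")
    rows.append({"name": name, "url": link,
                 "price": price_tag.get_text(strip=True) if price_tag else ""})

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "url", "price"])
    writer.writeheader()
    writer.writerows(rows)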

Consolidating data

The data from the various scrapers, while similar, had column variations – some included extra columns of URL links, some placed all the amenities in one cell, and sometimes the data and the element name were included in the same cell. Getting the data into a consistent representation was done either in Excel or with Tableau Prep.
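I did this conforming step in Excel and Tableau Prep; for anyone who prefers code, a rough pandas equivalent might look like the sketch below. The file names and column mappings are hypothetical examples of the variations described above.

# Rough sketch of conforming differently shaped scraper exports
# (file names and column mappings are hypothetical)
import pandas as pd

rename_maps = {
    "sunsurf.csv":   {"Property": "name", "Weekly Rate": "price", "Link": "url"},
    "bluewater.csv": {"Name": "name", "Price": "price", "Page URL": "url"},
}

frames = []
for file_name, mapping in rename_maps.items():
    df = pd.read_csv(file_name).rename(columns=mapping)
    df["source"] = file_name               # remember which site each row came from
    frames.append(df[["name", "price", "url", "source"]])

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("all_properties.csv", index=False)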

Adding missing Geocode data

The data from the websites had street addresses, and while the websites had map representations of the search results, those maps didn't expose the latitude and longitude used for each property.

Data Fields (as received)

The data model I've settled on requires splitting the data so that one table has a single row for each property, and fields that hold multiple items (like amenities) go into a second table with two columns – the property and the item. This setup showcases the new Tableau data model and forgoes the need for many LOD calculations.
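To illustrate that split, here is a small pandas sketch that keeps one row per property and explodes a comma-separated amenities cell into the two-column table; the sample values and column names are made up for the example.

# Sketch: split one scraped table into the two-table model
# (one row per property, plus a property/amenity pair table); sample data is made up
import pandas as pd

scraped = pd.DataFrame({
    "property":  ["Sea Breeze", "Pelican Perch"],
    "price":     [2450, 1980],
    "amenities": ["pool, grill, hot tub", "grill, game room"],
})

# Table 1: one row per property, without the multi-valued column
rental_properties = scraped.drop(columns=["amenities"])

# Table 2: one row per (property, amenity) pair
amenities = (
    scraped[["property", "amenities"]]
    .assign(amenity=lambda d: d["amenities"].str.split(","))
    .explode("amenity")
    .assign(amenity=lambda d: d["amenity"].str.strip())
    [["property", "amenity"]]
)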

Experience

SunSurf – Parsehub parsed the data well, and I used the Texas A&M geocoding service to geocode the addresses. But after reviewing the results, I had to re-parse the list using Webscraper.io to fill in missing data. Webscraper.io turned out to be the better option for this site.

BlueWater – Since Webscraper.io worked well for SunSurf, I started there, but it defied getting more than 7 rows of data (it would get stuck on the opening page, and when AJAX loaded the list of properties, either my selection process or the tool itself got lost). BlueWater exposes a list of property names with URLs to those properties, so if I could just get that list exposed via a URL, I could use another tool to extract the data from it. So I created my own version of the exposed list by using the F12 developer tools to copy the HTML that held the list of property names, then loaded that HTML to a website so Parsehub could grab the property names and their associated URLs into an Excel sheet. When clicking on a listed name, the URL in the browser changed to one that included the property name, so I used that pattern to build the list of URLs for all the properties. This URL list was then fed to Octoparse, which permits loading up to 10,000 URLs to scrape. However, the free tier of Octoparse is throttled, so the 555 lines took 5 hours and 17 minutes to parse. The result did provide street addresses, though not quite clean: there was a carriage return (no space) before the city name, so parsing out the address elements required Excel's SUBSTITUTE function to get the street address, city, and state onto one line, ready for the further splitting needed for geocoding. Once parsed, there were many rows with duplicate addresses, as would be expected for condo complexes. The duplicates were removed to get a unique list of addresses to submit to the Texas A&M geocoding service, which geocoded the 385 records in a few minutes.
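The same cleanup could be scripted instead of done in Excel; here is a sketch of the line-break fix, the address split, and the de-duplication, assuming a hypothetical 'address' column that holds the scraped text.

# Sketch: fix the embedded line break before the city name, split the address,
# and de-duplicate before geocoding (the 'address' column is a hypothetical example)
import pandas as pd

df = pd.read_csv("bluewater_scrape.csv")

# replace the carriage return / newline that sat directly before the city name
df["address"] = df["address"].str.replace(r"\s*[\r\n]+\s*", ", ", regex=True)

# split "street, city, state" into separate columns for the geocoder upload
parts = df["address"].str.split(",", expand=True)
df["street"] = parts[0].str.strip()
df["city"]   = parts[1].str.strip()
df["state"]  = parts[2].str.strip()

# condo complexes repeat the same street address; geocode each address only once
unique_addresses = df[["street", "city", "state"]].drop_duplicates()
unique_addresses.to_csv("addresses_for_geocoding.csv", index=False)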

Emerald Isle Realty – Since Octoparse had worked well on the previous site, I figured I'd start with that one. Caution: I didn't spend a lot of time trying to figure out whether a tool would work – if it didn't work with limited instruction, I just moved to the next tool. On this site, the street address was not a unique element on the page, but was sometimes buried in the description, so I used Excel to split out the street address. Some descriptions lacked an address entirely but did expose a mapped location on the website. While the site had a 'mapped' location for a property, the actual latitude/longitude pair used to place the marker on the map could not be found for scraping. So, for these properties, I manually opened the property URL alongside a second browser window opened to Google Maps. By visually matching Google Maps to the property website map, I then used Google Maps to obtain the street address and lat/long coordinates. This process got quite tiresome, but fortunately there weren't that many properties with a missing address to fill in.
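Pulling an address out of the description text could also be attempted with a regular expression; the pattern below is only an example for "123 Ocean Drive, Emerald Isle, NC 28594"-style text and would need tuning against the real descriptions.

# Sketch: pull a street address out of a free-text description (example pattern only)
import re

description = "Enjoy your stay at 123 Ocean Drive, Emerald Isle, NC 28594. Sleeps 10."
pattern = re.compile(r"\d+\s+[\w\s\.]+?,\s*Emerald Isle,\s*NC(?:\s+\d{5})?")
match = pattern.search(description)
address = match.group(0) if match else None   # fall back to the manual map lookup when no match
print(address)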

BlueWater footnote – After scraping the Emerald Isle Realty data, I noticed I hadn't gotten the image URL in my first attempt. Import.io worked OK for the Emerald Isle properties and was the easiest to set up (it figures out what you are likely to want and builds the scraping model for you, which you can then adjust to add or remove columns). When I used it the second time, I figured out that to 'train' the Import.io process, choosing similar URLs is absolutely necessary for it to scrape properly. Also, its auto process picked up a bunch of URLs very quickly and then presented a checkbox list of which groups of URLs were to be included in the scraping loop. There was a choice to pick up only the rental-property-related URLs, and when I selected only that subset, the training went exceptionally well, as did the results. Downside of Import.io – only 1,000 webpages can be scraped per login per month, so after running it once incorrectly against 600+ pages (I had two groups selected, so Import.io built a model based on one webpage that didn't work for almost all of the pages), that was all I could do that month using that login identity.

Final Cleanup

Whew! I got the data downloaded, and even though I had hoped the data would be similarly structured, it wasn't between sites. So the next step was to decide on a data table structure (data model) and load the scraped data into that model. This meant not including some data that was unique to one site when the other sites didn't have the same info (like customer or agency ratings and customer reviews). Some sites listed a few amenities in a scrapable area, which was nice, except that they listed more amenities in the long description of the property. How amenities were identified meant the data had to be conformed – that is, 'ping-pong table' and 'ping pong table' needed to be adjusted to read one way or the other. Amenities by property are listed in their own tab. The idea is that some amenities may be considered a requirement (like pets allowed), while others are nice to have. Those amenities commonly considered a requirement have a separate flag/counter field in the RentalProperties tab: pets, pool, grill, game room, elevator, pier/dock, community pool, hot tub, and boats allowed.
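Conforming the amenity spellings and building those requirement flags could also be scripted; here is a rough sketch along those lines, where the normalization map is only a partial example and the file and column names are assumptions.

# Sketch: conform amenity spellings and flag the "requirement" amenities listed above
import pandas as pd

amenities = pd.read_csv("amenities.csv")      # assumed columns: property, amenity

# conform variant spellings to one representation (partial example map)
normalize = {"ping pong table": "ping-pong table", "hottub": "hot tub"}
amenities["amenity"] = amenities["amenity"].str.strip().str.lower().replace(normalize)

# flag amenities commonly treated as requirements, one column per flag
requirements = ["pets", "pool", "grill", "game room", "elevator",
                "pier/dock", "community pool", "hot tub", "boats allowed"]
flags = (
    amenities[amenities["amenity"].isin(requirements)]
    .assign(value=1)
    .pivot_table(index="property", columns="amenity", values="value", fill_value=0)
    .reset_index()
)
# 'flags' can then be joined back onto the RentalProperties tab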

Prices – Some sites had prices for various rental options – weekly, two-week, or monthly rates, all different for different date ranges. Others had just weekly rates for different ranges. I was only interested in a particular week, so that week's rate is included as a column in the main RentalProperties tab. Even then, scraping didn't pick up a price for some properties, so a manual verification of those properties revealed three conditions. First, the property was no longer listed by the rental agency (it was listed when the property scrape occurred, but no longer existed a few weeks later when the price verification was being done). Second, the property is listed, but prices aren't – instead 'Advance Hold' appears where the price was expected; I figure these are properties that were listed in the past but haven't signed the contract for this year yet. Third, the property is listed, but there is a note that no weekly rentals are available; this likely occurs when the property has been rented by the month, or because the contract hasn't been signed for the period. All of these non-priced properties are included, but no pricing is available.

After sending the data for geocoding, there were many properties with the same street address (condominiums, for instance). I wanted Tableau to map those properties as separate map points so that, when hovering, a person could get individual property information in a tooltip rather than a * indicating multiple values. So, in the data, I nudged the latitude/longitude by 5/1,000,000 of a degree to make each mark separately identifiable.
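The nudge itself is tiny; here is a sketch of how duplicate coordinates could be jittered in code, assuming 'lat' and 'lon' columns in the consolidated file.

# Sketch: nudge duplicate coordinates by 5/1,000,000 of a degree so each property
# gets its own mark on the map (column names are assumptions)
import pandas as pd

df = pd.read_csv("rental_properties.csv")         # assumed columns include lat, lon

# for rows sharing the same coordinates, offset each additional duplicate a bit more
dup_rank = df.groupby(["lat", "lon"]).cumcount()  # 0 for the first, 1 for the second, ...
df["lat"] = df["lat"] + dup_rank * 0.000005
df["lon"] = df["lon"] + dup_rank * 0.000005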

Recap - Recommendations - Results shared

I started with what I thought would be the easiest option - the webscraper.io extension for Google Chrome.  While this was useful, it also revealed that some sites resist this type of scraper -- it needs to build a list of URLs on the fly from the web-scraping activity, and if that can't be accomplished, another tool or approach has to be used.  Parsehub and Octoparse are similar, but not the same.  The interface on Parsehub took some getting used to, and quite frankly, I found myself backing out what I had clicked to try again.  Once you figure out its very distinct clicking pattern for identifying page elements to scrape, the tool worked well.  It too, however, wanted to build the list of URLs to scrape itself, so when you already have a long list of URLs, Octoparse was the preferred tool.  Octoparse allowed a long list of URLs to be copied and pasted into the tool, then used one of those URLs for you to set up the elements to scrape.  I also tried Google Sheets, but it seemed not to fit my particular need (again, I didn't spend much time there, only enough to figure out it didn't get what I wanted with what I knew about the website).  You'll likely notice that Airbnb, VRBO, and similar sites aren't part of the results.  It isn't that I forgot about them; rather, my limited knowledge of how these tools work and those sites' extensive use of AJAX foiled my initial attempts, and I ran out of patience and time.  I even sent the request to another person who applied Python to the HometoGo website, only to come away equally frustrated.  I'll likely look into getting this data into the dataset eventually, but not in this first exposure of the results.

I shared the results with the Piedmont Triad Tableau User Group on February 6th, 2020, showing them this PowerPoint presentation (click here to download).  The workbook has been loaded to Tableau Public for consumption, and looks like this (click on the image to go to the active workbook):

Next Steps:

My son wants to know whether renting early gets either a better price or a better selection.  I'll check again in a few months to compare the results (which properties are available and what the prices are at that time).  Also, I'd like to get the missing VRBO, Airbnb, etc. properties into the data.


Tableau Server refresh all extracts using REST API

I just worked through the 10.3 REST API for refreshing data extracts on the server.  The documentation language was all there, just not tremendously clear: the explanation paragraphs state that some parameters need to appear, yet the example doesn't reflect those parameters being present.  The missing piece -- adding the -ContentType parameter to the invoke statement -- is alluded to in the instructions but, as stated, not present in the example.

I work with PowerShell to invoke the REST API because PowerShell comes with every Windows PC, which is what I use.  The syntax shown below would need to be changed if some other shell language were used.

 

Some caveats to get PowerShell to run a script on your PC:


1. If you want to run this as a script from a saved file (the extension for PowerShell script files is .ps1) and find that PowerShell won't run the script, it is likely because the PC has restricted the ability to execute scripts.  This must be changed in order to run a script from a file.  Some people suggest setting the script execution policy to Unrestricted, but that opens the PC up to unknowns when also browsing the internet.  Instead, setting the policy to RemoteSigned permits only local scripts and signed internet scripts to run.  After running your local scripts on a PC, for safety, you can also set the execution policy back to Restricted.

 

To reset the execution policy, the logged-in user must have PC admin rights AND Windows PowerShell ISE must be launched with the 'Run as Administrator' option.

Then run these two commands -- the first obtains the policy as it currently exists (should you want to revert) and the second changes the policy.  The policy cannot be changed from within a script, so these commands must be run manually.

$original_policy = Get-ExecutionPolicy

Set-ExecutionPolicy remotesigned

 

2. The default TLS protocols on Windows 10 PCs typically don't enable all the protocols needed for the web methods to work properly.  Enable all three TLS protocols (TLS 1.0, TLS 1.1, TLS 1.2) by also running this PowerShell command while in 'Run as Administrator' mode.

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls -bor [Net.SecurityProtocolType]::Tls11 -bor [Net.SecurityProtocolType]::Tls12

 

PowerShell to kick off a Tableau Server extract refresh - all extracts:

 

#Tableau Server - where data resides and extracts need refreshing
# Setup to connect
$server = "http://tableauserver"
$s = Invoke-RestMethod -Uri $server/api/2.4/serverinfo -Method get  #works on server version 10.1 and later
$api = $s.tsResponse.serverInfo.restApiVersion  #10.3 server

$username = "some_tableau_admin_username"
$password = "password for above user"
$sitelogin = "mysite"  #site name where extracts exist

# generate body for sign in
$signin_body = ('<tsRequest>
  <credentials name="' + $username + '" password="' + $password + '" >
    <site contentUrl="' + $sitelogin + '" />
  </credentials>
</tsRequest>')

$response = Invoke-RestMethod -Uri $server/api/$api/auth/signin -Body $signin_body -Method post

# save the auth token, site id and my user id
$authToken = $response.tsResponse.credentials.token
$siteID = $response.tsResponse.credentials.site.id
$myUserID = $response.tsResponse.credentials.user.id
$siteURL = $response.tsResponse.credentials.site.contentUrl

# set up header fields with auth token
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"

# add X-Tableau-Auth header with our auth token
$headers.Add("X-Tableau-Auth", $authToken)

# tests whether logged in user is an Administrator (site or server)
$loginUserid = Invoke-RestMethod -Uri $server/api/$api/sites/$siteID/users/$myUserID -Headers $headers -Method Get
$admin = $loginUserid.tsResponse.user.siteRole -like "*Administrator"

# create body element required for refresh extract POST
$refresh_body = ('<tsRequest>
</tsRequest>')

# only admin can perform extractRefresh, so this next set of commands only executes when the logged-in user is an admin
if ($admin)
{
    # get list of extract tasks from server -- Tableau 10.3 REST API can only kick off an extract task, not an individual extract refresh
    # this gets the list of all extracts -- no filter applied to select a schedule, owner, priority, workbook, etc.
    $extractlist = Invoke-RestMethod -Uri $server/api/$api/sites/$siteID/tasks/extractRefreshes -Headers $headers -Method Get

    Foreach ($taskid in $extractlist.tsResponse.tasks.task.extractrefresh.id)
    {
        $runnow = Invoke-RestMethod -Uri $server/api/$api/sites/$siteID/tasks/extractRefreshes/$taskid/runNow -Headers $headers -Body $refresh_body -ContentType "application/xml" -Method POST
    }
}


Tableau - Total one dimension in separate column

Well, many of us come to Tableau after being frustrated with Excel.  Then, to our surprise, Excel does some things easily that Tableau doesn't seem to handle well.  There are lots of posts about how to create subtotals and grand totals customized to a particular need.  However, my problem was that I wanted the Totals for a particular dimension to show as a single column -- that is, using something like Superstore Sales, I want a crosstab that has Region as the first discrete column dimension and YEAR(Ship Date) as the sub-column.  Relatively simple, until I wanted the order of the columns to look like this:
 

Total                    | Central                  | East
2011 2012 2013 2014 2015 | 2011 2012 2013 2014 2015 | 2011 2012 2013 2014 2015

I only show two of the Regions, but the idea remains -- how can I get the Years' Total to show up in this fashion within Tableau?

Well, the answer is in custom SQL -- I union the source table with itself, substituting the word 'Total' for the Dimension that is listed at the top -- in the above example, that would be Region.  So, using Superstore Sales, which is an Excel file, that means using the legacy connection so that Custom SQL can be created.

Basic overview of SQL:

Select dimension1, dimension2, dimension3, measure1 from Source
Union
Select 'Total' as dimension1, dimension2, dimension3, measure1 from Source

What this does is add a 'Total' value to all the rows for that one selected dimension, so that when the dimension is used in a crosstab, not only are all the values present, but a new value is also available, called 'Total'.  Using SUM on the measure will produce correct values for all the pieces and the Total, which can then be broken down by any of the other dimensions as a second, third, etc. column.

With the above in mind, we can get a bit fancier.  Tableau permits sending a parameter value to the connection -- great, so now instead of creating a connection for each dimension that I might want totaled in the above manner, I can use a comparison against the parameter to determine whether the field value will be 'Total' or whatever is on the row.  For most SQL, this comparison would be done with a CASE statement -- nothing too difficult about that.  But Superstore Sales is an Excel sheet, and Excel doesn't permit CASE statements in Custom SQL.  Instead of a CASE statement, a Switch statement is used in the SQL.  First things first, however: create the Parameter and fill the list with the names of the dimensions that you may want to show as a total.  (BTW, if you pick a field that normally would be shown as a date, it will instead be treated as a text field -- you'll have to create a calculated field to convert the text field to a date field should you want to use Tableau's built-in date capabilities.)

So, after creating the Parameter (in my case it was named 'Total Field Chooser'), the above SQL looks like this:

Select dimension1, dimension2, dimension3, measure1 from Source
Union
Select
Switch(<Parameters.Total Field Chooser> = 'dimension1', 'Total', true, [Source].[dimension1]) AS [dimension1],
Switch(<Parameters.Total Field Chooser> = 'dimension2', 'Total', true, [Source].[dimension2]) AS [dimension2],
Switch(<Parameters.Total Field Chooser> = 'dimension3', 'Total', true, [Source].[dimension3]) AS [dimension3],
measure1 from Source

So, with the above in place as Custom SQL, I can modify the parameter and get the selected dimension filled with the word 'Total'.  One more task: create a calculated field that interprets the Parameter to select the dimension -- for this, a simple CASE statement in Tableau works fine.  I named the calculated field 'Selected Total Field'.

CASE [Total Field Chooser]
WHEN 'dimension1' then [dimension1]
WHEN 'dimension2' then [dimension2]
WHEN 'dimension3' then [dimension3]
END

Download the workbook (version 9.3) to see how this is built.  It has three tabs: one connecting to Superstore Sales to show Tableau's out-of-the-box layout options; another with a 'fixed' column (Segment) that then has Total added; and finally, a Custom SQL, parameter-driven version that lets a person choose which field gets the Totals.  Enjoy.

Tableau Version 9.3 download
