Tech and T-Accounts

Foraging through Teradata

I have been asked to find data in a Teradata data warehouse without being provided any contextual 'sherpa guide'.  Fortunately, the requested person is familiar with the data that they seek and already has sourced a vetted version of the data through other means.  When that person then wants the data to come from a Teradata connection, it is typically because the vetted source doesn't have a convienent means to enable Tableau to connect to the data directly, but the requesting party knows that the data resides in the data warehouse and has obtained proper permissions to read the data directly.

After securing permissions and having knowledge of what the 'right' data should look like, I then connect to Teradata only to find something like 1 million unique Database/Table/Column combinations.  To assist my finding the data, I need to create my own sherpa guide.  Fortunately, DB's that run Teradata DW typically have instilled common naming conventions to their databases/tables/columns and rarely have obfuscated those items.  Connecting Tableau to the DBC database and the columsv view provides a comprehensive list of databases, tables and columns that then can be filtered using wildcard matching to find potential data sources for the requested data.

After finding potentials, creating a connection to each found db/table enables the data discovery (and lets you know if you have permissions to see that particular db/table/column).  The last step is to verify the quality of the data to the known vetted source.  With the last step complete, going further to actually create data visualization stories with Tableau can begin.


Tableau Server: Extracts, Data Connections, Data Sources

A post on the Tableau Linked In forum got me thinking about how to explain about Tableau Server (version 6-8) Data Sources.  A Data Source published to Tableau Server is a convenient method to share the data connection instructions (meta-data) including dimensions, measures, sets and any calculated fields.  Tableau covers some of the issues related to Data Sources in its knowledge base article found here.

But, the differences are quite perplexing and subtle, and which combination of tactics is used depends on the speed of the actual data repository connection and the reason for using Tableau server access rights to expose a data source rather than a file based solution (save a .TDS file to a network drive).  Below is an explanation of the terms as well as insights into considerations of the various elements.

Extracts - A copy of the data is placed in a proprietary database engine.  Use extracts when the direct connect to the data is cumbersomely slow, the data is a snapshot, or the features of the underlying data source lack functionality (like the MS Jet engine)

Data Connection -- When authoring a workbook, data connections are created to obtain data from a specific data repository.  Repetitive data connections can be made to the same data repository (a global filter applies to a specific data connection.  A use case for repetitive connections is when there are some worksheets created that use a global filter and others in the same workbook that shouldn't have the global filter applied).  A data connection is a SQL statement that is comprises of either a single table, multiple tables with join conditions or custom SQL.

Data Source -- This is a Tableau term for the xml instructions that can be saved either to a file or to the server.  The actual data is not contained in these files or loaded to the server. Data Sources can include instructions about how and where an extract can be found. A data source only includes the metadata that tells Tableau how to create the data connection and what formulas and native data elements exist in that data connection. It does not include formatting.

You can make it seem that the data is on the tableau server. When an extract is saved to a network location (use the UNC filename, not mapped drives) that the server and user can also access, then when a user selects 'Import Data Source' from the server, the data, dimensions, measures and sets appear in the workbook because the user has access to the data extract that is defined in the data source, not because the data resides on Tableau Server.

Version 6 permits saving extract files to a network drive, while prior versions required extracts to reside on a local drive.  Extracts on a network drive can impact performance.

Publishing Data Sources to a server is a method to share a common set of formulas, etc. with other server users (the publisher chooses who to give permissions).  If the connection requires database permissions, then those credentials will have to be entered by the user to access the actual data.  However, come caveats: 
1. Read only vs. update permissions to the extract located on a network drive have to be managed by network drive access permissions. 
2. If the network extract file is open by any process, then the extract cannot be updated.  A network drive extract file can have several workbooks read the file at the same time.
3. A workbook with update file rights can add dimensions, measures, sets to the data source and re-generate the extract.  The new dimensions aren't shared, however, unless the data source is re-published to the server.  When other workbooks open, they see a message "A newer version of the published data source is available.  Would you like to use the latest version?"  Answering no means you get to the data source as before, but use the old XML definition that was in the workbook. Some elements could be broken in your workbook due to renamed or removed items from the extract file. Answering yes brings in the new elements.

If the data connection was saved with instructions to use it as an extract, then a popup appears to either:
 1. locate the  the extract (which could be on a network drive)
 2. remove the extract (it will then require appropriate permission to use the data connection)
 3. Deactivate the extract (keeps the extract info in the xml file, but doesn't use that information -- instead requires appropriate permission to use the data connection)  If the extract file didn't exist before, an error will occur when attempting to regenerate the extract.
 4. Regenerate the extract (careful!  this regenerates the extract in the same location where it was originally generated - so to do this you will need permission to use the data connection and permission to save the extract wherever it was - it shows you a shortened version of the save location, but you don't have a choice to change this location)

So, lets complicate it some more:
When you publish a workbook to the server and it uses an extract, the extract is pulled from the network/local drive and placed on the server as part of a packaged workbook.  This is a copy of the extract and not the local/network located data source.  So, if the loc data source is modified and saved to the network, the Tableau server version of the workbook doesn't pull from that location, instead it pulls from the extract copy.  The meta-data about the data source is frozen as of the time that the server version of the workbook was published.  To get the server version to recognize changes, the workbook needs to be re-published with the updated data source information to be embedded into a packaged workbook.

I recently had an issue where I needed to have both a live connection to an Excel workbook and a MS SQL Server connection as a data extract in the same workbook.  By saving the workbook as a twb file, the live connection to the local file remained active.  If I saved the workbook as a twbx, then a copy of the Excel file was embedded in the workbook and the live connection was lost.  So, here comes the tricky part.  I wanted to publish this workbook to Tableau Server with a live connection to the Excel file.  The problem happens when publishing, Tableau automatically 'packages' the workbook for consumption to the server when there is an extract present.  Packaging is another term for what happens when a twbx is created.  So, when I published the workbook to the server, the Excel file connection no longer pointed to the local resource (I used UNC naming convention to point it to a network share that the Tableau server could access).  To overcome this issue, I had to publish the Excel workbook as a datasource to the Tableau Server -- but unchecking the box 'Include External Files', thereby making the datasource a live read to the Excel file.  Last, I created a new data connection to the this Tableau server datasource, then replaced the original Excel Datasource with this new one.  Now, when I publish the workbook to the server, the live connection doesn't break, as it is defined by the published datasource, not by the workbook.  Even as a packaged workbook, the live link remains working.


TEDx Greensboro

TEDx Greensboro is a local event of TED Talks (TED = Technology, Entertainment, Design).  I was fortunate to attend this inagural event in Greensboro (attendence was choosen through lottery).  The theme was 'Dreamsboro' -- with the idea that presentations would focus on actions that make Greensboro, NC a place where folks want to live, work, play, retire, and do other good things.

In the first few minutes of the TEDx event, the moderator, Justin Catanoso asked that cell phones, etc. be turned off.  So, in the first few minutes, there wasn't much activity.  Then, slowly, folks couldn't contain their enthusiasm any longer and the tweeting began in earnest.

The viz disects the words used by the Tweeters as well is data about who tweeted, which speakers were mentioned, what tags were used.  A user has the ability to see the Tweeters' profile, the tweet itself or open links mentioned within a tweet. 

Now, let's talk about how this viz was built --

First, it was inspired by Andy Cotgreave's efforts that culled Tweets and used at the Tableau Customer Conference 2012, as well as other conferences (directions found here).  To obtain the raw data, I installed Python and ran the Python script to create a CSV file of the tweets.  That was the easy part.

I wanted to extend Andy's approach so that I could use Tableau 8's new word cloud viz.  To do that, I needed to split the tweets into separate words.  Likely this could have been done in Python, as likely could have the rest of the tasks, but I used what I knew well -- Excel.  I created a macro that splits the tweets into words - each word placed into a separate column on the same line as the tweet.  What did I have now?  The basic structure needed to use the Tableau Reshaper Tool Add-In (found here).  This then gave me a row for each word, along with the full tweet info.

Next, I found that many words were noise.  Adjectives, pronouns, mis-spellings, etc. existed.  A little manual effort identified these noise words and set them aside in a separate worksheet tab.  I then connected Tableau public to the Excel, created a join condition that excluded rows with the noise words.  From there it was just a matter of creating the word clouds, tweet timeline, and some parameter driven choices for how to present the data.

Did I mention that all the tools used for data acquisition, reshaping and vizualization are free?  Python, Tableau Data Reshaper, and Tableau Public will handle up to ONE MILLION rows of data!  You can use the above guidance to recreate a viz of your own against your own media event.  I entered this into a 2013 Tableau User Conference contest, please tweet about the viz using the hashtag #TableauTEDxGSO.

Hope you enjoy!

Page 5 of 7 << < 1 2 3 4 5 6 7 > >>
Copyright© Brad Earle
Site Map | Printable View | © 2008 - 2021 KB Earle Associates LLC | |