Defining Unstructured Data
Before we can tell you why managing unstructured data is important, we must define it. First, let’s talk about structured data: data that conforms to a defined structure. Think of a database with tables and columns, where each column is constrained by the type of data allowed in it. The opposite of structured data, and the subject of this post, is unstructured data (everything else). We define unstructured data simply as information that either does not have a pre-defined data model or is not organized in a pre-determined format. This is the data we store on our servers, laptops and cloud storage in the form of documents (spreadsheets, PowerPoint presentations, Word documents, etc.).
What is in Unstructured Data that is so important?
Because Content is King!
The phrase “Content is King” actually dates back to 1974, when the words were written by J.W. Click and Russell N. Baird. Of course, the phrase did not become popular until the ’90s, when Bill Gates wrote an essay with the same title.
Without content, would Google be the company it is today? Without content, would the internet be useful? Google exists because we need to know what is out there: all that unstructured data needs to be indexed and searchable. The same goes for the data you store on your network or in the cloud.
In business we create proposals, TPS reports, memos and procedures, and we even mail merge data from our structured sources. We export structured data for analysis and then store the results on our network shares, laptops and cloud storage. That data sits in our “structured” file systems (not to be confused with structured data) just so we can find the files later. But the data moves from location to location, and each new employee has a preferred way of organizing it. To combat this, organizations have created drive mappings, shares and other “locations” for data: reports are on the R:\ drive, home folders on the H:\ drive, and customer data lives on the M:\ drive.
The content in these documents is needed to make informed business decisions and provide further analysis.
The Data Glut
The volume of unstructured data has grown at an exponential rate, and IT organizations have been unable to get their arms around it. How can an organization bring order to an unstructured way of handling data? The need is outgrowing the skills of the typical IT department staffed with support and systems engineers. They need to make decisions, but what do they base those decisions on? Do they just keep buying more disk? Depending upon your industry, you may have files that must be kept private at all costs because they contain private information. Does someone in IT get to decide what data is sensitive? Is there a strong information governance program in place to manage the generated data?
Governance is the fix?
A new information governance program could put structure in place to help with the data. It can create policies such as:
- All data is to be archived after 72 months.
- All data will be moved to cold storage if untouched for a year.
- All data is to be backed up and kept for X years.
- All “sensitive” data is to be kept out of home folders and stored on the X:\ drive.
- Data containing customer information is not shared externally.
- Only senior-level executives have access to X.
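To make the first two example policies concrete, here is a minimal sketch of how they might be expressed as a rule. It is an illustration only: the function name, the thresholds in days, and the choice to measure the archive age from a file's last modification are all assumptions, not part of any policy or product described in this post.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds based on the example policies above.
ARCHIVE_AFTER = timedelta(days=6 * 365)  # roughly 72 months (assumption)
COLD_AFTER = timedelta(days=365)         # untouched for a year

def classify(last_modified, last_accessed, now):
    """Return the action the example policies would call for."""
    # Assumption: "archived after 72 months" is measured from last modification.
    if now - last_modified >= ARCHIVE_AFTER:
        return "archive"
    # Assumption: "untouched" means not accessed.
    if now - last_accessed >= COLD_AFTER:
        return "cold-storage"
    return "keep"
```

A scanner could call `classify` once per file, using timestamps read from the file system, and group the results into a report for the governance team.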
Great! New policies from management. How do we implement the new policies to remain compliant? Where do we start? What data do we have on our systems?
Time for File Analysis: I am an Engineer, Jim, Not a Miracle Worker!
With file analysis you will begin to see the scope of your unstructured data by identifying the Where, the What and the Who. Once you can answer these questions, you can begin to apply lifecycle management to your data and start to take control.
The Where (Where oh where can my data be, the helpdesk took it away from me)
In IT we try to create a structure for our data, as mentioned earlier. In doing so we create shares on our network. We then have to back up that data and eventually archive it, and we must make sure our systems can be backed up within the backup window. Once we know where all the data is stored, we must analyze those locations to make storage efficient. It must be easy for our end users to get to that data, and that in turn makes it easier to tell management where our data is.
The Whaaaaaat (says the minion)?
Unstructured data has content, and we must be able to search that content to find it. End users search for this data, but it is not in a searchable database; it is stored on servers, on shares and in the cloud. “I need my TPS reports,” says the boss who cannot find them on the network. We live in a society where we want instant access to our data.
For auditing we must be able to scan for sensitive files. Can we answer the question, “Is there any sensitive data stored in our users’ home folders?” To do so we need to analyze both the metadata and the content of our files. Google Drive is great at indexing the content of your documents and, for the most part, can tell you who has access to them. Network administrators can inspect particular files and folders to see who has access, but that is a one-off security review. How do we know who had access last month? We need to understand more about who has access.
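As a rough illustration of what a content scan for sensitive data could look like, the sketch below walks a folder tree and flags files whose text matches a pattern. The pattern (a US Social Security number shape) and the function name are assumptions chosen for the example. It only reads plain text, so real documents such as spreadsheets or Word files would need a proper parser first.

```python
import re
from pathlib import Path

# Hypothetical pattern: text shaped like a US Social Security number.
SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_sensitive_files(root):
    """Return sorted paths of files under root whose text matches the pattern."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Undecodable bytes are dropped; binary formats are not parsed.
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        if SENSITIVE.search(text):
            hits.append(str(path))
    return sorted(hits)
```

Pointing `find_sensitive_files` at the root of the home-folder share would give a first, very rough answer to the audit question above.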
The Who (Not the band you are looking for)
Our data is stored on the network. Someone owns each piece of data. Someone needs access to that data, and others need to be blocked from it. In an audit we must prove that certain individuals do not have access to data (such as sensitive customer data). If someone is found to have access they should not have, we must re-run our audit to prove that the access was taken away. Therefore we must look at owners and ACLs, and we must be able to report on specific “sensitive” data.
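Proving that access was taken away comes down to comparing two access snapshots. Here is a minimal sketch, assuming each snapshot has already been collected as a dictionary mapping a path to the set of users who can reach it. How the snapshot is collected (reading ACLs, expanding groups) is outside the sketch, and the function name is hypothetical.

```python
def diff_access(before, after):
    """Compare two {path: set_of_users} snapshots.

    Returns (granted, revoked), each a {path: set_of_users} dictionary
    containing only the paths where something changed.
    """
    granted, revoked = {}, {}
    for path in before.keys() | after.keys():
        old = before.get(path, set())
        new = after.get(path, set())
        if new - old:
            granted[path] = new - old
        if old - new:
            revoked[path] = old - new
    return granted, revoked
```

If an auditor flags a user on the customer data share, re-running the snapshot after the fix and finding that user under `revoked` for that path is the evidence that the access is gone.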
Solving the Business Case
You are already thinking about the tools you can use and the cool PowerShell scripts you can write to solve some of these problems. You are going to analyze your data, create better shares, manage your group memberships and so on. You are going to identify what is valuable. Here is a plan to get you started before you head down this path on your own.
Step 1. Perform File Analysis to Answer Who, Where and the What
Our file analysis solutions enable you to scan all your network shares and produce real reports on your file systems, including folder summaries, duplicate files, owners, quotas, data age and more. You can also produce security reports such as direct assignments and who has access to a path, and even show a historical comparison for each (solving the audit problem). You can then report on this to management at any time. Analysis is all about comparing two points in time.
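To give a feel for what one of these reports involves, here is a generic sketch of duplicate detection by content hash. It is not how our products implement the report; the function name is hypothetical, and reading whole files into memory is a simplification that would not suit very large files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under root by SHA-256 of their content.

    Returns a list of groups; each group is a sorted list of paths
    whose contents are byte-for-byte identical.
    """
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Simplification: reads the whole file at once.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(str(path))
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Running the same scan at two points in time and comparing the results shows whether duplicate data is growing or shrinking.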
Step 2. Plan for an unstructured information lifecycle
Our solutions for managing unstructured data allow you to create managed folders via policy (user home folders and folders/shares based upon group membership). Think of this as Identity Management for files and folders. It allows you to create, update and eventually move or destroy folders automatically. During this phase you will also create retention policies and be able to move unused data to less costly storage.
Step 3. Make sure you are automating the user lifecycle
If you are not automating who has access to your network, you are likely not managing who is in a group. Group memberships are key to who has access to your data. Our solutions can automate the lifecycle of users based upon their roles. Those roles can be used to automatically provision users to groups, and if a user’s role changes you can use policy to remove them from a group.
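The idea of role-driven group membership can be sketched in a few lines. The role names, group names and function below are invented for illustration and do not reflect any particular product or directory.

```python
# Hypothetical mapping from a role to the groups that role should belong to.
ROLE_GROUPS = {
    "sales": {"grp-customers", "grp-reports"},
    "engineer": {"grp-engineering"},
}

def membership_changes(old_role, new_role):
    """Return the groups to add and remove when a user's role changes."""
    old = ROLE_GROUPS.get(old_role, set())
    new = ROLE_GROUPS.get(new_role, set())
    return {"add": new - old, "remove": old - new}
```

For a new hire, `old_role` is simply a role that is not in the mapping (for example `None`), so every group for the new role lands under `"add"` and nothing is removed.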
As always, if you would like to see our solutions in action, please contact us for more information.