Unstructured Data Can Create Chaos

It seems that no matter where you go these days (Twitter, your favorite tech blog, email newsletters), everyone is talking about Big Data.

Let’s face it, Big Data is a buzzword, or buzz term, that many technology professionals are being forced to address.  Most database guys like myself have been dealing with large amounts of data for years.  What was once kilobytes of data turned into megabytes, then gigabytes, then terabytes and now even beyond that.  We’ve dealt with this sizable data in a lot of different ways, including table partitioning, regular archival and purging, and the creation of data warehouses that are kept separate from our regular transactional databases.  We’ve had the time to analyze what is coming into our databases so we can transform it into something useful.  This latest wave of “big data” is taking some of these approaches away from us for two reasons: velocity and volume.

At some point, the data becomes too big to handle and arrives faster than our systems can process it.  Now, fast forward to the wonderful world of unstructured data.  This world says that we really don’t care what the incoming data looks like; we’ll just store it.  Then, after a while, we’ll be able to do something useful with it.  But just how realistic is this approach?  As a database professional, I like to ensure data quality.  By introducing unstructured data into my world, you’ve thrown a lot of my ability to ensure data quality out the window.  I can store it for you.  I might even be able to query a lot of it and produce useful insight from it, but over time the data just becomes more and more difficult to manage.

For example, once I’ve traversed the last two years of web logs and created a dashboard of how often our customers visit each of our web pages, do I keep the detail data just in case I come up with a new way to traverse it and create new business knowledge?  If I do keep it, do I tie my new business knowledge back to the rows of unstructured data for drill-down purposes?  In some shops this may be impossible.  My only real option may be to archive it.  While I’m analyzing the bulk unstructured data already stored, let’s not forget that all my current customers are quickly producing mounds of new data that I’ll have to do something with sooner or later.
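To make the drill-down question concrete, here is a minimal sketch of one way a page-hit summary could keep a pointer back to the raw log lines it came from.  Everything in it is an assumption for illustration: the log format (a common-log-style request line), the sample entries, and the names LOG_PATTERN and summarize are all hypothetical, not anything a particular shop or product uses.

```python
import re
from collections import Counter, defaultdict

# Assumed log format: a quoted request such as "GET /home HTTP/1.1".
# Real web logs vary; this pattern is illustrative only.
LOG_PATTERN = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+"')


def summarize(lines):
    """Count hits per page and record which raw line numbers fed each count."""
    hits = Counter()
    sources = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        match = LOG_PATTERN.search(line)
        if match is None:
            # Unparseable line: skipped here; a real job would track these.
            continue
        path = match.group("path")
        hits[path] += 1
        sources[path].append(line_no)
    return hits, sources


# Made-up sample data, including one line that does not fit the format.
sample_log = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2013:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 2048',
    'garbage line with no request',
    '10.0.0.1 - - [01/Jan/2013:10:00:09 +0000] "GET /home HTTP/1.1" 200 512',
]

hits, sources = summarize(sample_log)
for path, count in hits.most_common():
    print(path, count, "from lines", sources[path])
```

The sources mapping is what makes drill-down possible, and it is also the catch: it only stays valid for as long as the raw lines are kept where those line numbers can still find them, which is exactly the retention decision described above.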

To be fair, vendors are giving us ways to deal with this data.  Newer open source NoSQL databases such as CouchDB are document-based solutions.  The Hadoop Distributed File System (HDFS) provides file-based storage that is, in theory, easy to get to and designed to store bulk data.  Developers are slapping SQL-like interfaces such as Hive on top of HDFS so that those of us with SQL skills can access the data in these new systems.  But wait, if it is in fact truly unstructured, how do I know what I need?  If data is coming in from multiple sources and just being dumped into an open file system, how do I make sense of it?
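One modest first step toward answering that question is simply to profile what has landed.  The sketch below assumes the incoming records are JSON documents, one per string; that assumption, the sample records, and the name profile_fields are all hypothetical.  It reports which fields show up, how often, and with which types, and it counts the records it could not read at all.

```python
import json
from collections import defaultdict


def profile_fields(raw_records):
    """Per field: how many records carry it and which value types appear."""
    counts = defaultdict(int)
    types = defaultdict(set)
    rejected = 0
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            rejected += 1  # not JSON at all
            continue
        if not isinstance(record, dict):
            rejected += 1  # valid JSON, but not a document with fields
            continue
        for field, value in record.items():
            counts[field] += 1
            types[field].add(type(value).__name__)
    return dict(counts), {f: sorted(t) for f, t in types.items()}, rejected


# Made-up records from imagined sources that disagree with each other.
sample = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/pricing", "ms": "95"}',
    '{"customer_id": 7, "page": "/home"}',
    'not json at all',
]

counts, types, rejected = profile_fields(sample)
for field in sorted(counts):
    print(field, counts[field], types[field])
print("rejected:", rejected)
```

Even on four records the profile surfaces the usual problems: one source says user where another says customer_id, and ms arrives as both a number and a string.  A profile like this does not fix data quality, but it shows how much cleanup stands between the raw files and a trustworthy query.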

