Ecommerce websites are busy places. Not only are they busy places with customers browsing and buying but, as with any high-street operation, there are a large number of behind-the-scenes operations. These include supply chain ordering, accounts, stock-control and warehouse management, returns and logistics. All of these behind-the-scenes operations involve the sending and receiving of electronic data – often in the form of CSV, XML or other format files. Lots of data comes in, lots of data goes out and everything requires careful management.
Most of the data in these exchanged files encodes information about products; new ones being made available for sale, items that are out of stock, price changes and items that have been purchased and need to be sent off to a customer’s delivery address. The main website will hold data in one particular way, the accounts and CRM systems will hold data in a different way. Each of the supply chain product manufacturers will each have their own way of storing, categorising and managing their own stock while different couriers will again have ways of working that are unique to themselves.
Data wrangling, sometimes called data munging is the task of aligning all these different data formats and structures so that the overall system joins-up correctly. The activities involved includes data transforming and mapping from one type to another.
Related areas include can data cleansing. This topic is commonly viewed as being part of the data wrangling process. In practice almost all data sources will contain errors including duplicate entries, simple keyboard typos from manual data entry or mis-categorised items.
Another component part is ETL (Extract, Transform, Load) which is normally a much simpler set of database to database operations. Historically, ETL was often treated as a separate process whereas today, it is more commonly viewed as part of data wrangling.
While some wrangling tasks can be one off, others can involve setting up a system that will then repeatedly transform one set of data automatically from one format to another. The variety of tasks often mean using different tools; here are a variety of different systems that can be used:
- Mr. Data Converter: Generates HTML, JSON and XML from excel data
- Python and Pandas: Handy for merging, joining, and transforming huge hunks of data with a single Python statement.
- CSVKit: General purpose wranger with good SQL interface.
- Tablula: Converts tables in, for example, a pdf document into general spreadsheets.
- OpenRefine: A powerful set of programmable expressions for complex tasks.
- “R” packages: This is a data scientist’s programming language.
- DataWrangler: A professional package that converts almost anything to anything and includes data cleansing. If you are looking for a single tool then this is probably the one.
According to a recent world-wide market research study, the global forecast for data wrangling will grow from an estimated $1.29 Billion in 2018 to $3.18 Billion by 2023. Their research suggests that the key drivers for this changes will be the use of advanced tools that are making ever greater use of AI and Machine Learning to tie data together. Ecommerce is seen as one of the most important areas that will be utilising this increasingly important technology.