Bioinformatics Basics: Data Acquisition and Data Wrangling


Introduction to Data Acquisition

Data acquisition is the process of obtaining data by downloading or transferring files from one location to another. It covers four main areas: file transfer, FASTA manipulation, manipulation of Excel data sheets, and data management.

The following tools and applications can be used to transfer files:

  • Wget: The name comes from "World Wide Web" and "get". It downloads files from a website or web server from the command line.
  • Globus: Globus manages file transfers, tracks performance, retries failed transfers, recovers from errors automatically where possible, and reports on the progress of a data transfer.
  • iRODS: It transfers data in parallel using multiple threads. This gives iRODS a substantial speed advantage over FTP, which uses only one thread per transfer.
  • SRA Toolkit: The SRA Toolkit is configured to connect to the NCBI SRA and download files using FTP.
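
As a rough illustration of what a command-line downloader such as Wget does, and of the retry behaviour described for Globus, the following Python sketch streams a URL to disk and retries on failure. It uses only the standard library. The function name `fetch`, its parameters, and the URL in the closing comment are assumptions made for illustration; none of them belongs to the tools listed above.

```python
import shutil
import time
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path, retries: int = 3, wait_seconds: float = 1.0) -> Path:
    """Download `url` to `dest`, retrying on failure (hypothetical helper)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # The response is opened first, so a failed request leaves no file behind.
            with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
                shutil.copyfileobj(response, out)  # copy in chunks, not all at once
            return dest
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
            if attempt < retries:
                time.sleep(wait_seconds)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Example call with a placeholder URL (not a real data set):
# fetch("https://example.org/reads.fastq.gz", Path("reads.fastq.gz"))
```

This single-stream approach is comparable to a plain FTP or HTTP download; it does not attempt the parallel, multi-threaded transfer described for iRODS.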

FASTA manipulation typically involves four operations: evaluating sequence lengths, converting FASTQ to FASTA, trimming FASTQ reads by quality, and extracting FASTA sequences by sequence ID.

Manipulating Excel data sheets, in turn, involves four main steps: (1) build a workbook from several text files, (2) generate an index of all worksheets, (3) merge two spreadsheets on a common column, and (4) export multiple worksheets as separate text files.

Finally, data management covers transferring data to the Sequence Read Archive of the National Center for Biotechnology Information (NCBI SRA).
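
The four FASTA and FASTQ operations named above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a substitute for a dedicated toolkit: records are held in memory as text, FASTQ records are assumed to occupy exactly four lines each, and quality scores are assumed to use the Phred+33 encoding. All function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (sequence_id, sequence); the id is the first word after '>'."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def parse_fastq(text):
    """Yield (read_id, sequence, quality), assuming 4-line records."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:].split()[0], lines[i + 1], lines[i + 3]


def sequence_lengths(fasta_text):
    """Evaluate sequence length for every record."""
    return {rid: len(seq) for rid, seq in parse_fasta(fasta_text)}


def fastq_to_fasta(fastq_text):
    """Convert FASTQ to FASTA by dropping the quality lines."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq, _ in parse_fastq(fastq_text))


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def extract_by_id(fasta_text, wanted_ids):
    """Extract the FASTA sequences whose ids are in `wanted_ids`."""
    return {rid: seq for rid, seq in parse_fasta(fasta_text) if rid in wanted_ids}


# Two made-up reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
fastq = "@r1\nACGTAC\n+\nIIII##\n@r2\nGGCC\n+\nIIII\n"
fasta = fastq_to_fasta(fastq)
print(sequence_lengths(fasta))
print(trim_3prime("ACGTAC", "IIII##"))
print(extract_by_id(fasta, {"r2"}))
```

The trimming rule here only removes a trailing run of bases below the threshold; sliding-window or average-quality trimming, as implemented in dedicated trimmers, follows different rules.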

Introduction to Data Wrangling

Data wrangling is a time-consuming, iterative process of preparing and enriching data for analysis and visualization; it is also known as data preparation. If the output of the data wrangling system contains new data or errors, the process, like data analysis itself, is repeated. This means that the sequence of program steps may be rerun until the desired outcome is obtained. Compared with data curation or data stewardship, which are much more complex, data wrangling is a relatively simple process. Data curation is a holistic process: the continuous management of data throughout its whole life cycle, from creation and first storage to the point at which it is archived for future analysis or becomes outdated and is removed.

The stages involved in data wrangling are as follows:

  1. Exploring: The first step is to get to know the data. Where does the data originate? What is its structure?
  2. Structuring: This step is necessary to correct and restructure the data in a consistent manner so that it can be used for analysis.
  3. Cleaning: This step thoroughly cleans the data so that it is more informative and supports high-quality analysis.
  4. Enriching: The goal is to improve existing data by adding more data from other internal or external data sources, or by generating new columns from existing data using calculation rules.
  5. Validating: The validation step will either expose any data quality issues or verify that a suitable transformation has been performed.
  6. Publishing: This is the final step in the wrangling process. It determines how the edited data is delivered to subject analysts, and for which software, so that it can be used for further purposes.
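
The six stages above can be sketched on a toy table using only the Python standard library. The column names, sample values, and function names are invented for illustration, and the rules (drop rows with missing values, keep the first copy of a duplicate, derive a mapped fraction) are assumptions rather than a prescribed method.

```python
import csv
import io

# Made-up raw table: stray whitespace, one missing value, one duplicate row.
RAW = [
    {"Sample ID": " S1 ", "Reads": "1200", "Mapped": "1140"},
    {"Sample ID": "S2", "Reads": "", "Mapped": "300"},
    {"Sample ID": "S3", "Reads": "800", "Mapped": "600"},
    {"Sample ID": "S1", "Reads": "1200", "Mapped": "1140"},
]


def explore(rows):
    """Exploring: count empty cells per column."""
    return {c: sum(1 for r in rows if not r[c].strip()) for c in rows[0]}


def structure(rows):
    """Structuring: consistent column names and types."""
    def to_int(cell):
        return int(cell) if cell.strip() else None
    return [{"sample_id": r["Sample ID"].strip(),
             "reads": to_int(r["Reads"]),
             "mapped": to_int(r["Mapped"])} for r in rows]


def clean(rows):
    """Cleaning: drop rows with missing values and repeated sample ids."""
    seen, kept = set(), []
    for r in rows:
        if None in r.values() or r["sample_id"] in seen:
            continue
        seen.add(r["sample_id"])
        kept.append(r)
    return kept


def enrich(rows):
    """Enriching: derive a new column (assumes reads is nonzero)."""
    return [{**r, "mapped_fraction": round(r["mapped"] / r["reads"], 3)} for r in rows]


def validate(rows):
    """Validating: return the ids of rows that break a sanity rule."""
    return [r["sample_id"] for r in rows if not 0 <= r["mapped_fraction"] <= 1]


def publish(rows):
    """Publishing: serialize as tab-separated text for downstream tools."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(explore(RAW))
table = enrich(clean(structure(RAW)))
if not validate(table):  # in practice, failures send you back to an earlier stage
    print(publish(table))
```

If validation reported problems, the earlier stages would be adjusted and rerun, which is the iterative aspect of data wrangling.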

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.

* For Research Use Only. Not for use in diagnostic procedures.