1) Define DataStage? DataStage is a tool used to design, develop, and execute applications to fill multiple tables.

Author: Mazuhn Mezinos
Published (Last): 28 March 2015

DataStage uses a graphical notation to construct data integration solutions and is available in various versions, such as the Server Edition, the Enterprise Edition, and the MVS Edition.

A: Parallel Extender in DataStage is the data extraction and transformation application for parallel processing. Two types of parallel processing are available: pipeline parallelism and partition parallelism.

A: Every parallel job has a conductor process, where execution starts; a section leader process for each processing node; a player process for each set of combined operators; and an individual player process for each uncombined operator. To kill a job's processes, stop the player processes first, then the section leader processes, and finally the conductor process.
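The process hierarchy and shutdown order described above can be modeled in a few lines. This is only a sketch: the `Process` class, the role strings, and the process names are hypothetical and are not part of DataStage. It shows that walking the tree level by level and reversing the levels gives the players first, then the section leaders, then the conductor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    """One process in the hierarchy (names and roles are illustrative only)."""
    name: str
    role: str  # "conductor", "section_leader", or "player"
    children: List["Process"] = field(default_factory=list)


def shutdown_order(conductor: Process) -> List[str]:
    """Return process names deepest level first: players, section leaders, conductor."""
    levels = []
    current = [conductor]
    while current:
        levels.append([p.name for p in current])
        current = [child for p in current for child in p.children]
    return [name for level in reversed(levels) for name in level]


def build_example() -> Process:
    """Two processing nodes, each with one section leader and two players."""
    leaders = [
        Process(
            f"section_leader_{n}",
            "section_leader",
            [Process(f"player_{n}_{i}", "player") for i in range(2)],
        )
        for n in range(2)
    ]
    return Process("conductor", "conductor", leaders)


if __name__ == "__main__":
    print(shutdown_order(build_example()))
```

In a real environment the processes would be identified from the operating system's process list rather than from a hand-built tree.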

IBM DataStage Flow Designer is a web-based user interface for DataStage. You can use it to create, edit, load, and run DataStage jobs. You can use the palette to drag and drop connectors and operators onto the designer canvas. You can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes. You can edit stage properties in the sidebar and make changes to your schema in the Column Properties tab. You can zoom in and out using your mouse, and use the mini-map in the lower right of the window to focus on a particular part of the DataStage job. This is very useful when you have a very large job with tens or hundreds of stages.

No need to migrate jobs – You do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface.

No need to upgrade servers and purchase virtualization technology licenses – Without a thick client, you no longer need to keep up with the latest version of the software, upgrade servers, or purchase Citrix licenses.

Easily work with your favorite jobs – You can mark your favorite jobs in the Jobs Dashboard and have them automatically show up on the welcome page. This gives you fast, one-click access to jobs that are typically used for reference, saving you navigation time.

Easily continue working where you left off – Your recent activity automatically shows up on the welcome page. This gives you fast, one-click access to jobs that you were working on before, so you can easily start where you left off in the last session.

Efficiently search any job – Many organizations have thousands of DataStage jobs. You can easily find your job with the built-in type-ahead Search feature on the Jobs Dashboard.

Cloning a job – Instead of always starting job design from scratch, you can clone an existing job on the Jobs Dashboard and use it to jump-start your new job design.

Once you add a source connector to your job and link it to an operator, the operator automatically inherits the metadata. You do not have to specify the metadata in each stage of the job.

Storing your preferences – You can easily customize your viewing preferences and have IBM DataStage Flow Designer automatically save them across sessions.

A job created in IBM DataStage Flow Designer is saved as a DataStage job in the repository, alongside other jobs that might have been created using the DataStage Designer thick client.

Highlighting of all compilation errors – The DataStage thick client identifies compilation errors one at a time, so large jobs with many stages can take longer to troubleshoot. IBM DataStage Flow Designer highlights all errors and lets you see the problem with a quick hover over each stage, so you can fix multiple problems at the same time before recompiling.

You can refresh the status of your job in the new user interface. You can also view the Job Log, or launch the Ops Console to see more details of job execution.


A: The HBase connector is used to connect to tables stored in the HBase database and perform operations on them.

A: The Hive connector supports modulus partition mode and minimum-maximum partition mode during the read operation.

A: The Kafka connector has been enhanced with the following new capabilities: continuous mode, where incoming topic messages are consumed without stopping the connector; and transactions, where a number of Kafka messages are fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.

A: The File connector has also been enhanced with new capabilities.
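The end-of-wave behavior described for the Kafka connector can be illustrated with a small generator. This is a sketch of the idea only, not the connector's API: the function name `with_end_of_wave`, the `END_OF_WAVE` sentinel, and the `record_count` parameter are all hypothetical names chosen for the example.

```python
END_OF_WAVE = object()  # hypothetical sentinel standing in for an end-of-wave marker


def with_end_of_wave(messages, record_count):
    """Yield each message, then emit END_OF_WAVE after every `record_count` messages."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    seen = 0
    for message in messages:
        yield message
        seen += 1
        if seen == record_count:
            # One "transaction" worth of messages has been passed downstream.
            yield END_OF_WAVE
            seen = 0


if __name__ == "__main__":
    for item in with_end_of_wave(["m1", "m2", "m3", "m4", "m5"], record_count=2):
        print("END_OF_WAVE" if item is END_OF_WAVE else item)
```

Because the function is a generator, it also works on an unbounded message source, which loosely mirrors the continuous mode mentioned above.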

DataStage Interview Questions & Answers

A: InfoSphere Information Server can scale to meet any information volume requirement, so that companies can deliver business results faster and with higher-quality results. It provides a single unified platform that enables companies to understand, cleanse, transform, and deliver trustworthy and context-rich information.

A: InfoSphere Information Server has four tiers: the client tier, the engine tier, the services tier, and the metadata repository tier.

A: The client tier includes the client programs and consoles that are used for development and administration, and the computers where they are installed.

A: The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.

A: The services tier includes the application server, common services, and product services for the suite and product modules, and the computer where those components are installed. The services tier provides common services (such as metadata and logging) and services that are specific to certain product modules.


The services tier also hosts InfoSphere Information Server applications that are web-based.

A: The metadata repository tier includes the metadata repository, the InfoSphere Information Analyzer analysis database (if installed), and the computer where these components are installed. The metadata repository contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules. The analysis database stores extended analysis data for InfoSphere Information Analyzer.

A: DataStage provides the elements that are necessary to build data integration and transformation flows.

A: Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of functionality that performs either a simple or an advanced data integration task. Stages represent the processing steps that will be performed on the data.

A: A link is a representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. Links are like pipes through which the data flows from one stage to the next.

A: Jobs include the design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Jobs are created within a visual paradigm that enables instant understanding of the goal of the job.

A: A sequence job is a special type of job that you can use to create a workflow by running other jobs in a specified order. This type of job was previously called a job sequence.

A: Table definitions specify the format of the data that you want to use at each stage of a job. They can be shared by all the jobs in a project and between all projects in InfoSphere DataStage. Typically, table definitions are loaded into source stages. They are sometimes loaded into target stages and other stages.

A: Containers are reusable objects that hold user-defined groupings of stages and links. Containers create a level of reuse that allows you to use the same set of logic several times while reducing maintenance. Containers make it easy to share a workflow, because you can simplify and modularize your job designs by replacing complex areas of the diagram with a single container.

A: A project is a container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.
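As a rough illustration of the sequence job idea described above (running other jobs in a specified order), the ordering can be thought of as a dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the job names and the `run_order` helper are hypothetical, and this is not how DataStage itself stores or schedules a sequence job.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return job names in an order that respects the dependencies.

    `dependencies` maps each job name to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: extract, then transform, then load.
workflow = {
    "extract_orders": set(),
    "transform_orders": {"extract_orders"},
    "load_warehouse": {"transform_orders"},
}

if __name__ == "__main__":
    for job in run_order(workflow):
        print("run", job)
```

`TopologicalSorter` raises `graphlib.CycleError` if two jobs depend on each other, which is a cheap way to catch an impossible ordering before anything runs.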


A: InfoSphere DataStage brings the power of parallel processing to the data extraction and transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data pipelining and data partitioning, allowing you to design an integration process without concern for data volumes or time constraints, and without any requirements for hand coding.

A: Data pipelining is the process of extracting records from the data source system and moving them through the sequence of processing functions that are defined in the data flow of the job. Because records are flowing through the pipeline, they can be processed without writing the records to disk.

A: Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance. When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at run time, InfoSphere DataStage uses that selection for the number of degrees of parallelism that are specified dynamically through the configuration file.
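Data pipelining and data partitioning, as described above, can be sketched with generators and a small partitioning helper. Everything here is an assumption made for illustration: the stage functions, the record layout, and the use of `zlib.crc32` as the hash are hypothetical, and the parallel engine's own algorithms may differ. The partition count is passed in as a parameter, loosely mirroring the way the degree of parallelism comes from the configuration file rather than from the job design.

```python
from zlib import crc32


def extract(rows):
    """Source stage: hand records downstream one at a time."""
    for row in rows:
        yield row


def uppercase_name(records):
    """Processing stage: transform each record as it flows past."""
    for record in records:
        yield {**record, "name": record["name"].upper()}


def modulus_partition(key, num_partitions):
    """Modulus partitioning: integer key modulo the partition count."""
    return key % num_partitions


def hash_partition(key, num_partitions):
    """Hash partitioning: a stable hash of the key, modulo the partition count."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def partition(records, key_field, num_partitions, method):
    """Split a record stream into `num_partitions` subsets using `method`."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[method(record[key_field], num_partitions)].append(record)
    return partitions


rows = [{"id": i, "name": f"user{i}"} for i in range(1, 7)]

# Pipelining: the generators are chained, so each record moves through both
# stages without an intermediate copy of the whole data set being stored.
pipeline = uppercase_name(extract(rows))

# Partitioning: the same records are split into three subsets by key.
parts = partition(pipeline, "id", 3, modulus_partition)
ids_per_partition = [[r["id"] for r in p] for p in parts]
```

Swapping `modulus_partition` for `hash_partition` changes only which subset each record lands in; records with the same key still end up together, which is the property that key-based partitioning relies on.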

A: A single stage might correspond to a single operator, or to a number of operators, depending on the properties you have set and on whether you have chosen to partition, collect, or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.

A: OSH is the scripting language used internally by the parallel engine.

A: Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

How do you decide which one to use?

A: Aim to use modular development techniques in your job designs in order to maximize the reuse of parallel jobs and components and to save yourself time.

A: InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations, in which one stage is unable to read its input because a previous stage in the job is blocked from writing to its output.
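The idea of a buffer sitting on a link between two stages can be illustrated with a bounded queue shared by two threads. This is a generic producer/consumer sketch, not a description of how the parallel engine implements buffering; the function names, the `buffer_size` parameter, and the doubling transform are hypothetical, and the sketch assumes no record is `None`.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream; records themselves must not be None


def upstream_stage(link, records):
    """Write records to the link; put() blocks only while the buffer is full."""
    for record in records:
        link.put(record)
    link.put(SENTINEL)


def downstream_stage(link, output):
    """Read records from the link until the end-of-stream marker arrives."""
    while True:
        record = link.get()
        if record is SENTINEL:
            break
        output.append(record * 2)


def run(records, buffer_size=4):
    """Connect the two stages with a bounded buffer and run them concurrently."""
    link = queue.Queue(maxsize=buffer_size)
    output = []
    producer = threading.Thread(target=upstream_stage, args=(link, records))
    consumer = threading.Thread(target=downstream_stage, args=(link, output))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return output


if __name__ == "__main__":
    print(run(list(range(10)), buffer_size=3))
```

The buffer lets the upstream stage keep writing for a while even when the downstream stage is momentarily not reading, which is the slack that buffering on a link provides.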

A: The collection library is a set of related operators that are concerned with collecting partitioned data.

A: The ordered collector reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the sorted order of an input data set that has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as well as the partitions themselves, are ordered.

A: The roundrobin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.

A: The sortmerge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.

A: The aggtorec restructure operator groups records that have the same key-field values into an output record.

A: The makesubrec restructure operator combines specified vector fields into a vector of subrecords.

A: The makevect restructure operator combines specified fields into a vector of fields of the same type.
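The three collection methods described earlier (ordered, roundrobin, and sortmerge) can be sketched on plain Python lists. The function names below are hypothetical, and the sortmerge sketch assumes that each partition is already sorted on the collecting key, so a merge is enough to produce the final order.

```python
import heapq
from itertools import chain, zip_longest


def ordered_collect(partitions):
    """Read all records from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def roundrobin_collect(partitions):
    """Read one record from each partition in turn, skipping exhausted partitions."""
    missing = object()
    collected = []
    for row in zip_longest(*partitions, fillvalue=missing):
        collected.extend(record for record in row if record is not missing)
    return collected


def sortmerge_collect(partitions, key):
    """Merge partitions by a collecting key (each partition assumed pre-sorted)."""
    return list(heapq.merge(*partitions, key=key))


partitions = [[1, 2, 3], [10, 20], [5, 6, 7, 8]]

if __name__ == "__main__":
    print(ordered_collect(partitions))
    print(roundrobin_collect(partitions))
    print(sortmerge_collect(partitions, key=lambda record: record))
```

With real records, `key` would pick out the collecting key fields, for example `lambda record: (record["region"], record["id"])`.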

Author: admin