Technologist Arena: Datastage

Monday, August 4, 2014

Some Design Tips for IBM InfoSphere DataStage

1) When you need to run the same set of jobs again and again, create a job sequence that contains all of them. Running the sequence runs every job in it, and you can arrange the jobs in whatever order your requirement calls for.

2) If a Copy or Filter stage sits immediately before or immediately after a Transformer stage, the extra stage reduces efficiency, because the Transformer can do the work of both the Copy stage and the Filter stage.

3) Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and sort indicator options.

4) Turn off runtime column propagation (RCP) wherever it is not required.

5) Use Modify, Filter, Aggregator, Column Generator and similar stages in place of a Transformer stage only when the anticipated volumes are high and performance becomes a problem. Otherwise use a Transformer; it is much easier to code than a Modify stage.

6) Avoid propagating unnecessary metadata between stages. Use a Modify stage to drop it; the Modify stage drops metadata only when told to explicitly with a DROP clause. You may try dropping such metadata in other stages, but that does not work, because the columns are still carried through to the end in the background. Only the Modify stage helps in dropping them.

7) Add reject files wherever rejected records need reprocessing or where you think considerable data loss may happen. Try to keep reject files at least on Sequential File stages and on stages that write to a database.

8) Use an ORDER BY clause when a database stage feeds a Join stage. The intention is to use the database's power for sorting instead of DataStage resources. Keep the Join partitioning as Auto. When you rely on ORDER BY, put a Sort stage between the database stage and the Join stage and set its "don't sort" option, so that DataStage knows the data already arrives sorted.

9) When doing outer joins, you can use dummy variables purely for null checking instead of fetching an explicit column from the table.

10) Data partitioning is a very important part of parallel job design. Choose the partitioning method that fits the given scenario.

11) Remember that the Modify stage drops metadata only when it is explicitly asked to do so with KEEP/DROP clauses.

12) Range lookup: a range lookup is equivalent to the BETWEEN operator. Looking up against a range of values was difficult to implement in previous DataStage versions. With this functionality in the Lookup stage, comparing a source column to a range of two lookup columns, or a lookup column to a range of two source columns, is easy to implement.

13) Use a Copy stage to dump out data to intermediate Peek stages or sequential debug files. Copy stages get removed at compile time, so they do not add overhead.

14) Where you use a Copy stage with a single input and a single output, make sure you set the Force property in the stage editor to True. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
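
To make the range lookup in tip 12 a little more concrete, here is a minimal Python sketch of the BETWEEN-style comparison it describes. This is not DataStage code; the grade bands, the column names and the `range_lookup` helper are all made up for illustration.

```python
from typing import Optional

# Hypothetical reference rows: each maps an inclusive score range to a grade.
GRADE_BANDS = [
    {"low": 0, "high": 49, "grade": "C"},
    {"low": 50, "high": 79, "grade": "B"},
    {"low": 80, "high": 100, "grade": "A"},
]


def range_lookup(value: int, bands: list) -> Optional[str]:
    """Return the grade whose [low, high] range contains value, else None."""
    for band in bands:
        # Same test as "value BETWEEN low AND high" (both ends inclusive).
        if band["low"] <= value <= band["high"]:
            return band["grade"]
    return None  # no matching range; the caller decides how to treat the miss


# Hypothetical source rows: one source column compared to two lookup columns.
source_rows = [{"id": 1, "score": 42}, {"id": 2, "score": 80}, {"id": 3, "score": 130}]
enriched = [{**row, "grade": range_lookup(row["score"], GRADE_BANDS)} for row in source_rows]
```

The third row falls outside every band, so its grade stays `None`; in a real job that is the case you would route to a reject link or fill with a default.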

Thursday, July 31, 2014

Understanding InfoSphere Metadata Workbench

What is Metadata Workbench?

Metadata Workbench is part of the Information Server suite of products, which all share a common repository and set of services. By default the Information Server repository is DB2, but Oracle and other database systems are also supported.

In order to work effectively with Metadata Workbench, we need to have an understanding of Information Server and the suite of products it hosts.

It is a tool for managing metadata, and it handles all three types of metadata –

Business – Business rules, Definitions, Terminology, Glossaries, Algorithms and Lineage using business language. Audience: Business users.
Technical – Defines source and target systems, their table and field structures and attributes, derivations and dependencies. Audience: specific tool users (BI, ETL, profiling, modeling).
Operational – Information about application runs: their frequency, record counts, component by component analysis and other statistics. Audience: Operations, Management and Business Users.

It is a product within the Information Server suite that manages metadata assets produced by the other products in the suite, including –

Mapping specifications created by FastTrack
ETL jobs built in and executed by DataStage
Data quality jobs built in and executed by DataStage
Business terms defined in Business Glossary
Data reports generated by Information Analyzer

It manages metadata assets linked to Information Server metadata assets, including –

Data resources (relational tables, data files, applications) accessed by DataStage ETL jobs and used for Information Analyzer reports
Data modeling tool documents
BI reports

It also enables us to understand the relationships between the different metadata assets through graphs and reports.

Metadata Workbench functionality –

Metadata Workbench supports three different categories of functionality. We can use it to gain information about the metadata within the Repository (Explore).

We can also use it to examine associations and dependencies between the metadata assets (Analyze).

In addition, there are capabilities for capturing additional metadata and then integrating it with other existing metadata (Capture).

The Capture functionality is available within the InfoSphere Metadata Asset Manager (IMAM) tool, which is an Information Server tool that works in conjunction with Metadata Workbench.

This tool is used to capture metadata that is consumed by Information Server applications such as DataStage.

Explore –

Explore metadata assets, including jobs, reports, databases, models, terms, stewards, systems, specifications, data quality rules
Easy navigation of assets
Simple and advanced search capabilities
Robust query builder
Integrated cross-view of Information Server and external linked assets
Graphical view of asset relationships

Analyze –

Analyze dependencies and relationships between metadata assets, including jobs, BI reports, and data models
Trace data lineage through DataStage jobs and to and from databases, jobs, and reports
Assess the impact of change across information assets
Graphical display of data lineage and impact analysis
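
As a rough picture of what lineage and impact analysis mean, the sketch below walks a small dependency graph in both directions. It is only an illustration in Python and does not use any Metadata Workbench API; the asset names and the `FEEDS` structure are hypothetical.

```python
from collections import deque

# Hypothetical "feeds" edges: each asset maps to the assets it writes into.
FEEDS = {
    "src.CUSTOMER": ["job.load_customer"],
    "job.load_customer": ["dw.DIM_CUSTOMER"],
    "dw.DIM_CUSTOMER": ["report.customer_churn", "job.build_mart"],
    "job.build_mart": ["mart.CUSTOMER_SUMMARY"],
}


def downstream(asset, edges):
    """Impact analysis: breadth-first list of assets affected if `asset` changes."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(asset, edges):
    """Lineage: reverse every edge, then reuse the same walk."""
    reverse = {}
    for src, targets in edges.items():
        for tgt in targets:
            reverse.setdefault(tgt, []).append(src)
    return downstream(asset, reverse)
```

Calling `downstream("dw.DIM_CUSTOMER", FEEDS)` answers "what breaks if this table changes", while `upstream("report.customer_churn", FEEDS)` answers "where does this report's data come from".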

Capture –

Capture information, relationships, and operational data to enhance information reports and analyses
Use Metadata Asset Manager to integrate external metadata assets with Information Server metadata assets
Extend data linkages to data resources and applications outside of Information Server
Enhance data lineage and impact analysis reports through user-defined linkages

Most of the metadata explored, analyzed, and managed within Metadata Workbench is metadata produced or consumed by Information Server products.

Exploring in Metadata Workbench –

We can use Metadata Workbench to explore all the metadata stored within the Information Server Repository. This includes metadata produced by Information Server applications such as DataStage, as well as metadata captured into the Repository to be consumed by Information Server applications.

All metadata assets are accessible to simple and advanced search capabilities and robust query capabilities. The asset information can be presented in several ways: reports, standard asset window, and a graphic window of information.

Asset information can be enhanced in a number of ways: assets can be linked to business terms, stewards, and labels. Notes can be added to explain and document the assets.

Explore metadata assets including jobs, reports, databases, files, tables, columns, terms, stewards, servers
Simple and advanced search capabilities
Robust query capabilities
Multiple ways to search by asset class, name, property
Save results in various supported formats of reports
View graphs of asset relationships
Create and edit descriptions of assets

Understanding InfoSphere Business Glossary

What is Business Glossary?

The business glossary is a formal contract between the producers and consumers of information across the enterprise.

It is intended to be the artifact or reference that allows anyone to determine the meaning, type and context of any term and, in particular, any business data element used in an initiative.

Business Glossary is a tool for authoring, managing, and sharing business metadata. It is aimed at business users and enables –

The creation & management of a controlled vocabulary
Collaborative authoring of business metadata

A reference for learning about the information assets of the enterprise –

Meaning
Dependencies
Usage
Quality
Ownership/Responsibility

Benefits of InfoSphere Business Glossary –

Business Glossary provides users with a web-based tool for creating and managing standard definitions of business and organization concepts by using a controlled vocabulary.

The tool divides metadata into categories, each of which contains terms. One can use terms to classify other objects in the metadata repository, based on the needs of the organization.

One can also designate users or groups as stewards for metadata objects. The result is a system that builds a common language between business and information technology.

Enables data governance

Common language supports compliance with regulations such as Basel II
Represent and expose business relationships and lineage
Track history of changes

Accountability and responsibility

Assign stewards as single point of contact

Improved productivity

Administrators can tailor the tool to the needs of their business users
Access enterprise information you need when you need it
Use and reuse information assets based on a common semantic hub

Increased collaboration

Capture and share annotations between team members
Greater understanding of the context of information
More prevalent use and reuse of trusted information

Why InfoSphere Business Glossary?

The business glossary organizes metadata into categories that contain terms. Terms can relate to the assets that are stored in the metadata repository or to external assets according to the standards and practices of the enterprise.

One can also designate specific users or user groups as stewards who are responsible for particular assets. Assets in this instance refer to instances of metadata within the metadata repository.
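
One way to picture the category, term, steward and asset relationships described above is as a small data model. The Python sketch below is an assumption for illustration only, not how Business Glossary stores anything; the class names, fields and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    definition: str
    steward: str = ""  # user or group responsible for the term
    assigned_assets: list = field(default_factory=list)  # e.g. table or column names


@dataclass
class Category:
    name: str
    terms: dict = field(default_factory=dict)  # term name -> Term

    def add_term(self, term):
        self.terms[term.name] = term

    def terms_for_asset(self, asset):
        """Names of terms in this category that classify the given asset."""
        return sorted(t.name for t in self.terms.values() if asset in t.assigned_assets)


# Hypothetical sample data.
sales = Category("Sales")
sales.add_term(Term("Customer", "A party that buys goods or services.",
                    steward="sales_data_team",
                    assigned_assets=["dw.DIM_CUSTOMER", "dw.FACT_ORDER.CUSTOMER_ID"]))
sales.add_term(Term("Order", "A confirmed request to purchase.",
                    steward="sales_data_team",
                    assigned_assets=["dw.FACT_ORDER"]))
```

With this shape, `sales.terms_for_asset("dw.DIM_CUSTOMER")` gives the business terms attached to a technical asset, and the `steward` field on each term names the single point of contact.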

Business Glossary is designed to also provide –

Linkage between business Terms and IT assets for understanding the contexts of IT and business
Designed to answer: Where are the connection points?
Assignment of Data Stewards to Terms and IT Assets
A customizable, publishable set of business Terms

Business Glossary is not designed to be –

A data modeling tool (Rational Architect)
An enterprise architecture hub for reuse of technical metadata by development applications (Information Server)
An enterprise metadata repository (XMETA and Metadata Workbench)

A common vocabulary gives diverse users a common understanding of business concepts, improving communication and efficiency.

For example: one department in an organization might use the word “customer,” a second department might use the word “user,” and a third department might use the word “client,” all to describe the same type of individual.

Business Glossary enables the enterprise to capture these terms, define their meaning, create relationships between them (in the example above, where all three terms have the same meaning, they would be synonyms) and consolidate terminology to achieve increased precision in communications.
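
The customer/user/client example can be sketched as a simple synonym table. This is a plain Python illustration under the assumption that one word is chosen as the preferred term; the `PREFERRED_TERM` mapping and both helper functions are hypothetical and not part of Business Glossary.

```python
# Hypothetical synonym table: each departmental word points at one preferred term.
PREFERRED_TERM = {
    "customer": "customer",
    "user": "customer",
    "client": "customer",
}


def normalize(word):
    """Return the preferred term for a word, or the cleaned word itself if unknown."""
    key = word.strip().lower()
    return PREFERRED_TERM.get(key, key)


def synonyms_of(word):
    """All words in the table that share the same preferred term."""
    target = normalize(word)
    return sorted(w for w, pref in PREFERRED_TERM.items() if pref == target)
```

Here `normalize(" Client ")` and `normalize("user")` both resolve to `"customer"`, which is the kind of consolidation of terminology the glossary is meant to support.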
