Java inefficiency prohibitive for Hadoop?

Thursday August 20th 2009, 11:59 pm
Filed under: tech, web, development, concurrency, scalability

Mass-scale computing: Why Hadoop is hot but Java is not | VentureBeat

First, Java hogs extra resources (cache, memory, CPU cycles), a fact that doesn't always show up well in single benchmark tests, but does show up clearly when multiple Java benchmarks compete for resources at the same time. This alone loses Java 15%, as tests I ran showed.

On top of that, on average, Java loses to C++ by about 15%, especially when apps can be compiled for mass-scale computing (use profile-guided compilation, etc). So you're down 30% to start with by implementing a Java Hadoop app on top of the Java-based Hadoop infrastructure. Now, you don't have to write your app in Java; you can use C++ or even script languages. See for example, why the Hypertable project chose C++. But unfortunately, the choice of Java for infrastructure and the bevy of available libraries is driving many people to use Java for Hadoop. Let's look at what that means financially as we scale out increasingly.



Free trial for SQL Azure until November

Thursday August 20th 2009, 6:45 pm
Filed under: tech, database, web, cloud

Microsoft Offers Two Database Previews: SQL Server & SQL Azure - ReadWriteEnterprise

SQL Azure is a relational cloud database along the lines of Amazon’s SimpleDB (when it comes to the business model). In other words, it’s keyed towards providing pay-as-you-go scalability at a minimal infrastructure cost.

This trial is a key development for the Azure platform. The free trial lasts until November, after which it’ll cost $9.99/month for 1GB, or $99.99 for 10GB.



Goodbye MapReduce, Hello Cascading

Monday August 17th 2009, 4:19 am
Filed under: tech, concurrency, cloud, scalability

a good post comparing Cascading and Pig.

Goodbye MapReduce, Hello Cascading | Engineering Rapleaf

The most recognizable competing product to Cascading is Pig, a Yahoo technology we also explored. Pig lets you specify batch queries in a neat SQL-like syntax, but we found Pig unusable due to the inability to plug in custom input and output formats. One of the nicest things about Cascading is that it doesn't restrict you in any way - anything you can do via vanilla MapReduce you can do via Cascading. We like the fact that Cascading flows are all specified via a Java API rather than a SQL-like language - this makes it very natural to create custom functions and very complex workflows. And if some part of your workflow is really performance-critical, Cascading gives you the flexibility to hand-code that part of the workflow with a MapReduce job and plug it in as a custom Flow.



getters/setters considered evil?

Tuesday August 11th 2009, 6:30 am
Filed under: tech, development

a good read.

Go Ahead: Next Generation Java Programming Style | Code Monkeyism

No setters. Many Java developers automatically - sometimes with the evil help of an IDE - write setters for all fields in their classes. You should not use setters. Think about each setter you want to write, are they really necessary for your fields? Better create new copies of your objects if values change. And try to write code without getters either. Tell, don't ask tells you more about the concept.
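
The excerpt is about Java, but the idea carries over to other languages. Here is a minimal sketch in Python, assuming a made-up `Account` class (not from the article): the fields are frozen, and a change produces a new copy instead of going through a setter.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Account:
    """Hypothetical immutable value object: fields are fixed once constructed."""
    owner: str
    balance: int

    def deposit(self, amount: int) -> "Account":
        # "Tell, don't ask": the object works out its own next state and
        # returns a new copy, so no setter (or getter) is needed by callers.
        if amount <= 0:
            raise ValueError("amount must be positive")
        return replace(self, balance=self.balance + amount)


before = Account(owner="alice", balance=100)
after = before.deposit(50)  # 'before' is left untouched
```

Assigning to `before.balance` directly raises `FrozenInstanceError`, which is roughly what leaving out the setter buys you in Java.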



Yahoo Extends YQL With Insert, Update, Delete

Wednesday August 05th 2009, 4:54 am
Filed under: tech, web, development, yahoo

must take YQL for a spin.

All the Web’s a Database: Yahoo Extends YQL With Insert, Update, Delete

While the earlier incarnations of YQL were mainly meant to read data, with the addition of these three new SQL verbs, the focus has now shifted towards writing data back to the net as well. Developers can now use YQL to write and modify data on web services and applications.

To explain how useful this can be, the Yahoo team used a few different examples. A developer can now easily use YQL to update a Twitter account (even authentication with OAuth is possible), for example, or add a new comment to a blog post, or insert any data into a remote database. Basically, developers can now use YQL to write data back to any web site that uses forms for data entry and to any API, including authenticated APIs.



Procrun: libraries and applications for Java applications on WIN32

Tuesday August 04th 2009, 9:08 pm
Filed under: tech, development

might look into this later.

Daemon - Daemon : Procrun

Procrun is a set of libraries and applications that make it much easier to run Java applications on WIN32.

Procrun service application

Prunsrv is a service application for running applications as services. It can convert any application to run as a service.

Procrun monitor application

Prunmgr is a GUI application for monitoring and configuring procrun services.



“Hoare logic” or “Why ‘Dijkstra’s definition of structure’ isn’t so crazy”

Monday August 03rd 2009, 3:04 pm
Filed under: tech, development

Edsger Dijkstra and his well-known opposition to the break construction. (via @debasishg)

The Universe of Discourse : Hoare logic

Now consider a more complex block, one of the form:

        if (q) { E; }
        else { F; }
Suppose you believe that code E, given precondition x, is guaranteed to produce postcondition y. And suppose you believe the same thing about F. Then you can conclude the same thing about the entire if-else block: if x was true before it began executing, then y will be true when it is done.[2] So you can build up proofs (or beliefs) about small bits of code into proofs (or beliefs) about larger ones.

We can understand while loops similarly. Suppose we know that condition p is true prior to the commencement of some loop, and that if p is true before G executes, then p will also be true when G finishes. Then what can we say about this loop?

        while (q) { G; }
We can conclude that if p was true before the loop began, then p will still be true, and q will be false, when the loop ends.

BUT BUT BUT BUT if your language has break, then that guarantee goes out the window and you can conclude nothing. Or at the very least your conclusions will become much more difficult. You can no longer treat G atomically; you have to understand its contents in detail.

So this is where Dijkstra is coming from: features like break[3] tend to sabotage the benefits of structured programming, and prevent the programmer from understanding the program as a composition of independent units.
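
The while-loop argument above can be checked mechanically. A rough sketch in Python, with function names and the summing example made up for illustration (they are not from the post): `sum_to` asserts the invariant p and the negated condition q when the loop ends, while `sum_until` shows how a `break` lets the loop end with q still true.

```python
from typing import Tuple


def sum_to(n: int) -> int:
    """Sum 0..n-1 with a plain while loop and a checked invariant."""
    i, total = 0, 0
    # Invariant p: total == 0 + 1 + ... + (i - 1) == i * (i - 1) // 2
    assert total == i * (i - 1) // 2
    while i < n:  # q: i < n
        total += i  # the body G preserves p
        i += 1
        assert total == i * (i - 1) // 2
    # The loop ended normally, so p holds and q is false.
    assert total == i * (i - 1) // 2 and not (i < n)
    return total


def sum_until(n: int, limit: int) -> Tuple[int, int]:
    """Same loop with a break: q may still be true after the loop."""
    i, total = 0, 0
    while i < n:
        if total + i > limit:
            break  # early exit, so "not q" is no longer guaranteed
        total += i
        i += 1
    return i, total
```

With `sum_until(5, 3)` the loop stops at `i == 3` while `i < 5` still holds, so the "q will be false" conclusion is gone and you have to read the loop body to know what state you are in.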



MongoRecord: An ActiveRecord pattern implementation for Mongo DB

Monday July 13th 2009, 12:13 pm
Filed under: tech, database, scalability

checking out the source code right now.

Ruby Language Center - MongoDB - 10gen Confluence

mongo-activerecord-ruby MongoRecord is an ActiveRecord pattern implementation for Mongo DB. The interface forgoes compatibility for RoR ActiveRecord and instead provides a clean interface with no reference to SQL or relational underpinnings.



Some Hive details

Thursday June 25th 2009, 9:24 am
Filed under: tech, database, web, cloud, scalability

These are great posts on Hive.

Perspectives - Hadoop Summit Notes #5 (final): HBase, Rapleaf, Hive, Autodesk, Computing in the Cloud, & Future Direction Panel

Data Warehousing using Hadoop
  • Hive is the Facebook data warehouse
  • Query language brings together SQL and streaming
  • Developers love direct access to map/reduce and streaming
  • Analysts love SQL
  • Hive QL (parser, planner, and execution engine)
  • Uses the Thrift API
  • Hive CLI implemented in Python
  • Query operators in initial versions: projections, equijoins, cogroups, groupby, & sampling
  • Supports views as well
  • Supports 40 users (about 25% of engineering team)
  • 200GB of compressed data per day
  • 3,514 jobs run over the last 7 days
  • 5 engineers on the project
  • Q: Why not use Pig? A: Wanted to support SQL and Python.

Hive + Hadoop + S3 + EC2 = It works! - Joydeep Sen Sarma's blog

The use case I had in mind was something like this:
  1. A user is storing files containing structured data in S3 - Amazon's store for bulk data. A very realistic use case could be a web-admin archiving Apache logs into S3 - or even transaction logs from some db
  2. Now this user wants to run SQL queries on these files - perhaps to do some historical analysis
  3. Amazon provides compute resources on demand via EC2 - ideally these SQL queries should use some allocated number of machines in EC2 to perform the required computations
  4. The results of SQL queries that are interesting for later use should be easily stored back in S3



Dumbo: Hadoop streaming made elegant and easy

Thursday June 25th 2009, 8:45 am
Filed under: search, tech, web, cloud, scalability

an alternative for Happy? must look into the details.

Skills Matter : Hadoop User Group UK:Dumbo: Hadoop streaming

Dumbo: Hadoop streaming made elegant and easy

At Last.fm, the number of “write once, run never again” Hadoop programs has been growing steadily, especially in the research team. Since Java is a very verbose and compiled programming language, it is not very suitable for writing such programs. A better way to quickly write MapReduce programs is provided by Hadoop Streaming, but it still is less convenient than it could be. Dumbo is a simple enhancement to Hadoop Streaming that addresses this issue. More specifically, it is a Python module that makes Hadoop Streaming elegant and easy.

Hadoop User Group - UK - Rich Marr's Tech Blog

Dumbo - Klaas Bosteels

Klaas has implemented Dumbo, a system that allows you to write disposable Hadoop streaming programs in Python. The aim was to reduce the amount of work involved writing one-off jobs. This seems like it could become part of the Hadoop toolset as it's certain to be useful to a lot of people.

Last.fm - the Blog - Python + Hadoop = Flying Circus Elephant

The approach described here is the most convenient way of writing Hadoop programs in Python that I could find on the web, but it still wasn't pythonic enough for my taste. The mapper and the reducer shouldn't have to reside in separate files, and having to write boilerplate code should be avoided as much as possible. To get rid of these issues, I wrote a simple Python module called Dumbo.
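
To make the "mapper and reducer in one file" point concrete, here is a minimal word-count sketch in plain Python. It does not import Dumbo or touch Hadoop: `run_locally` is a hypothetical in-memory stand-in for the map, shuffle and reduce steps, and the input lines are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(key: int, value: str) -> Iterator[Tuple[str, int]]:
    # Emit (word, 1) for every word in one input line.
    for word in value.split():
        yield word, 1


def reducer(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    # Sum the counts collected for one word.
    yield key, sum(values)


def run_locally(lines: List[str]) -> Dict[str, int]:
    """Hypothetical local driver: map every line, group by key, then reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            grouped[k].append(v)
    result: Dict[str, int] = {}
    for k in sorted(grouped):
        for out_key, out_value in reducer(k, grouped[k]):
            result[out_key] = out_value
    return result


counts = run_locally(["to be or not to be", "be quick"])
```

Both functions live in one module and there is no stdin/stdout parsing boilerplate, which is the convenience the excerpt is after; on a real cluster the driver part would be handled by the streaming layer rather than by a helper like this.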


    Copyright © turquoise, All Rights Reserved